Как проверить, что мои закладки работают?Я обнаружил, что, когда я запускаю работу сразу после предыдущего завершения, это, похоже, все еще занимает много времени.Это почему?Я думал, что он не будет читать файлы, которые он уже обработал?Сценарий выглядит следующим образом:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://xxx-glue/testing-csv"], "recurse": True}, format = "csv", format_options = {"withHeader": True}, transformation_ctx="inputGDF")
if bool(inputGDF.toDF().head(1)):
print("Writing ...")
inputGDF.toDF() \
.drop("createdat") \
.drop("updatedat") \
.write \
.mode("append") \
.partitionBy(["querydestinationplace", "querydatetime"]) \
.parquet("s3://xxx-glue/testing-parquet")
else:
print("Nothing to write ...")
job.commit()
import boto3
glue_client = boto3.client('glue', region_name='ap-southeast-1')
glue_client.start_crawler(Name='xxx-testing-partitioned')
Внешний вид выглядит следующим образом:
18/12/11 14:49:03 INFO Client: Application report for application_1544537674695_0001 (state: RUNNING)
18/12/11 14:49:03 DEBUG Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.2.72
ApplicationMaster RPC port: 0
queue: default
start time: 1544539297014
final status: UNDEFINED
tracking URL: http://ip-172-31-0-204.ap-southeast-1.compute.internal:20888/proxy/application_1544537674695_0001/
user: root
18/12/11 14:49:04 INFO Client: Application report for application_1544537674695_0001 (state: RUNNING)
18/12/11 14:49:04 DEBUG Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.2.72
ApplicationMaster RPC port: 0
queue: default
start time: 1544539297014
final status: UNDEFINED
tracking URL: http://ip-172-31-0-204.ap-southeast-1.compute.internal:20888/proxy/application_1544537674695_0001/
user: root
18/12/11 14:49:05 INFO Client: Application report for application_1544537674695_0001 (state: RUNNING)
18/12/11 14:49:05 DEBUG Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.2.72
ApplicationMaster RPC port: 0
queue: default
start time: 1544539297014
final status: UNDEFINED
tracking URL: http://ip-172-31-0-204.ap-southeast-1.compute.internal:20888/proxy/application_1544537674695_0001/
user: root
...
18/12/11 14:42:00 INFO NewHadoopRDD: Input split: s3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-15_2018-11-19.csv:0+1194081
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-14_2018-11-18.csv' for reading
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-15_2018-11-19.csv' for reading
18/12/11 14:42:00 INFO Executor: Finished task 89.0 in stage 0.0 (TID 89). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 92
18/12/11 14:42:00 INFO Executor: Running task 92.0 in stage 0.0 (TID 92)
18/12/11 14:42:00 INFO NewHadoopRDD: Input split: s3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-16_2018-11-20.csv:0+1137753
18/12/11 14:42:00 INFO Executor: Finished task 88.0 in stage 0.0 (TID 88). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 93
18/12/11 14:42:00 INFO Executor: Running task 93.0 in stage 0.0 (TID 93)
18/12/11 14:42:00 INFO NewHadoopRDD: Input split: s3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-17_2018-11-21.csv:0+1346626
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-16_2018-11-20.csv' for reading
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-17_2018-11-21.csv' for reading
18/12/11 14:42:00 INFO Executor: Finished task 90.0 in stage 0.0 (TID 90). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO Executor: Finished task 91.0 in stage 0.0 (TID 91). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 94
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 95
18/12/11 14:42:00 INFO Executor: Running task 95.0 in stage 0.0 (TID 95)
18/12/11 14:42:00 INFO Executor: Running task 94.0 in stage 0.0 (TID 94)
... Я замечаю, что к паркету добавлено много повторяющихся данных ... Является ли закладка нетза работой?Уже включен