AWS Glue ETL job fails with "Failed to delete key: parquet-output/_temporary"
0 votes
/ May 13, 2019

I am running a Glue ETL job on a CSV data table created by a Glue crawler. The crawler points to a directory with the following structure:

   s3
      -> aggregated-output
         -> datafile1.csv
         -> datafile2.csv
         -> datafile3.csv

These files are combined into the "aggregated-output" table, which can be queried successfully in Athena.

I am trying to convert this to Parquet with an AWS Glue ETL job. The job fails with the error:

"py4j.protocol.Py4JJavaError: An error occurred while calling 
o92.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: parquet-output/_temporary"

I am having trouble finding the root cause here.
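For context, here is a minimal sketch of the conversion script (the database and table names are placeholder assumptions; the write call is the same one shown in the traceback further down):

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import DropNullFields
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the crawled CSV table (database/table names assumed for illustration)
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="my_database", table_name="aggregated_output")
    dropnullfields3 = DropNullFields.apply(
        frame=datasource0, transformation_ctx="dropnullfields3")

    # This is the write that aborts with "Failed to delete key: parquet-output/_temporary"
    datasink4 = glueContext.write_dynamic_frame.from_options(
        frame=dropnullfields3,
        connection_type="s3",
        connection_options={"path": "s3://changed-s3-path-parquet-output/parquet-output"},
        format="parquet",
        transformation_ctx="datasink4")
    job.commit()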

I have tried modifying the Glue job in several ways. I made sure the IAM role assigned to the job has permission to delete folders in the bucket in question. Right now I am using the default temp/script folders provided by AWS. I have also tried folders in my own S3 bucket, but I see a similar error:

s3://aws-glue-temporary-256967298135-us-east-2/admin
s3://aws-glue-scripts-256967298135-us-east-2/admin/rt-5/13-ETL-Tosat
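As a sanity check (a hypothetical snippet; the bucket name is an assumption), one can verify outside of Glue that the role is allowed to issue the same batch delete that EMRFS performs on the _temporary prefix:

    import boto3

    s3 = boto3.client("s3")
    bucket = "changed-s3-path-parquet-output"  # assumed bucket name
    prefix = "parquet-output/_temporary/"

    # List a few leftover temporary objects under the output prefix
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=10)
    keys = [{"Key": obj["Key"]} for obj in resp.get("Contents", [])]

    if keys:
        # DeleteObjects is the batch call behind the MultiObjectDeleteException
        # in the trace; per-key errors are reported in the response, not raised.
        result = s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
        print(result.get("Errors", []))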

Sharing the full stack trace below:

19/05/10 17:57:59 INFO client.RMProxy: Connecting to ResourceManager at ip-172-32-29-36.us-east-2.compute.internal/172.32.29.36:8032
Container: container_1557510304861_0001_01_000002 on ip-172-32-1-101.us-east-2.compute.internal_8041
LogType:stdout
Log Upload Time:Fri May 10 17:57:53 +0000 2019
LogLength:0
Log Contents:
End of LogType:stdout
Container: container_1557510304861_0001_01_000001 on ip-172-32-26-232.us-east-2.compute.internal_8041
LogType:stdout
Log Upload Time:Fri May 10 17:57:54 +0000 2019
LogLength:9048
Log Contents:
null_fields []
Traceback (most recent call last):
File "script_2019-05-10-17-47-48.py", line 40, in <module>
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options =
{
    "path": "s3://changed-s3-path-parquet-output/parquet-output"
}
, format = "parquet", transformation_ctx = "datasink4")
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 585, in from_options
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/context.py", line 193, in write_dynamic_frame_from_options
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/context.py", line 216, in write_from_options
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 32, in write
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o92.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: parquet-output/_temporary
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:689)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:296)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:463)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortJob(FileOutputCommitter.java:482)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:156)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:212)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at com.amazonaws.services.glue.SparkSQLDataSink.writeDynamicFrame(DataSink.scala:123)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: 1 exceptions thrown from 1 batch deletes
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:375)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy56.deleteAll(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1369)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:687)
... 50 more
Caused by: java.io.IOException: MultiObjectDeleteException thrown with 26 keys in error: parquet-output/_temporary/0/_temporary/attempt_20190510164114_0000_m_000001_1/part-00001-7288591a-dd6f-404a-9c24-1dd70b3ff4ff-c000.snappy.parquet, parquet-output/_temporary/0/_temporary/attempt_20190510164315_0000_m_000001_2/part-00001-7288591a-dd6f-404a-9c24-1dd70b3ff4ff-c000.snappy.parquet, parquet-output/_temporary/0/_temporary_$folder$
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:360)
... 59 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: 071747B6992918B9), S3 Extended Request ID: oTBHx76MMI70zuD86AWeq4+rXa3GalW6ptQFM91ceQwhB9SBJV9Z6qw1yG2Ar2DBOr06soLL5dE=
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2084)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:25)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:11)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:80)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:125)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:355)
... 59 more

End of LogType:stdout

1 Answer

0 votes
/ May 13, 2019

Fixed this by creating a new IAM role with S3 permissions and the AWSGlueServiceRole policy attached.
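For anyone hitting the same thing, a rough sketch of that setup with boto3 (the role name is a placeholder; the two policy ARNs are the standard AWS-managed ones). Scoping the S3 policy down to just the buckets the job touches is preferable to full access:

    import json
    import boto3

    iam = boto3.client("iam")

    # Trust policy so the Glue service can assume the role
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole"}]}

    iam.create_role(RoleName="MyGlueETLRole",  # placeholder name
                    AssumeRolePolicyDocument=json.dumps(trust_policy))

    # Glue service permissions plus S3 access (narrow this in practice)
    iam.attach_role_policy(
        RoleName="MyGlueETLRole",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole")
    iam.attach_role_policy(
        RoleName="MyGlueETLRole",
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess")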
