gzFiles: reading csv.gz files from an S3 bucket in Spark
0 votes
/ 27 November 2018

I am trying to read Part-xxxx.csv.gz data files from an S3 bucket, and when I run the job from IntelliJ I can read them and write the output to an S3 bucket without any problem.

If I run the same program on EMR (as a JAR file), I get the error below.

Exception in thread "main" org.apache.spark.SparkException: Application application_1543327349114_0001 finished with failed status

It seems it cannot read the gz files on EMR. But if the input file is a plain csv, the data is read without any problem.

My code:

val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
  .load("s3a://test-system/Samplefile.csv")
df.createOrReplaceTempView("data")
val res = spark.sql("select count(*),id,geo_id from data group by id,geo_id")
res.coalesce(1).write.format("csv").option("header", "true").mode("Overwrite")
  .save("s3a://test-system/Output/Sampleoutput")

I am using Spark 2.3.0 and Hadoop 2.7.3.
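
To double-check what the cluster itself runs (as opposed to the versions my build pulls in), a quick check from the same SparkSession should be enough; these are just the standard Spark and Hadoop APIs:

// Print the Spark and Hadoop versions visible to the running job,
// to compare with the 2.3.0 / 2.7.3 combination used locally.
println(s"Spark version:  ${spark.version}")
println(s"Hadoop version: ${org.apache.hadoop.util.VersionInfo.getVersion}")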

Please help me with this: how do I read *.csv.gz files on EMR?
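
To narrow down whether the problem is gzip decompression, CSV parsing, or the S3 path itself, I could run a minimal read like this on the cluster (the path is again a placeholder):

// Minimal check: read one gzipped file as plain text and show a few lines.
// If this works, gzip handling on EMR is fine and the issue is elsewhere.
spark.read.textFile("s3a://test-system/Input/Part-00000.csv.gz").show(5, false)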

stderr Log:

18/11/28 07:41:22 INFO RMProxy: Connecting to ResourceManager at ip-172-30-3-95.ap-northeast-1.compute.internal/172.30.3.95:8032
18/11/28 07:41:23 INFO Client: Requesting a new application from cluster with 2 NodeManagers
18/11/28 07:41:23 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (106496 MB per container)
18/11/28 07:41:23 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
18/11/28 07:41:23 INFO Client: Setting up container launch context for our AM
18/11/28 07:41:23 INFO Client: Setting up the launch environment for our AM container
18/11/28 07:41:23 INFO Client: Preparing resources for our AM container
18/11/28 07:41:25 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/11/28 07:41:29 INFO Client: Uploading resource file:/mnt/tmp/spark-d10f886a-bf7b-4a0a-a91f-2f0353bb7b67/__spark_libs__1058363571489040863.zip -> hdfs://ip-172-30-3-95.ap-northeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1543390729790_0001/__spark_libs__1058363571489040863.zip
18/11/28 07:41:33 INFO Client: Uploading resource s3://test-system/SparkApps/jar/rxsicheck.jar -> hdfs://ip-172-30-3-95.ap-northeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1543390729790_0001/rxsicheck.jar
18/11/28 07:41:33 INFO S3NativeFileSystem: Opening 's3://test-system/SparkApps/jar/rxsicheck.jar' for reading
18/11/28 07:41:33 INFO Client: Uploading resource file:/mnt/tmp/spark-d10f886a-bf7b-4a0a-a91f-2f0353bb7b67/__spark_conf__1080415411630926230.zip -> hdfs://ip-172-30-3-95.ap-northeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1543390729790_0001/__spark_conf__.zip
18/11/28 07:41:33 INFO SecurityManager: Changing view acls to: hadoop
18/11/28 07:41:33 INFO SecurityManager: Changing modify acls to: hadoop
18/11/28 07:41:33 INFO SecurityManager: Changing view acls groups to: 
18/11/28 07:41:33 INFO SecurityManager: Changing modify acls groups to: 
18/11/28 07:41:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/11/28 07:41:33 INFO Client: Submitting application application_1543390729790_0001 to ResourceManager
18/11/28 07:41:33 INFO YarnClientImpl: Submitted application application_1543390729790_0001
18/11/28 07:41:34 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:34 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1543390893662
     final status: UNDEFINED
     tracking URL: http://ip-172-30-3-95.ap-northeast-1.compute.internal:20888/proxy/application_1543390729790_0001/
     user: hadoop
18/11/28 07:41:35 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:36 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:37 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:38 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:39 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:40 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:41 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:42 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:43 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:44 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:45 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:46 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:47 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:48 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:49 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:50 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:51 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:52 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:53 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:54 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:55 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:56 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:57 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:58 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:59 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:42:00 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:42:01 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:42:02 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:42:03 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:42:04 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:42:05 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:42:06 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:42:07 INFO Client: Application report for application_1543390729790_0001 (state: FAILED)
18/11/28 07:42:07 INFO Client: 
     client token: N/A
     diagnostics: Application application_1543390729790_0001 failed 2 times due to AM Container for appattempt_1543390729790_0001_000002 exited with  exitCode: 15
For more detailed output, check application tracking page:http://ip-172-30-3-95.ap-northeast-1.compute.internal:8088/cluster/app/application_1543390729790_0001Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1543390729790_0001_02_000001
Exit code: 15
Stack trace: ExitCodeException exitCode=15: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
    at org.apache.hadoop.util.Shell.run(Shell.java:479)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 15
Failing this attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1543390893662
     final status: FAILED
     tracking URL: http://ip-172-30-3-95.ap-northeast-1.compute.internal:8088/cluster/app/application_1543390729790_0001
     user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1543390729790_0001 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1122)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1168)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/11/28 07:42:07 INFO ShutdownHookManager: Shutdown hook called
18/11/28 07:42:07 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d10f886a-bf7b-4a0a-a91f-2f0353bb7b67
Command exiting with ret '1'
...