PySpark JVM cannot load package - PullRequest
0 votes
/ 04 April 2019

I want to use Apache Spline with PySpark. Apache Spline is a Scala library for tracking data lineage in Spark. The Spark JVM cannot find the package.

This is how I submit the job:

spark-submit --deploy-mode cluster --packages org.json4s:json4s-native_2.11:3.6.5,org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.1,za.co.absa.spline:spline-core:0.3.6,za.co.absa.spline:spline-persistence-mongo:0.3.6,za.co.absa.spline:spline-core-spark-adapter-2.2:0.3.6 s3://mybucket/myscript.py 

According to the logs, all dependencies are found and downloaded into /home/hadoop/.ivy2/jars.

I tried adding --driver-library-path /home/hadoop/.ivy2/jars, but that did not work.
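As far as I understand, --driver-library-path only sets java.library.path (native libraries), not the JVM classpath; jars are normally shipped with --jars instead. A hypothetical helper sketching such an invocation (build_submit_cmd is a name I made up; the jar directory is an assumption based on the Ivy path above):

```python
# Hypothetical helper: build a spark-submit command that ships the resolved
# Ivy jars via --jars (classpath) instead of --driver-library-path, which
# only affects java.library.path for native libraries.
import glob
import os

def build_submit_cmd(script, jar_dir="/home/hadoop/.ivy2/jars"):
    jars = sorted(glob.glob(os.path.join(jar_dir, "*.jar")))  # every resolved jar
    cmd = ["spark-submit", "--deploy-mode", "cluster"]
    if jars:
        cmd += ["--jars", ",".join(jars)]  # comma-separated list, as spark-submit expects
    cmd.append(script)
    return cmd
```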

Here is the script that calls the JVM:

from pyspark import SparkContext
from pyspark.sql import SparkSession

# Configure Kryo serialization and the Spline MongoDB persistence backend
SparkContext.setSystemProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
SparkContext.setSystemProperty("spline.persistence.factory", "za.co.absa.spline.persistence.mongo.MongoPersistenceFactory")
SparkContext.setSystemProperty("spline.mongodb.url", "mongodb://1.2.3.4")
SparkContext.setSystemProperty("spline.mongodb.name", "local")

sc = SparkContext()
spark = SparkSession(sc)
# Enable Spline lineage tracking on the underlying JVM session via py4j
sc._jvm.za.co.absa.spline.core.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)
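Before calling enableLineageTracking, one way I could check whether the Spline class is visible in the driver JVM at all is to load it explicitly through py4j. A hedged diagnostic sketch (spline_class_available is a name I made up, not part of the original job):

```python
# Hypothetical diagnostic: ask the driver JVM to load the Spline class
# explicitly, so a missing jar shows up as a clean False instead of a
# Py4JError raised from attribute lookup on sc._jvm.
def spline_class_available(jvm, cls="za.co.absa.spline.core.SparkLineageInitializer"):
    try:
        jvm.java.lang.Class.forName(cls)  # resolved via py4j reflection
        return True
    except Exception:
        return False

# In the job this would be called as: spline_class_available(sc._jvm)
```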

Error output:

Traceback (most recent call last):
  File "index_aggregation_job.py", line 36, in <module>
    sc._jvm.za.co.absa.spline.core.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1554297790159_0003/container_1554297790159_0003_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1363, in __getattr__
py4j.protocol.Py4JError: za.co.absa.spline.core.SparkLineageInitializer.enableLineageTracking does not exist in the JVM

Finally, here are some logs that may help you understand the problem:

Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.json4s#json4s-native_2.11 added as a dependency
org.apache.spark#spark-sql-kafka-0-10_2.11 added as a dependency
za.co.absa.spline#spline-core added as a dependency
za.co.absa.spline#spline-persistence-mongo added as a dependency
za.co.absa.spline#spline-core-spark-adapter-2.2 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.json4s#json4s-native_2.11;3.6.5 in central
    found org.json4s#json4s-core_2.11;3.6.5 in central
    found org.json4s#json4s-ast_2.11;3.6.5 in central
    found org.json4s#json4s-scalap_2.11;3.6.5 in central
    found com.thoughtworks.paranamer#paranamer;2.8 in central
    found org.apache.spark#spark-sql-kafka-0-10_2.11;2.2.1 in central
    found org.apache.kafka#kafka-clients;0.10.0.1 in central
    found net.jpountz.lz4#lz4;1.3.0 in central
    found org.xerial.snappy#snappy-java;1.1.2.6 in central
    found org.slf4j#slf4j-api;1.7.16 in central
    found org.apache.spark#spark-tags_2.11;2.2.1 in central
    found org.spark-project.spark#unused;1.0.0 in central
    found za.co.absa.spline#spline-core;0.3.6 in central
    found za.co.absa.spline#spline-core-spark-adapter-api;0.3.6 in central
    found za.co.absa.spline#spline-commons;0.3.6 in central
    found commons-configuration#commons-configuration;1.10 in central
    .
    .
    .   
    za.co.absa.spline#spline-core-spark-adapter-2.2;0.3.6 from central in [default]
    za.co.absa.spline#spline-core-spark-adapter-api;0.3.6 from central in [default]
    za.co.absa.spline#spline-model;0.3.6 from central in [default]
    za.co.absa.spline#spline-persistence-api;0.3.6 from central in [default]
    za.co.absa.spline#spline-persistence-mongo;0.3.6 from central in [default]
    :: evicted modules:
    org.slf4j#slf4j-api;1.7.16 by [org.slf4j#slf4j-api;1.7.25] in [default]
    org.apache.spark#spark-sql-kafka-0-10_2.11;${spark.version} by [org.apache.spark#spark-sql-kafka-0-10_2.11;2.2.1] in [default]
    org.json4s#json4s-native_2.11;${json4s.version} by [org.json4s#json4s-native_2.11;3.6.5] in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   44  |   2   |   2   |   3   ||   41  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 41 already retrieved (0kB/21ms)
19/04/03 15:51:41 INFO RMProxy: Connecting to ResourceManager at ip-1-2-3-4.eu-west-1.compute.internal/172.31.47.8:8032
19/04/03 15:51:41 INFO Client: Requesting a new application from cluster with 1 NodeManagers
19/04/03 15:51:41 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
19/04/03 15:51:41 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
19/04/03 15:51:41 INFO Client: Setting up container launch context for our AM
19/04/03 15:51:41 INFO Client: Setting up the launch environment for our AM container
19/04/03 15:51:41 INFO Client: Preparing resources for our AM container
19/04/03 15:51:42 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/04/03 15:51:44 INFO Client: Uploading resource file:/mnt/tmp/spark-96401806-4919-40aa-9e67-03281ae2f820/__spark_libs__4238720884036489921.zip -> hdfs://ip-1-2-3-4.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1554297790159_0003/__spark_libs__4238720884036489921.zip
19/04/03 15:51:45 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.json4s_json4s-native_2.11-3.6.5.jar -> hdfs://ip-1-2-3-4.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1554297790159_0003/org.json4s_json4s-native_2.11-3.6.5.jar
19/04/03 15:51:45 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.2.1.jar -> hdfs://ip-1-2-3-4.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1554297790159_0003/org.apache.spark_spark-sql-kafka-0-10_2.11-2.2.1.jar
.
.
.
19/04/03 15:51:51 INFO Client: Uploading resource s3://mybucket/myscript.py -> hdfs://ip-1-2-3-4.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1554297790159_0003/index_aggregation_job.py
19/04/03 15:51:51 INFO S3NativeFileSystem: Opening 's3://mybucket/myscript.py' for reading
19/04/03 15:51:51 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-1-2-3-4.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1554297790159_0003/pyspark.zip
19/04/03 15:51:52 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip -> hdfs://ip-1-2-3-4.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1554297790159_0003/py4j-0.10.4-src.zip
19/04/03 15:51:52 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.json4s_json4s-native_2.11-3.6.5.jar added multiple times to distributed cache.
19/04/03 15:51:52 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.2.1.jar added multiple times to distributed cache.
19/04/03 15:51:52 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4s_slf4s-api_2.11-1.7.25.jar added multiple times to distributed cache.
19/04/03 15:51:52 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/commons-lang_commons-lang-2.6.jar added multiple times to distributed cache.
19/04/03 15:51:52 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/commons-logging_commons-logging-1.1.1.jar added multiple times to distributed cache.
19/04/03 15:51:52 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.25.jar added multiple times to distributed cache.
19/04/03 15:51:52 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.github.salat_salat-util_2.11-1.11.2.jar added multiple times to distributed cache.
19/04/03 15:51:52 INFO Client: Uploading resource file:/mnt/tmp/spark-96401806-4919-40aa-9e67-03281ae2f820/__spark_conf__4569872418253512159.zip -> hdfs://ip-1-2-3-4.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1554297790159_0003/__spark_conf__.zip
19/04/03 15:51:52 INFO SecurityManager: Changing view acls to: hadoop
19/04/03 15:51:52 INFO SecurityManager: Changing modify acls to: hadoop
19/04/03 15:51:52 INFO SecurityManager: Changing view acls groups to: 
19/04/03 15:51:52 INFO SecurityManager: Changing modify acls groups to: 
19/04/03 15:51:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
19/04/03 15:51:52 INFO Client: Submitting application application_1554297790159_0003 to ResourceManager
19/04/03 15:51:52 INFO YarnClientImpl: Submitted application application_1554297790159_0003
19/04/03 15:51:53 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:51:53 INFO Client: 
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1554306712594
     final status: UNDEFINED
     tracking URL: http://ip-1-2-3-4.eu-west-1.compute.internal:20888/proxy/application_1554297790159_0003/
     user: hadoop
19/04/03 15:51:54 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:51:55 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:51:56 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:51:57 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:51:58 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:51:59 INFO Client: Application report for application_1554297790159_0003 (state: RUNNING)
19/04/03 15:51:59 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 172.31.38.225
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1554306712594
     final status: UNDEFINED
     tracking URL: http://ip-1-2-3-4.eu-west-1.compute.internal:20888/proxy/application_1554297790159_0003/
     user: hadoop
19/04/03 15:52:00 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:52:00 INFO Client: 
     client token: N/A
     diagnostics: [Wed Apr 03 15:51:59 +0000 2019] Application is Activated, waiting for resources to be assigned for AM.  Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:12288, vCores:8> ; Queue's Absolute capacity = 100.0 % ; Queue's Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; 
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1554306712594
     final status: UNDEFINED
     tracking URL: http://ip-1-2-3-4.eu-west-1.compute.internal:20888/proxy/application_1554297790159_0003/
     user: hadoop
19/04/03 15:52:01 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:52:02 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:52:03 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:52:04 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:52:05 INFO Client: Application report for application_1554297790159_0003 (state: ACCEPTED)
19/04/03 15:52:06 INFO Client: Application report for application_1554297790159_0003 (state: FINISHED)
19/04/03 15:52:06 INFO Client: 
     client token: N/A
     diagnostics: User application exited with status 1
     ApplicationMaster host: 172.31.38.225
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1554306712594
     final status: FAILED
     tracking URL: http://ip-1-2-3-4.eu-west-1.compute.internal:20888/proxy/application_1554297790159_0003/
     user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1554297790159_0003 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1122)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1168)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/04/03 15:52:06 INFO ShutdownHookManager: Shutdown hook called
19/04/03 15:52:06 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-96401806-4919-40aa-9e67-03281ae2f820
Command exiting with ret '1'
...