pySpark 1.6 cannot execute Java code through py4j, although the same code works in pySpark 2.0
0 votes
/ 16 January 2020

Can anyone suggest a possible workaround other than upgrading Spark? I was unable to find the root cause even with debug logging enabled on both the Spark and the Python side.

Steps to reproduce

GIVEN a Spark 1.6.3 installation: https://archive.apache.org/dist/spark/spark-1.6.3/spark-1.6.3-bin-hadoop2.6.tgz

WHEN pyspark is started as spark-1.6.3-bin-hadoop2.6/bin/pyspark --packages org.springframework:spring-core:2.5.6 AND some Java code is executed through py4j:

from py4j.java_gateway import java_import
java_import(sc._jvm, "org.springframework:spring-core:2.5.6")
print(sc._jvm.org.springframework.util.StringUtils)
print(sc._jvm.org.springframework.util.StringUtils.capitalize("azaza"))

THEN I get the error TypeError: 'JavaPackage' object is not callable
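Two notes on the snippet above. First, java_import expects a Java package or class name (e.g. "org.springframework.util.StringUtils" or "org.springframework.util.*"), not Maven coordinates; the fully qualified sc._jvm lookup works without it either way. Second, py4j returns a JavaPackage placeholder for any dotted path it cannot resolve to a class on the driver's classpath, so the problem only surfaces when the placeholder is called. A minimal sketch for probing this from the shell, assuming only the sc provided by pyspark:

from py4j.java_gateway import JavaClass

# sc._jvm yields a JavaClass when the name resolves on the driver
# classpath, and a JavaPackage placeholder when it does not.
string_utils = sc._jvm.org.springframework.util.StringUtils
if isinstance(string_utils, JavaClass):
    print(string_utils.capitalize("azaza"))
else:
    print("StringUtils is not on the driver classpath: %r" % string_utils)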

The same code works fine on the Spark 2.0.0 preview build: https://archive.apache.org/dist/spark/spark-2.0.0-preview/spark-2.0.0-preview-bin-hadoop2.6.tgz

Full failure log:

root@eb18eac5046f:/# spark-1.6.3-bin-hadoop2.6/bin/pyspark --packages org.springframework:spring-core:2.5.6
Python 2.7.17 (default, Nov  7 2019, 10:07:09) 
[GCC 7.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.springframework#spring-core added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found org.springframework#spring-core;2.5.6 in central
        found commons-logging#commons-logging;1.1.1 in central
:: resolution report :: resolve 134ms :: artifacts dl 3ms
        :: modules in use:
        commons-logging#commons-logging;1.1.1 from central in [default]
        org.springframework#spring-core;2.5.6 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 2 already retrieved (0kB/5ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/16 14:32:29 INFO SparkContext: Running Spark version 1.6.3
20/01/16 14:32:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/16 14:32:29 INFO SecurityManager: Changing view acls to: root
20/01/16 14:32:29 INFO SecurityManager: Changing modify acls to: root
20/01/16 14:32:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
20/01/16 14:32:29 INFO Utils: Successfully started service 'sparkDriver' on port 45239.
20/01/16 14:32:30 INFO Slf4jLogger: Slf4jLogger started
20/01/16 14:32:30 INFO Remoting: Starting remoting
20/01/16 14:32:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.17.0.2:35719]
20/01/16 14:32:30 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 35719.
20/01/16 14:32:30 INFO SparkEnv: Registering MapOutputTracker
20/01/16 14:32:30 INFO SparkEnv: Registering BlockManagerMaster
20/01/16 14:32:30 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-44d21f55-81c9-4589-9e4d-3642f8faa425
20/01/16 14:32:30 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
20/01/16 14:32:30 INFO SparkEnv: Registering OutputCommitCoordinator
20/01/16 14:32:30 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/01/16 14:32:30 INFO SparkUI: Started SparkUI at http://172.17.0.2:4040
20/01/16 14:32:30 INFO HttpFileServer: HTTP File server directory is /tmp/spark-811d62dc-5561-4702-af86-5689445501dd/httpd-03afe64f-fc5d-43f0-8735-dba8f027a0f2
20/01/16 14:32:30 INFO HttpServer: Starting HTTP Server
20/01/16 14:32:30 INFO Utils: Successfully started service 'HTTP file server' on port 45277.
20/01/16 14:32:30 INFO SparkContext: Added JAR file:/root/.ivy2/jars/org.springframework_spring-core-2.5.6.jar at http://172.17.0.2:45277/jars/org.springframework_spring-core-2.5.6.jar with timestamp 1579185150506
20/01/16 14:32:30 INFO SparkContext: Added JAR file:/root/.ivy2/jars/commons-logging_commons-logging-1.1.1.jar at http://172.17.0.2:45277/jars/commons-logging_commons-logging-1.1.1.jar with timestamp 1579185150507
20/01/16 14:32:30 INFO Utils: Copying /root/.ivy2/jars/org.springframework_spring-core-2.5.6.jar to /tmp/spark-811d62dc-5561-4702-af86-5689445501dd/userFiles-cbd86fc3-db4a-47aa-90cb-24007ac3090d/org.springframework_spring-core-2.5.6.jar
20/01/16 14:32:30 INFO SparkContext: Added file file:/root/.ivy2/jars/org.springframework_spring-core-2.5.6.jar at file:/root/.ivy2/jars/org.springframework_spring-core-2.5.6.jar with timestamp 1579185150577
20/01/16 14:32:30 INFO Utils: Copying /root/.ivy2/jars/commons-logging_commons-logging-1.1.1.jar to /tmp/spark-811d62dc-5561-4702-af86-5689445501dd/userFiles-cbd86fc3-db4a-47aa-90cb-24007ac3090d/commons-logging_commons-logging-1.1.1.jar
20/01/16 14:32:30 INFO SparkContext: Added file file:/root/.ivy2/jars/commons-logging_commons-logging-1.1.1.jar at file:/root/.ivy2/jars/commons-logging_commons-logging-1.1.1.jar with timestamp 1579185150584
20/01/16 14:32:30 INFO Executor: Starting executor ID driver on host localhost
20/01/16 14:32:30 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43963.
20/01/16 14:32:30 INFO NettyBlockTransferService: Server created on 43963
20/01/16 14:32:30 INFO BlockManagerMaster: Trying to register BlockManager
20/01/16 14:32:30 INFO BlockManagerMasterEndpoint: Registering block manager localhost:43963 with 511.1 MB RAM, BlockManagerId(driver, localhost, 43963)
20/01/16 14:32:30 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.3
      /_/

Using Python version 2.7.17 (default, Nov  7 2019 10:07:09)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from py4j.java_gateway import java_import
>>> java_import(sc._jvm, "org.springframework:spring-core:2.5.6")
>>> print(sc._jvm.org.springframework.util.StringUtils)
<py4j.java_gateway.JavaPackage object at 0x7efd49761990>
>>> print(sc._jvm.org.springframework.util.StringUtils.capitalize("azaza"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'JavaPackage' object is not callable

Full success log:

root@eb18eac5046f:/# spark-2.0.0-preview-bin-hadoop2.6/bin/pyspark  --packages org.springframework:spring-core:2.5.6
Python 2.7.17 (default, Nov  7 2019, 10:07:09) 
[GCC 7.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/spark-2.0.0-preview-bin-hadoop2.6/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.springframework#spring-core added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found org.springframework#spring-core;2.5.6 in central
        found commons-logging#commons-logging;1.1.1 in central
:: resolution report :: resolve 140ms :: artifacts dl 4ms
        :: modules in use:
        commons-logging#commons-logging;1.1.1 from central in [default]
        org.springframework#spring-core;2.5.6 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 2 already retrieved (0kB/6ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
20/01/16 14:41:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/16 14:41:13 WARN AbstractHandler: No Server set for org.spark_project.jetty.server.handler.ErrorHandler@736970d3
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-preview
      /_/

Using Python version 2.7.17 (default, Nov  7 2019 10:07:09)
SparkSession available as 'spark'.
>>> from py4j.java_gateway import java_import
>>> java_import(sc._jvm, "org.springframework:spring-core:2.5.6")
>>> print(sc._jvm.org.springframework.util.StringUtils)
<py4j.java_gateway.JavaClass object at 0x7f68d9f4fa50>
>>> print(sc._jvm.org.springframework.util.StringUtils.capitalize("azaza"))
Azaza
>>> 

1 Answer

1 vote
/ 21 January 2020

It appears that during startup pyspark initializes the SparkContext sc with a _jvm JVMView whose classpath does not include the external jars passed via the --jars or --packages arguments. You can work around this by exporting SPARK_CLASSPATH explicitly:

export SPARK_CLASSPATH=/some_folder/spring-core-2.5.6.jar

and then starting pyspark without any arguments:

spark-1.6.3-bin-hadoop2.6/bin/pyspark
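With the jar on SPARK_CLASSPATH the same lookup should resolve to a JavaClass rather than a JavaPackage; a quick sanity check in the new shell (the expected output below is an assumption, abbreviated):

>>> from py4j.java_gateway import JavaClass
>>> isinstance(sc._jvm.org.springframework.util.StringUtils, JavaClass)
True
>>> sc._jvm.org.springframework.util.StringUtils.capitalize("azaza")
u'Azaza'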

UPDATE: it seems that in pyspark 1.6 --jars and --packages add the external jars only on the worker nodes, so you also need to put them on the driver classpath, like this:

spark-1.6.3-bin-hadoop2.6/bin/pyspark --driver-class-path ~/Documents/spring-core-2.5.6.jar --jars ~/Documents/spring-core-2.5.6.jar
...
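Note that --driver-class-path is shorthand for the spark.driver.extraClassPath property, so the same workaround can also be written with --conf (an equivalent sketch, assuming the same jar location as above):

spark-1.6.3-bin-hadoop2.6/bin/pyspark \
  --conf spark.driver.extraClassPath=$HOME/Documents/spring-core-2.5.6.jar \
  --jars $HOME/Documents/spring-core-2.5.6.jar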