pyspark ImportError: cannot import name 'SparkContext'
2 votes / 26 October 2019

I am using the following command to run a pyspark script:

spark-submit \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 \
  pyspark_streaming/spark_streaming.py

The spark_streaming.py script is as follows:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

if __name__ == "__main__":
    sc = SparkContext('local[*]')
    ssc = StreamingContext(sc, 10)
    kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {'twitter':1})
    parsed = kafkaStream.map(lambda v: json.loads(v[1]))
    user_counts = parsed.map(lambda tweet: (tweet['user']["screen_name"], 1)).reduceByKey(lambda x,y: x + y)
    user_counts.pprint()
    ssc.start()
    ssc.awaitTermination()

This produces the following error:

ubuntu@ubuntu:/mnt/vmware_shared_folder$ spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 pyspark_streaming/spark_streaming.py
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
:: loading settings :: url = jar:file:/home/ubuntu/spark-2.4.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-kafka-0-8_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e1001739-51dd-43bc-83b7-52a7484328f1;1.0
    confs: [default]
    found org.apache.spark#spark-streaming-kafka-0-8_2.11;2.4.0 in central
    found org.apache.kafka#kafka_2.11;0.8.2.1 in local-m2-cache
    found org.scala-lang.modules#scala-xml_2.11;1.0.2 in local-m2-cache
    found com.yammer.metrics#metrics-core;2.2.0 in local-m2-cache
    found org.slf4j#slf4j-api;1.7.16 in local-m2-cache
    found org.scala-lang.modules#scala-parser-combinators_2.11;1.1.0 in central
    found com.101tec#zkclient;0.3 in local-m2-cache
    found log4j#log4j;1.2.17 in local-m2-cache
    found org.apache.kafka#kafka-clients;0.8.2.1 in local-m2-cache
    found net.jpountz.lz4#lz4;1.2.0 in local-m2-cache
    found org.xerial.snappy#snappy-java;1.1.7.1 in local-m2-cache
    found org.spark-project.spark#unused;1.0.0 in local-m2-cache
:: resolution report :: resolve 3367ms :: artifacts dl 142ms
    :: modules in use:
    com.101tec#zkclient;0.3 from local-m2-cache in [default]
    com.yammer.metrics#metrics-core;2.2.0 from local-m2-cache in [default]
    log4j#log4j;1.2.17 from local-m2-cache in [default]
    net.jpountz.lz4#lz4;1.2.0 from local-m2-cache in [default]
    org.apache.kafka#kafka-clients;0.8.2.1 from local-m2-cache in [default]
    org.apache.kafka#kafka_2.11;0.8.2.1 from local-m2-cache in [default]
    org.apache.spark#spark-streaming-kafka-0-8_2.11;2.4.0 from central in [default]
    org.scala-lang.modules#scala-parser-combinators_2.11;1.1.0 from central in [default]
    org.scala-lang.modules#scala-xml_2.11;1.0.2 from local-m2-cache in [default]
    org.slf4j#slf4j-api;1.7.16 from local-m2-cache in [default]
    org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default]
    org.xerial.snappy#snappy-java;1.1.7.1 from local-m2-cache in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   12  |   0   |   0   |   0   ||   12  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e1001739-51dd-43bc-83b7-52a7484328f1
    confs: [default]
    0 artifacts copied, 12 already retrieved (0kB/88ms)
2019-10-29 04:57:49 WARN  Utils:66 - Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.10.138 instead (on interface ens33)
2019-10-29 04:57:50 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
Traceback (most recent call last):
  File "/mnt/vmware_shared_folder/pyspark_streaming/spark_streaming.py", line 8, in <module>
    from pyspark import SparkContext
ImportError: cannot import name 'SparkContext' from 'pyspark' (/mnt/vmware_shared_folder/pyspark_streaming/pyspark.py)
2019-10-29 04:57:55 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-10-29 04:57:55 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-0ad03060-792a-42cb-81b3-62fa69d66463

I have tried the code from the script in the pyspark shell, and it works fine there.

I have added the following to my bashrc file:

export SPARK_HOME=/home/ubuntu/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=python3
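
For reference, the traceback shows pyspark being resolved from /mnt/vmware_shared_folder/pyspark_streaming/pyspark.py rather than from $SPARK_HOME/python. A minimal sketch to confirm which pyspark module the interpreter actually picks up (this snippet is only an illustration, not part of the project; run it with the same python3 that spark-submit uses):

import importlib

# Import pyspark the same way spark_streaming.py does and report where it came from.
pyspark = importlib.import_module("pyspark")

# A path under $SPARK_HOME/python means the real package is being used;
# a path like .../pyspark_streaming/pyspark.py means a local file shadows it.
print(pyspark.__file__)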

Any help would be appreciated.
