Я использую следующую команду для выполнения скрипта pyspark:
spark-submit \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 \
pyspark_streaming/spark_streaming.py
Сценарий spark_streaming.py
выглядит следующим образом:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
if __name__ == "__main__":
sc = SparkContext('local[*]')
ssc = StreamingContext(sc, 10)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {'twitter':1})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
user_counts = parsed.map(lambda tweet: (tweet['user']["screen_name"], 1)).reduceByKey(lambda x,y: x + y)
user_counts.pprint()
ssc.start()
ssc.awaitTermination()
Это выдает следующую ошибку:
ubuntu@ubuntu:/mnt/vmware_shared_folder$ spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 pyspark_streaming/spark_streaming.py
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
:: loading settings :: url = jar:file:/home/ubuntu/spark-2.4.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-kafka-0-8_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e1001739-51dd-43bc-83b7-52a7484328f1;1.0
confs: [default]
found org.apache.spark#spark-streaming-kafka-0-8_2.11;2.4.0 in central
found org.apache.kafka#kafka_2.11;0.8.2.1 in local-m2-cache
found org.scala-lang.modules#scala-xml_2.11;1.0.2 in local-m2-cache
found com.yammer.metrics#metrics-core;2.2.0 in local-m2-cache
found org.slf4j#slf4j-api;1.7.16 in local-m2-cache
found org.scala-lang.modules#scala-parser-combinators_2.11;1.1.0 in central
found com.101tec#zkclient;0.3 in local-m2-cache
found log4j#log4j;1.2.17 in local-m2-cache
found org.apache.kafka#kafka-clients;0.8.2.1 in local-m2-cache
found net.jpountz.lz4#lz4;1.2.0 in local-m2-cache
found org.xerial.snappy#snappy-java;1.1.7.1 in local-m2-cache
found org.spark-project.spark#unused;1.0.0 in local-m2-cache
:: resolution report :: resolve 3367ms :: artifacts dl 142ms
:: modules in use:
com.101tec#zkclient;0.3 from local-m2-cache in [default]
com.yammer.metrics#metrics-core;2.2.0 from local-m2-cache in [default]
log4j#log4j;1.2.17 from local-m2-cache in [default]
net.jpountz.lz4#lz4;1.2.0 from local-m2-cache in [default]
org.apache.kafka#kafka-clients;0.8.2.1 from local-m2-cache in [default]
org.apache.kafka#kafka_2.11;0.8.2.1 from local-m2-cache in [default]
org.apache.spark#spark-streaming-kafka-0-8_2.11;2.4.0 from central in [default]
org.scala-lang.modules#scala-parser-combinators_2.11;1.1.0 from central in [default]
org.scala-lang.modules#scala-xml_2.11;1.0.2 from local-m2-cache in [default]
org.slf4j#slf4j-api;1.7.16 from local-m2-cache in [default]
org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default]
org.xerial.snappy#snappy-java;1.1.7.1 from local-m2-cache in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 12 | 0 | 0 | 0 || 12 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e1001739-51dd-43bc-83b7-52a7484328f1
confs: [default]
0 artifacts copied, 12 already retrieved (0kB/88ms)
2019-10-29 04:57:49 WARN Utils:66 - Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.10.138 instead (on interface ens33)
2019-10-29 04:57:50 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
Traceback (most recent call last):
File "/mnt/vmware_shared_folder/pyspark_streaming/spark_streaming.py", line 8, in <module>
from pyspark import SparkContext
ImportError: cannot import name 'SparkContext' from 'pyspark' (/mnt/vmware_shared_folder/pyspark_streaming/pyspark.py)
2019-10-29 04:57:55 INFO ShutdownHookManager:54 - Shutdown hook called
2019-10-29 04:57:55 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-0ad03060-792a-42cb-81b3-62fa69d66463
Я пробовал код из скрипта в оболочке pyspark, и он работает нормально.
Я добавил следующее в файл bashrc
export SPARK_HOME=/home/ubuntu/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=python3
Любойбыла бы признательна за помощь