I am using Spark through pyspark. I am running the following toy example (in a Jupyter Notebook):
import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
num_samples = 10000
def inside(p):
    # Sample a random point in the unit square and test whether it
    # falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
This runs fine with num_samples = 100 or similar values, but with the number above it fails with an error concerning the Python workers:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 1 times, most recent failure: Lost task 2.0 in stage 0.0 (TID 2, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
[...]
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
[...]
Caused by: java.net.SocketTimeoutException: Accept timed out
[...]
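For reference, here is the same Monte Carlo estimate written without Spark (a minimal sketch just to show the sampling logic on its own; the helper name estimate_pi is mine and not part of the notebook above):

import random

def estimate_pi(num_samples):
    # Count points of the unit square that land inside the quarter circle
    # of radius 1; that region covers pi/4 of the square, hence the factor 4.
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y < 1:
            inside += 1
    return 4 * inside / num_samples

print(estimate_pi(10000))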