Pyspark: java.lang.OutOfMemoryError: Java heap space when saving a dataframe to parquet / csv
Asked 28 January 2019

I am using pyspark 2.3 in a Jupyter notebook on a Lenovo PC (Windows 10, 48 GB of RAM). I tried to save a DataFrame in parquet or CSV format, but I got this error:
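For reference, the write call looks roughly like this (the DataFrame name df and the output path are placeholders, not the real names from my notebook):

# Roughly what I run before the error appears; `df` and the paths are placeholders
df.write.mode("overwrite").parquet("F:/output/result_parquet")
# or the CSV variant (this is the call behind the o394.csv error below)
df.write.mode("overwrite").option("header", "true").csv("F:/output/result_csv")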

Py4JJavaError: An error occurred while calling o394.csv.
...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 4.0 failed 1 times, most recent failure: Lost task 6.0 in stage 4.0 (TID 17, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
...
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
...
Caused by: java.lang.OutOfMemoryError: Java heap space
...

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 50525)
Traceback (most recent call last):
  File "C:\Users\****\Anaconda3\lib\socketserver.py", line 317, in _handle_request_noblock
    self.process_request(request, client_address)
  File "C:\Users\****\Anaconda3\lib\socketserver.py", line 348, in process_request
    self.finish_request(request, client_address)
  File "C:\Users\****\Anaconda3\lib\socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "C:\Users\****\Anaconda3\lib\socketserver.py", line 696, in __init__
    self.handle()
  File "F:\spark\spark\python\pyspark\accumulators.py", line 235, in handle
    num_updates = read_int(self.rfile)
  File "F:\spark\spark\python\pyspark\serializers.py", line 683, in read_int
    length = stream.read(4)
  File "C:\Users\****\Anaconda3\lib\socket.py", line 586, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Following one of the recommendations from other questions posted here, I created the SparkSession as follows:

from pyspark.sql import SparkSession

spark = SparkSession \
  .builder \
  .master("local[*]") \
  .appName("Python Spark SQL basic example") \
  .config("spark.memory.fraction", 0.8) \
  .config("spark.executor.memory", "16g") \
  .config("spark.driver.memory", "16g") \
  .config("spark.sql.shuffle.partitions", "800") \
  .config("spark.memory.offHeap.enabled", "true") \
  .config("spark.memory.offHeap.size", "16g") \
  .getOrCreate()
...
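Since the notebook's JVM is already running by the time getOrCreate() is called, I am not sure the spark.driver.memory setting above actually takes effect. A minimal sketch of passing it before the JVM starts instead, in case that is part of the problem (the 16g value is only an example, not what I finally settled on):

import os

# Sketch: set the driver heap before the JVM is launched; spark.driver.memory
# passed through builder.config() in a notebook may be applied too late.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 16g pyspark-shell"

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()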