Converting a DataFrame to CSV raises an error in pyspark
26 May 2020

I have a huge DataFrame, about 7 GB of records. I am trying to get the row count and to save the data as a CSV. Both operations fail with the error below. Is there another way to write the DataFrame out without ending up with multiple partitions?

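# Count the rows (this action already fails with the error below)
print(df.count())
# Collapse to one partition and write CSV with a header; Spark creates a directory at this path
df.coalesce(1).write.option("header", "true").csv('/user/ABC/Output.csv')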



Error:
java.io.IOException: Stream is corrupted
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
    at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
    at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
20/05/26 18:15:44 ERROR scheduler.TaskSetManager: Task 8 in stage 360.0 failed 1 times; aborting job
[Stage 360:=======>                                                (8 + 1) / 60]
Py4JJavaError: An error occurred while calling o18867.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 360.0 failed 1 times, most recent failure: Lost task 8.0 in stage 360.0 (TID 13986, localhost, executor driver): java.io.IOException: Stream is corrupted
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
    at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
    at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
...
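
On the single-file part of the question, one possible approach is to drop `coalesce(1)`, let Spark write one `part-*.csv` file per partition, and merge those files afterwards. The sketch below is only an illustration under assumptions: the part files are reachable on a local filesystem, every part file starts with the same one-line header (as with `option("header", "true")`), and every file ends with a newline. The function name `merge_csv_parts` and the paths are hypothetical. It does not address the `Stream is corrupted` exception itself, which already occurs on `count()`.

```python
import glob
import os
import shutil


def merge_csv_parts(parts_dir, dest_path):
    """Concatenate part-*.csv files from parts_dir into dest_path, keeping one header.

    Assumes each part file begins with the same single-line header.
    Returns the number of part files merged.
    """
    part_files = sorted(glob.glob(os.path.join(parts_dir, "part-*.csv")))
    if not part_files:
        raise FileNotFoundError("no part-*.csv files found in " + parts_dir)

    with open(dest_path, "wb") as dest:
        for index, part in enumerate(part_files):
            with open(part, "rb") as src:
                header = src.readline()
                # Keep the header from the first file only.
                if index == 0:
                    dest.write(header)
                # Stream the remaining rows without loading the file into memory.
                shutil.copyfileobj(src, dest)
    return len(part_files)
```

A call would look like `merge_csv_parts("/path/to/output_dir", "/path/to/merged.csv")`, where both paths are placeholders. The files are copied as a stream, so the 7 GB of data never has to fit in memory at once.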