Фрейм данных Pyspark некорректно сгруппирован - PullRequest
0 голосов
/ 09 октября 2019

Приложение успешно работает в течение нескольких месяцев. Недавно он начал давать сбой, потому что метод, который принимает DataFrame pyspark, группирует его и выводит ответ, каким-то образом выводит поврежденный фрейм данных.

Вот пример кода того, что я делаю, когда все терпит неудачу:

from pyspark.sql.functions import sum, avg
group_by = pyspark_df_in.groupBy("dimension1", "dimension2", "dimension3")
pyspark_df_out = group_by.agg(sum("metric1").alias("MyMetric1"), sum("metric2").alias("MyMetric2")

Если я печатаю print(pyspark_df_in.head(1)), я правильно получаю первую строку моего набора данных. После группировки по нескольким измерениям и печати print(pyspark_df_out.head(2)) я получаю следующую ошибку. Я получаю аналогичную ошибку, пытаясь сделать что-нибудь в основном с этим новым сгруппированным набором данных (и я знаю, что группировка по должна приводить к данным, как я их проверял, чтобы убедиться).

19/10/08 15:13:00 WARN TaskSetManager: Stage 9 contains a task of very large size (282 KB). The maximum recommended task size is 100 KB.
19/10/08 15:13:00 ERROR Executor: Exception in task 1.0 in stage 9.0 (TID 40)
java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:862)
    at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:167)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.<init>(ArrowConverters.scala:144)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:203)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:201)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
19/10/08 15:13:00 ERROR Executor: Exception in task 2.0 in stage 9.0 (TID 41)
java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:862)
    at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:167)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.<init>(ArrowConverters.scala:144)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:203)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:201)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 1.0 in stage 9.0 (TID 40, localhost, executor driver): java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:862)
    at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:167)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.<init>(ArrowConverters.scala:144)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:203)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:201)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

19/10/08 15:13:00 ERROR TaskSetManager: Task 1 in stage 9.0 failed 1 times; aborting job
19/10/08 15:13:00 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 39, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 3.0 in stage 9.0 (TID 42, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 10.0 in stage 9.0 (TID 49, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 7.0 in stage 9.0 (TID 46, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 6.0 in stage 9.0 (TID 45, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 11.0 in stage 9.0 (TID 50, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 9.0 in stage 9.0 (TID 48, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 8.0 in stage 9.0 (TID 47, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 4.0 in stage 9.0 (TID 43, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 5.0 in stage 9.0 (TID 44, localhost, executor driver): TaskKilled (Stage cancelled)

Информация о моей среде:

  • Версия Spark Context = 2.4.3
  • Версия Python = 3.7
  • ОС = Linux CentOS7

Кто-нибудь сталкивался с этой проблемой или есть какие-либо идеи, как я могу отладить / исправить ее?

...