Spark: writing parquet with partitionBy is very slow - PullRequest
0 votes
/ January 31, 2020

When writing to parquet using partitionBy, the job takes much longer. Analyzing the logs, I found that Spark lists the files in the directory, and while listing the files I observe the behavior below: the job appears idle for more than an hour and then starts up again.
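The write looks roughly like this (the dataframe name, partition columns, and output path are placeholders, not my actual job):

// Rough shape of the slow write: partitioned parquet output
// written directly, without repartitioning the dataframe first.
df.write
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .parquet("hdfs:///output/path")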

20/01/30 07:33:09 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/01/30 07:33:09 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
20/01/30 07:33:09 INFO Executor: Finished task 195.0 in stage 241.0 (TID 15820). 18200 bytes result sent to driver
20/01/30 07:33:09 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/01/30 07:33:09 INFO Executor: Finished task 198.0 in stage 241.0 (TID 15823). 18200 bytes result sent to driver
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedDirectMemory, value=50331648
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedHeapMemory, value=50331648
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedDirectMemory, value=50331648
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedHeapMemory, value=50331648

and again

20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.compilationTime, count=484, min=2, max=622, mean=16.558694661661132, stddev=13.859676272407238, median=12.0, p75=20.0, p95=47.0, p98=62.0, p99=64.0, p999=70.0
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.generatedClassSize, count=990, min=546, max=97043, mean=2058.574386565769, stddev=2153.50835266105, median=1374.0, p75=2693.0, p95=5009.0, p98=11509.0, p99=11519.0, p999=11519.0
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.generatedMethodSize, count=4854, min=1, max=1574, mean=95.19245880884911, stddev=158.289763457333, median=39.0, p75=142.0, p95=339.0, p98=618.0, p99=873.0, p999=1234.0
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.sourceCodeSize, count=484, min=430, max=467509, mean=4743.632894656119, stddev=5893.941708479697, median=2346.0, p75=4946.0, p95=24887.0, p98=24890.0, p99=24890.0, p999=24890.0
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedDirectMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedHeapMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedDirectMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedHeapMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.executor.filesystem.file.largeRead_ops, value=0
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.executor.filesystem.file.read_bytes, value=0

and again

20/01/30 08:55:28 INFO TaskMemoryManager: Memory used in task 15249
20/01/30 08:55:28 INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@3cadc5a3: 65.0 MB
20/01/30 08:55:28 INFO TaskMemoryManager: Acquired by HybridRowQueue(org.apache.spark.memory.TaskMemoryManager@7c64db53,/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1577238363313_38955/spark-487c8d3d-391c-47b3-9a1b-d816d9505f5c,11,org.apache.spark.serializer.SerializerManager@55a990cc): 4.2 GB
20/01/30 08:55:28 INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@785b4080: 65.0 MB
20/01/30 08:55:28 INFO TaskMemoryManager: 0 bytes of memory were used by task 15249 but are not associated with specific consumers
20/01/30 08:55:28 INFO TaskMemoryManager: 4643196305 bytes of memory are used for execution and 608596591 bytes of memory are used for storage
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedDirectMemory, value=50331648
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedHeapMemory, value=50331648
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedDirectMemory, value=50331648
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedHeapMemory, value=50331648

The job now takes more than 3 hours to complete. Are there any ways to improve performance?

1 Answer

0 votes
/ January 31, 2020

I noticed the same behavior when writing a dataframe to HDFS using the partitionBy method. Later I found that I should apply in-memory partitioning before disk partitioning.

So, first repartition your dataframe on the same columns you want to use in partitionBy, as shown below:

// Align the in-memory partitions with the on-disk layout before writing.
val df2 = df1.repartition($"year", $"month", $"day")
df2.write.mode("overwrite").partitionBy("year", "month", "day").save("path to hdfs")
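This helps because repartitioning on the partition columns shuffles all rows for a given (year, month, day) combination into the same in-memory partition, so each task writes one file per output directory instead of every task writing a small file into every directory. Far fewer files also means the listing/commit phase after the write has much less to enumerate.

If you also need to bound parallelism, a variant sketch (the partition count here is illustrative) keeps rows clustered by the same columns:

// Shuffle into at most 200 partitions while still clustering rows
// by the columns used for partitionBy (200 is an example value).
val df3 = df1.repartition(200, $"year", $"month", $"day")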
...