Spark collect operation on a large DataFrame breaks EMR nodes
June 16, 2020

I am running Spark on a large DataFrame (800 GB) on EMR (1 master + 8 spot instances). When I run `count`, the console starts printing errors like the ones shown below, and 3 of the 8 nodes become "unhealthy" in the Hadoop console. The reported error is: 1/4 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers. How can I avoid this error, and how can I recover the broken nodes?

scala> df1.count
[Stage 13:===================>                                (312 + 540) / 852]20/06/16 06:38:14 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_46_0 !
[Stage 13:===================>                                (314 + 538) / 852]20/06/16 06:38:14 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_46_5 !
20/06/16 06:38:14 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_46_4 !
20/06/16 06:38:14 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_46_11 !
20/06/16 06:38:14 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_46_10 !
...
[Stage 13:===================>                                (315 + 537) / 852]20/06/16 06:38:15 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_46_66 !
20/06/16 06:38:15 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_46_67 !
20/06/16 06:38:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 228 for reason Container marked as failed: container_1592285107819_0001_01_000243 on host: ip-10-10-3-62.i.xxxx.com. Exit status: -100. Diagnostics: Container released on a *lost* node.
20/06/16 06:38:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 292 for reason Container marked as failed: container_1592285107819_0001_01_000307 on host: ip-10-10-3-62.i.xxxx.com. Exit status: -100. Diagnostics: Container released on a *lost* node.
...
20/06/16 06:38:15 ERROR YarnScheduler: Lost executor 228 on ip-10-10-3-62.i.xxxx.com: Container marked as failed: container_1592285107819_0001_01_000243 on host: ip-10-10-3-62.i.xxx.com. Exit status: -100. Diagnostics: Container released on a *lost* node.
....
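
The `rdd_46_*` block IDs in the warnings above have the form `rdd_<rddId>_<partitionIndex>`, which is how Spark names the blocks of a persisted RDD. As a minimal diagnostic sketch (not a fix), something like the following could be pasted into the same spark-shell session to list what is currently persisted and how much of it sits in memory versus on local disk. The names `session` and `ctx` are introduced here only for illustration, and the sketch assumes a Spark 2.x or later session is already active; `getRDDStorageInfo` is marked as a developer API.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the session that spark-shell already created (assumption: one is active).
val session = SparkSession.builder().getOrCreate()
val ctx = session.sparkContext

// Every RDD currently marked as persisted, keyed by RDD id.
ctx.getPersistentRDDs.toSeq.sortBy(_._1).foreach { case (id, rdd) =>
  println(s"rdd_$id storageLevel=${rdd.getStorageLevel.description}")
}

// Per-RDD cache statistics (developer API): cached partitions, memory and disk usage.
ctx.getRDDStorageInfo.foreach { info =>
  println(s"rdd_${info.id}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memSize=${info.memSize} B, diskSize=${info.diskSize} B")
}
```

If the entry with id 46 reports a large `diskSize`, cached partitions are being written to the executors' local directories, which on YARN are the NodeManager local-dirs. Whether that is what triggers the `local-dirs are bad` message here is a guess, not something the logs confirm.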

Hadoop console: [image]

...