YARN + SPARK: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

I have a small cluster of 6 datanodes, and Spark jobs fail completely when they run.

The error:

ERROR [SparkListenerBus][driver][] [org.apache.spark.scheduler.LiveListenerBus] Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[42.3.44.157:50010,DS-87cdbf42-3995-4313-8fab-2bf6877695f6,DISK], DatanodeInfoWithStorage[42.3.44.154:50010,DS-60eb1276-11cc-4cb8-a844-f7f722de0e15,DISK]], original=[DatanodeInfoWithStorage[42.3.44.157:50010,DS-87cdbf42-3995-4313-8fab-2bf6877695f6,DISK], DatanodeInfoWithStorage[42.3.44.154:50010,DS-60eb1276-11cc-4cb8-a844-f7f722de0e15,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)
---T08:18:07.007 ERROR [SparkListenerBus][driver][] [STATISTICS] [onQueryTerminated] queryId:

I found the following workaround, which sets these values in the HDFS configuration:

dfs.client.block.write.replace-datanode-on-failure.enable=true
dfs.client.block.write.replace-datanode-on-failure.policy=NEVER

The two properties dfs.client.block.write.replace-datanode-on-failure.policy and dfs.client.block.write.replace-datanode-on-failure.enable affect the client-side behaviour of pipeline recovery, and they can be added as custom properties to the 'hdfs-site' configuration.
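
For illustration, a minimal sketch of what the custom hdfs-site.xml entries for this workaround could look like (property names are taken from the HDFS documentation quoted below; the values shown are the workaround values, not the defaults):

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>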

Is setting both of these parameters a good solution?

dfs.client.block.write.replace-datanode-on-failure.enable (default: true)
    If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy.

dfs.client.block.write.replace-datanode-on-failure.policy (default: DEFAULT)
    This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. ALWAYS: always add a new datanode when an existing datanode is removed. NEVER: never add a new datanode. DEFAULT: let r be the replication number and n the number of existing datanodes. Add a new datanode only if r is greater than or equal to 3 and either (1) floor(r/2) is greater than or equal to n, or (2) r is greater than n and the block is hflushed/appended.
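
Applied to this error (assuming the default replication factor r = 3): the pipeline is down to n = 2 datanodes (42.3.44.157 and 42.3.44.154), floor(3/2) = 1 is less than 2, but r = 3 > n = 2 and the Spark event log is an appended/hflushed file, so the DEFAULT policy requires a replacement datanode. The write then fails because no additional healthy datanode can be found to add, which is why switching the policy to NEVER (or disabling the feature) avoids the error, at the cost of continuing the write with fewer replicas.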
...