In my Hadoop cluster I have:
1 active NameNode
1 standby NameNode
3 JournalNodes
4 DataNodes
According to my analysis, the active NameNode went down because it could not write edits to a majority of the JournalNodes. The standby NameNode did not take over after the active NameNode failed because passwordless access was not enabled on the NameNodes.
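To make the "majority of the JournalNodes" condition concrete, here is a minimal sketch of the quorum arithmetic, assuming a simple majority rule (more than half of the JournalNodes must acknowledge a write). The function names are hypothetical and are not part of Hadoop.

```python
def write_quorum(num_journalnodes: int) -> int:
    """Smallest number of JournalNodes that forms a strict majority."""
    return num_journalnodes // 2 + 1


def can_commit(acks: int, num_journalnodes: int) -> bool:
    """True if enough JournalNodes acknowledged the write."""
    return acks >= write_quorum(num_journalnodes)


# With 3 JournalNodes the quorum is 2, so a single acknowledgement
# (as in "Succeeded so far: [JOURNALNODE_IP:8485]") is not enough.
print(write_quorum(3))
print(can_commit(1, 3))
```

With 3 JournalNodes, the cluster can tolerate one unreachable JournalNode but not two.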
Active NameNode log
2018-07-22 00:49:05,496 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:49:07,490 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:08,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:08,500 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:49:09,493 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:10,493 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:11,495 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:11,506 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:49:12,495 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:13,496 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:14,498 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:14,512 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:49:15,498 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:16,500 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:17,500 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:17,518 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:49:18,502 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:19,503 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:20,504 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [JOURNALNODE_IP:8485]
2018-07-22 00:49:20,524 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:49:21,489 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [JOURNALNODE_IP:8485, JOURNALNODE_IP:8485, JOURNALNODE_IP:8485], stream=QuorumOutputStream starting at txid 203478))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1266)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1203)
at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1300)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5836)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1122)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
2018-07-22 00:49:21,491 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 203478
2018-07-22 00:49:21,494 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2018-07-22 00:49:21,496 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NameNode at ACTIVE_NAMENODE_IP/ACTIVE_NAMENODE_IP
/
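Since the log above shows repeated "Retrying connect to server" messages for port 8485, one thing worth checking is whether each JournalNode accepts TCP connections from the NameNode host. The following is only a sketch of such a check; the function name is hypothetical, and a successful TCP connect does not prove that the JournalNode RPC service itself is healthy.

```python
import socket


def journalnode_reachable(host: str, port: int = 8485, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

It could be run from each NameNode host against every JournalNode address, for example `journalnode_reachable("JOURNALNODE_IP")` with the real address substituted for the placeholder.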
Standby NameNode log
2018-07-22 00:43:51,605 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:43:53,341 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:43:53,341 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:43:54,609 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:43:55,336 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:43:56,336 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:43:56,347 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:43:56,347 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:43:57,338 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:43:57,615 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:43:58,339 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:43:59,340 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:43:59,353 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:43:59,353 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:00,342 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:44:00,621 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:01,342 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:44:02,343 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:44:02,359 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:02,359 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:03,345 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:44:03,627 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:04,345 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:44:05,347 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:44:05,365 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:05,365 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:06,347 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:44:06,633 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:07,348 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18014 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:44:08,350 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19015 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2018-07-22 00:44:08,371 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:08,371 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: JOURNALNODE_IP/JOURNALNODE_IP:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-22 00:44:09,336 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [JOURNALNODE_IP:8485, JOURNALNODE_IP:8485, JOURNALNODE_IP:8485]. Skipping.
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1508)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1532)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:214)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
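To compare how many JournalNodes responded in each log, the "Waited ... ms" lines can be summarized with a small parser. This is a rough sketch that assumes the exact message format shown in the logs above; the function name and the returned dictionary keys are hypothetical.

```python
import re

# Matches QuorumJournalManager progress lines such as:
# "Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [...]"
WAIT_RE = re.compile(
    r"Waited (?P<waited>\d+) ms \(timeout=(?P<timeout>\d+) ms\) "
    r"for a response for (?P<op>\w+)\. (?P<status>.*)$"
)

SUCCESS_PREFIX = "Succeeded so far: ["


def parse_wait_line(line: str):
    """Extract operation, wait time, timeout, and acknowledgement count from one log line."""
    match = WAIT_RE.search(line)
    if match is None:
        return None
    status = match.group("status").strip()
    if status.startswith(SUCCESS_PREFIX):
        inner = status[len(SUCCESS_PREFIX):].rstrip("]")
        acks = len([node for node in inner.split(",") if node.strip()])
    else:
        # Any other status (e.g. "No responses yet.") is counted as zero acknowledgements.
        acks = 0
    return {
        "op": match.group("op"),
        "waited_ms": int(match.group("waited")),
        "timeout_ms": int(match.group("timeout")),
        "acks": acks,
    }
```

Applied to the logs in this post, the active NameNode lines report one acknowledgement for `sendEdits`, while the standby lines report none for `selectInputStreams`.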
JournalNode log
2018-07-22 02:43:04,209 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8485: readAndProcess from client ACTIVE_NAMENODE_IP threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.hadoop.ipc.Server.channelRead(Server.java:2603)
at org.apache.hadoop.ipc.Server.access$2800(Server.java:136)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1481)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:771)
at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:637)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:608)
2018-07-22 02:43:04,212 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8485: readAndProcess from client STANDBY_NAMENODE_IP threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.hadoop.ipc.Server.channelRead(Server.java:2603)
at org.apache.hadoop.ipc.Server.access$2800(Server.java:136)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1481)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:771)
at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:637)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:608)
Note
In the logs, I replaced the IP addresses of the active NameNode, the standby NameNode, and the JournalNodes with ACTIVE_NAMENODE_IP, STANDBY_NAMENODE_IP, and JOURNALNODE_IP respectively.
So what is the cause of the NameNode failure?