Slow datanode and namenode startup
0 votes
23 October 2019

I'm relatively new to setting up and administering HDFS and have found myself administering a proof-of-concept HDFS installation. The namenode, datanode, secondary namenode, etc. all run on a single machine (128 cores, 512 GB of RAM, 15 TB of storage). The server holds about 23 million blocks occupying just under 1 TB of disk space; the block count is unusually high because the data records are quite small. The problem we are running into is that namenode/datanode startup takes a very long time (6 hours). My guess is that the datanode takes too long to prepare its block report and send it to the namenode, but that is only a guess. Meanwhile, CPU utilization is quite low and the namenode heap (128 GB) stays at around 30% the whole time. In the namenode logs I see the following:

2019-10-23 12:39:15,761 INFO  ipc.Server (Server.java:logException(2428)) - IPC Server handler 2330 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 22.243.33.26:41190 Call#1303 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-10-23 12:40:15,784 INFO  ipc.Server (Server.java:logException(2428)) - IPC Server handler 2330 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 22.243.33.26:41322 Call#1305 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-10-23 12:41:15,807 INFO  ipc.Server (Server.java:logException(2428)) - IPC Server handler 898 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 22.243.33.26:41488 Call#1307 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
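
For what it is worth, while the namenode sits in this state I can watch safe-mode progress from the command line; a minimal sketch, assuming the HDFS client is available on the namenode host:

# Current safe-mode status (mirrors the "Name node is in safe mode" messages above)
hdfs dfsadmin -safemode get
# Effective safe-mode threshold; Hadoop's default is 0.999f, the log above reports 1.0000
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct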

Meanwhile, the datanode logs show the following:

2019-10-23 12:42:06,445 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1241ms
GC pool 'ParNew' had collection(s): count=1 time=1624ms
2019-10-23 12:42:08,209 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1264ms
GC pool 'ParNew' had collection(s): count=1 time=1616ms
2019-10-23 12:42:10,209 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1499ms
GC pool 'ParNew' had collection(s): count=1 time=1858ms
2019-10-23 12:42:12,405 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1695ms
GC pool 'ParNew' had collection(s): count=1 time=2054ms
2019-10-23 12:42:14,270 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1364ms
GC pool 'ParNew' had collection(s): count=1 time=1742ms
2019-10-23 12:42:16,152 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1381ms
GC pool 'ParNew' had collection(s): count=1 time=1756ms
2019-10-23 12:42:18,128 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1475ms
GC pool 'ParNew' had collection(s): count=1 time=1802ms
2019-10-23 12:42:20,126 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1498ms
GC pool 'ParNew' had collection(s): count=1 time=1872ms
2019-10-23 12:42:21,999 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1372ms
GC pool 'ParNew' had collection(s): count=1 time=1746ms
2019-10-23 12:42:24,108 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1608ms
GC pool 'ParNew' had collection(s): count=1 time=1985ms

During this time the datanode process shows as running in Ambari, but not as live (screenshot omitted).
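
Given the repeating ParNew pauses, one thing I am looking at is the datanode heap itself. A hedged sketch only: on an Ambari/HDP-style install the datanode JVM options come from HADOOP_DATANODE_OPTS in hadoop-env.sh (normally managed by Ambari, so changes would go through the Ambari config rather than the file directly), and the sizes below are purely illustrative:

# Illustrative sizes only; every block replica tracked by the datanode costs some heap
export HADOOP_DATANODE_OPTS="-Xms8g -Xmx8g ${HADOOP_DATANODE_OPTS}"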

Edit 1: Once the datanode finally completes its startup work, I see the following in the datanode log files:

GC pool 'ParNew' had collection(s): count=1 time=1864ms
2019-10-23 13:05:26,724 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1597ms
GC pool 'ParNew' had collection(s): count=1 time=1979ms
2019-10-23 13:05:26,747 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-814497463-127.0.0.1-1558792659773 on volume /hadoop/hdfs/data/current: 12658576ms
2019-10-23 13:05:26,748 INFO  impl.FsDatasetImpl (FsVolumeList.java:getAllVolumesMap(219)) - Total time to add all replicas to map: 12658578ms
2019-10-23 13:05:26,794 INFO  datanode.VolumeScanner (VolumeScanner.java:findNextUsableBlockIter(383)) - Now scanning bpid BP-814497463-127.0.0.1-1558792659773 on volume /hadoop/hdfs/data
2019-10-23 13:05:26,796 INFO  datanode.DirectoryScanner (DirectoryScanner.java:start(334)) - Periodic Directory Tree Verification scan starting at 1571848750796 with interval 21600000
2019-10-23 13:05:26,800 INFO  datanode.DataNode (BPServiceActor.java:register(715)) - Block pool BP-814497463-127.0.0.1-1558792659773 (Datanode Uuid 0aff4a22-3f1a-485b-9aec-46fd881dfab0) service to xxxx.corp.com/22.243.33.26:8020 beginning handshake with NN
2019-10-23 13:05:26,939 INFO  datanode.DataNode (BPServiceActor.java:register(734)) - Block pool Block pool BP-814497463-127.0.0.1-1558792659773 (Datanode Uuid 0aff4a22-3f1a-485b-9aec-46fd881dfab0) service to xxxx.corp.com/22.243.33.26:8020 successfully registered with NN
2019-10-23 13:05:26,940 INFO  datanode.DataNode (DataNode.java:registerBlockPoolWithSecretManager(1443)) - Block token params received from NN: for block pool BP-814497463-127.0.0.1-1558792659773 keyUpdateInterval=600 min(s), tokenLifetime=600 min(s)
2019-10-23 13:05:26,952 INFO  block.BlockTokenSecretManager (BlockTokenSecretManager.java:addKeys(193)) - Setting block keys
2019-10-23 13:05:26,952 INFO  datanode.DataNode (BPServiceActor.java:offerService(569)) - For namenode xxxx.corp.com/22.243.33.26:8020 using BLOCKREPORT_INTERVAL of 21600000msec CACHEREPORT_INTERVAL of 10000msec Initial delay: 120000msec; heartBeatInterval=3000
2019-10-23 13:05:27,066 INFO  datanode.DataNode (BPOfferService.java:updateActorStatesFromHeartbeat(513)) - Namenode Block pool BP-814497463-127.0.0.1-1558792659773 (Datanode Uuid 0aff4a22-3f1a-485b-9aec-46fd881dfab0) service to xxxx.corp.com/22.243.33.26:8020 trying to claim ACTIVE state with txid=198275130
2019-10-23 13:05:27,066 INFO  datanode.DataNode (BPOfferService.java:updateActorStatesFromHeartbeat(525)) - Acknowledging ACTIVE Namenode Block pool BP-814497463-127.0.0.1-1558792659773 (Datanode Uuid 0aff4a22-3f1a-485b-9aec-46fd881dfab0) service to xxxx.corp.com/22.243.33.26:8020
2019-10-23 13:05:33,542 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1316ms
GC pool 'ParNew' had collection(s): count=1 time=1645ms
2019-10-23 13:05:41,591 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1547ms
GC pool 'ParNew' had collection(s): count=1 time=1859ms
2019-10-23 13:05:49,487 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1393ms
GC pool 'ParNew' had collection(s): count=1 time=1802ms
2019-10-23 13:05:57,902 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1413ms
GC pool 'ParNew' had collection(s): count=1 time=1889ms
2019-10-23 13:06:01,296 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1391ms
GC pool 'ParNew' had collection(s): count=1 time=1801ms
2019-10-23 13:06:04,065 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1268ms
GC pool 'ParNew' had collection(s): count=1 time=1761ms
2019-10-23 13:06:05,775 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1209ms
GC pool 'ParNew' had collection(s): count=1 time=1633ms
2019-10-23 13:06:05,785 INFO  web.DatanodeHttpServer (SimpleHttpProxyHandler.java:exceptionCaught(147)) - Proxy for / failed. cause:
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:368)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:891)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:243)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)
2019-10-23 13:06:07,425 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1148ms
GC pool 'ParNew' had collection(s): count=1 time=1567ms
2019-10-23 13:06:10,114 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1688ms
GC pool 'ParNew' had collection(s): count=1 time=1888ms
2019-10-23 13:06:10,253 INFO  datanode.DataNode (BPServiceActor.java:blockReport(395)) - Unsuccessfully sent block report 0x31d5073bda87,  containing 1 storage report(s), of which we sent 0. The reports had 23891422 total blocks and used 0 RPC(s). This took 5603 msec to generate and 6144 msecs for RPC and NN processing. Got back no commands.
2019-10-23 13:06:10,254 WARN  datanode.DataNode (BPServiceActor.java:offerService(681)) - IOException in offerService
java.io.EOFException: End of File Exception between local host is: "xxxx.corp.com/hidden.private.ip.address"; destination host is: "xxxx.corp.com":8020; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy15.blockReport(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:216)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:377)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:653)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:793)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1119)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1014)
2019-10-23 13:06:13,307 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1193ms
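
One knob I have been reading about in connection with the failed block report above is dfs.blockreport.split.threshold (above this block count a datanode splits its block report into one RPC per storage), although I am not sure it helps here, since with a single data directory the report is one storage report either way. A sketch of how I check the effective value:

# Hadoop's default is 1000000; the log above still shows "containing 1 storage report(s)"
hdfs getconf -confKey dfs.blockreport.split.threshold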

Edit 2: After the datanode registers, the following can be seen in the namenode logs:

2019-10-23 13:05:16,516 INFO  ipc.Server (Server.java:logException(2428)) - IPC Server handler 986 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from hidden.private.ip.address:46066 Call#1355 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-10-23 13:05:26,854 INFO  hdfs.StateChange (DatanodeManager.java:registerDatanode(954)) - BLOCK* registerDatanode: from DatanodeRegistration(hidden.private.ip.address:50010, datanodeUuid=0aff4a22-3f1a-485b-9aec-46fd881dfab0, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-56;cid=CID-49b9105e-fc0d-4ea4-9d2f-caceb95ce4bb;nsid=875919329;c=0) storage 0aff4a22-3f1a-485b-9aec-46fd881dfab0
2019-10-23 13:05:26,860 INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateHeartbeatState(401)) - Number of failed storage changes from 0 to 0
2019-10-23 13:05:26,916 INFO  net.NetworkTopology (NetworkTopology.java:add(427)) - Adding a new node: /default-rack/hidden.private.ip.address:50010
2019-10-23 13:05:26,917 INFO  blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:registerNode(205)) - Registered DN 0aff4a22-3f1a-485b-9aec-46fd881dfab0 (hidden.private.ip.address:50010).
2019-10-23 13:05:26,919 INFO  hdfs.StateChange (FSNamesystem.java:reportStatus(5952)) - STATE* Safe mode ON.
The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
The number of live datanodes 1 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-10-23 13:05:27,028 INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateHeartbeatState(401)) - Number of failed storage changes from 0 to 0
2019-10-23 13:05:27,032 INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(854)) - Adding new storage ID DS-ea7ed3be-90ad-4424-a00c-577601814d81 for DN hidden.private.ip.address:50010
2019-10-23 13:06:08,024 WARN  ipc.Server (Server.java:checkDataLength(1728)) - Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
2019-10-23 13:06:08,026 INFO  ipc.Server (Server.java:doRead(1006)) - Socket Reader #1 for port 8020: readAndProcess from client hidden.private.ip.address threw exception [java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address]
java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
        at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1729)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1791)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:995)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:852)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:823)
2019-10-23 13:06:16,541 INFO  ipc.Server (Server.java:logException(2428)) - IPC Server handler 2890 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from hidden.private.ip.address:46214 Call#1357 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
The number of live datanodes 1 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-10-23 13:06:22,081 WARN  ipc.Server (Server.java:checkDataLength(1728)) - Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
2019-10-23 13:06:22,081 INFO  ipc.Server (Server.java:doRead(1006)) - Socket Reader #1 for port 8020: readAndProcess from client hidden.private.ip.address threw exception [java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address]
java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
        at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1729)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1791)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:995)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:852)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:823)
2019-10-23 13:06:33,432 WARN  ipc.Server (Server.java:checkDataLength(1728)) - Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
2019-10-23 13:06:33,432 INFO  ipc.Server (Server.java:doRead(1006)) - Socket Reader #1 for port 8020: readAndProcess from client hidden.private.ip.address threw exception [java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address]
java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
        at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1729)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1791)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:995)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:852)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:823)
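
For context on the "maximum configured RPC length" messages above: that limit is ipc.maximum.data.length from core-site.xml (Hadoop's default is 67108864, i.e. 64 MB), and the incoming block report of 291562685 bytes is just over the value currently configured here. A sketch of how I check the effective value on the namenode host:

# The block report RPC must fit under this limit for the namenode to accept it
hdfs getconf -confKey ipc.maximum.data.length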

Questions:

  1. Is this startup time normal for this volume of data?
  2. Is there anything I can do to debug or troubleshoot this further? At the moment my guess is that it points to a heap problem, since the GC pool message keeps appearing every few seconds.
  3. Are there any recommendations for improving the startup time?
  4. Are there any guidelines for estimating/measuring the startup time given the size of the data or the number of blocks?
  5. Why does the IPC data length keep having to change? I already had to double it once before (see the sketch after this list). Is this related to the data size?
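
A rough back-of-the-envelope for questions 4 and 5, using only the figures from the logs above: the rejected block report RPC was 291562685 bytes for 23891422 replicas, i.e. roughly 12 bytes of serialized report per block, so the required ipc.maximum.data.length appears to grow more or less linearly with the block count.

# Purely arithmetic on the numbers already present in the logs above
echo $((291562685 / 23891422))   # ~12 bytes per block in the serialized report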

Thanks in advance.

...