JobManager не перенаправляет все запросы автоматически на оставшийся / работающий TaskManager - PullRequest
0 голосов
/ 28 декабря 2018

Описание проблемы

  1. 2 компьютера (203 204)
  2. создали автономный режим кластера HA Flink v1.6.1
  3. и запустите диспетчер заданий и диспетчер задач (2 слота задач) на каждом компьютере
  4. После того, как я запускаю задание (примеры SocketWindowWordCount.jar ./flink run ../examples/streaming/SocketWindowWordCount.jar --hostname 10.1.2.9 --port 9000) на узле JobManager, я убиваю работающий экземпляр TaskManager.
  5. Веб-панель управления. Я вижу, что задание отменено, а затем не выполнено. Изображение веб-панели управления

flink-conf.yaml

state.backend: filesystem
state.checkpoints.dir: hdfs://10.1.2.109:8020/wulin/flink-checkpoints
rest.port: 9081
blob.server.port: 6124
query.server.port: 6125
web.tmpdir: /home/flink/deploy/webTmp
web.log.path: /home/flink/deploy/log
io.tmp.dirs: /home/flink/deploy/taskManagerTmp
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.0.1.79:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: flink
high-availability.storageDir: hdfs://10.1.2.109:8020/wulin
security.kerberos.login.principal: xxxx
security.kerberos.login.keytab: /home/ctu/flink/flink-1.6/conf/user.keytab

полные журналы

log-standalonesession-203 срубы taskexecutor-203 срубы standalonesession-204

исключение

убить рабочую ТМ, получить вот такое исключение

2018-12-28 11:04:27,877 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,660 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hz203/10.0.0.203:42861
2018-12-28 11:04:28,660 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,678 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Closing TaskExecutor connection 0f41bca09600cd25000e19801076fa1f because: The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f  timed out.
2018-12-28 11:04:28,678 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Unregister TaskManager dcf3bb5b7ed2208cf45b658d212fd8d2 from the SlotManager.
2018-12-28 11:04:28,678 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Socket Stream -> Flat Map (1/1) (88aa62ad152f4df6b39a969dd32c0249) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot 0f41bca09600cd25000e19801076fa1f_0 was removed.
        at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
        at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
        at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
        at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
        at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:803)
        at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1116)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-12-28 11:04:28,680 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Socket Window WordCount (61f55876e79934d515c163d095d706a6) switched from state RUNNING to FAILING.

отправить задание

запустить ./bin/flink run -d ./examples/streaming/SocketWindowWordCount.jar --port 9000 --hostname 10.1.2.9, получить журналы JM вот так

2018-12-28 19:20:01,354 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Starting execution of job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291)
2018-12-28 19:20:01,354 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291) switched from state CREATED to RUNNING.
2018-12-28 19:20:01,356 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,359 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,364 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e33a40832a3922897470fb76bcf76b29}]
2018-12-28 19:20:01,367 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Connecting to ResourceManager akka.tcp://flink@hz203:46596/user/resourcemanager(b22f96303e74df23645fe4567f884b9e)
2018-12-28 19:20:01,370 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Resolved ResourceManager address, beginning registration
2018-12-28 19:20:01,370 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-28 19:20:01,371 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/5cdb91c15ee12ec6e74256eed10b5291/job_manager_lock.
2018-12-28 19:20:01,371 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,431 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Registered job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,432 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - JobManager successfully registered at ResourceManager, leader id: b22f96303e74df23645fe4567f884b9e.
2018-12-28 19:20:01,433 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Requesting new slot [SlotRequestId{e33a40832a3922897470fb76bcf76b29}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-12-28 19:20:01,434 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 5cdb91c15ee12ec6e74256eed10b5291 with allocation id AllocationID{f7a24e609e2ec618ccb456076049fa3b}.
2018-12-28 19:20:01,510 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,511 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Source: Socket Stream -> Flat Map (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,515 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,515 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,674 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:01,708 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:43,267 INFO  org.apache.flink.runtime.blob.BlobClient                      - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-513fbe1e6ddf69d10689eccf4c65da97 from hz203/10.0.0.203:6124
2018-12-28 19:20:48,339 INFO  org.apache.flink.runtime.blob.BlobClient                      - Downloading null/t-dd915bb9821ff6ced34dd5e489966b674de5a48f-7ea2600930e5fc5a4fbb7d47ee198789 from hz203/10.0.0.203:6124
2018-12-28 19:20:52,623 INFO  org.apache.flink.runtime.blob.BlobClient                      - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-0bd1ab86fa4cc54daeb472079bfbea8c from hz203/10.0.0.203:6124

kill TM

Тело ограничено 30000 символами.пожалуйста, прочитайте это JM журналы когда убить TM

1 Ответ

0 голосов
/ 28 декабря 2018

Журналы показывают, что ваш RestartStrategy исчерпал свои попытки перезапуска или что RestartStrategy не было настроено.Пожалуйста, проверьте, указали ли вы RestartStrategy в вашей программе через env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L)) или flink-conf.yaml через restart-strategy: fixed-delay.Если вы хотите узнать больше о стратегиях перезапуска Flink, ознакомьтесь с документацией .

...