Flink timeout exception
June 8, 2019

I have a question about Flink. I am running an application on a local cluster with 1 TaskManager and 4 task slots.

After the application has been running for some time, I get a timeout error:

java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id feea6a6702a0cf960ae2847b5bd25665 timed out.

I have seen a few posts on this topic, but none of them has an answer. Could you help me find the root cause or identify possible problems?

I am using Flink version 1.5.3.

It seems that the Docker containers of the TaskManager and the JobManager stop when this happens.

Here is the error trace from the JobManager container logs:

2019-06-09 13:31:06,300 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Socket Window NgsiEvent (ef3a860de48d54544d973754c6170d8b) switched from state FAILING to FAILED.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 63dbab620797b84da023b33578478238 timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1609)
    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-09 13:31:06,308 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Could not restart the job Socket Window NgsiEvent (ef3a860de48d54544d973754c6170d8b) because the restart strategy prevented it.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 63dbab620797b84da023b33578478238 timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1609)
    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-09 13:31:06,317 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping checkpoint coordinator for job ef3a860de48d54544d973754c6170d8b.
2019-06-09 13:31:06,322 INFO  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore  - Shutting down
2019-06-09 13:31:06,331 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@16363182f31f:36715] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@16363182f31f:36715]] Caused by: [16363182f31f]
2019-06-09 13:31:06,351 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job ef3a860de48d54544d973754c6170d8b reached globally terminal state FAILED.
2019-06-09 13:31:06,434 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping the JobMaster for job Socket Window NgsiEvent(ef3a860de48d54544d973754c6170d8b).
2019-06-09 13:31:06,447 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending SlotPool.
2019-06-09 13:31:06,448 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Close ResourceManager connection 883e842633b0fd9a2e53ab45778581fe: JobManager is shutting down..
2019-06-09 13:31:06,449 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcActor                - The rpc endpoint org.apache.flink.runtime.jobmaster.slotpool.SlotPool has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.LocalRpcInvocation until processing is started.
2019-06-09 13:31:06,457 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@jobmanager:6123/user/jobmanager_2 for job ef3a860de48d54544d973754c6170d8b from the resource manager.
2019-06-09 13:31:06,459 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping SlotPool.
2019-06-09 13:31:06,460 INFO  org.apache.flink.runtime.jobmaster.JobManagerRunner           - JobManagerRunner already shutdown.
2019-06-09 13:31:16,304 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@16363182f31f:36715] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@16363182f31f:36715]] Caused by: [16363182f31f: Name or service not known]
2019-06-09 13:31:26,320 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@16363182f31f:36715] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@16363182f31f:36715]] Caused by: [16363182f31f: Name or service not known]
2019-06-09 13:31:36,286 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@16363182f31f:36715] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@16363182f31f:36715]] Caused by: [16363182f31f]
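
To check which TaskManager ids appear in heartbeat timeouts across a longer log, the exception lines can be scanned with a small helper. This is only a sketch: the class name `HeartbeatTimeoutScan` and the method `timedOutTaskManagers` are hypothetical and not part of Flink, and the pattern assumes the message format is exactly the one shown in the log above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink: collects TaskManager ids
// from "Heartbeat of TaskManager with id ... timed out" log lines.
public class HeartbeatTimeoutScan {
    // Assumes the exception message format seen in the JobManager log.
    private static final Pattern TIMEOUT = Pattern.compile(
        "Heartbeat of TaskManager with id ([0-9a-f]+) timed out");

    /** Returns the distinct TaskManager ids, in order of first appearance. */
    public static List<String> timedOutTaskManagers(List<String> logLines) {
        List<String> ids = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = TIMEOUT.matcher(line);
            // Skip ids already recorded so each TaskManager is listed once.
            if (m.find() && !ids.contains(m.group(1))) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Sample lines copied from the log above.
        List<String> sample = Arrays.asList(
            "java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 63dbab620797b84da023b33578478238 timed out.",
            "    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)",
            "java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 63dbab620797b84da023b33578478238 timed out.");
        System.out.println(timedOutTaskManagers(sample));
    }
}
```

In practice the lines would be read from the container log file instead of the hard-coded sample, for example with `java.nio.file.Files.readAllLines`.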

Thanks in advance!

...