Apache Flink Resource Manager (App Master) не может выделить новых диспетчеров задач после того, как первый из них был убит после простоя - PullRequest
1 голос
/ 24 октября 2019

Я использую Apache Flink 1.9.1 на Apache Mesos 1.8.1, используя более или менее стандартные настройки для обоих кластеров из 3 узлов. Первый узел (192.168.10.11) запускает мастер Mesos и мастер приложений Flink. Оба других узла (192.168.10.13, 192.168.10.14) являются несколько подчиненными узлами, на которых выполняются агенты mesos и которые имеют доступ к док-контейнеру Flink Worker. Flink запускается из Docker, но Mesos Master и агенты работают на хосте.

Сначала все работает нормально. Первое пакетное задание, которое я отправляю Flink, правильно занято в App Master, Resource Manager успешно согласовывает с Mesos, и Resource Manager Framework получает новую рабочую задачу, отправленную на один из подчиненных узлов. Задача успешно завершается, и клиентская программа получает свои результаты.

Через несколько минут происходит истечение времени ожидания ResourceManager, и из-за превышения времени ожидания рабочая задача Flink (контейнер Docker) уничтожается. Пока все хорошо.

Но после этого момента, если я снова отправлю ту же самую работу точно так же, как в первый раз, Менеджер ресурсов ничего не сделает и не попытается создать какую-либоновый работник. Работа длится 5 минут и терпит неудачу после тайм-аута.

Если бы кто-то мог указать мне на то, что может происходить неправильно, я был бы очень счастлив.

Ниже выписка изсоответствующие журналы мастера приложений Flink и мастера Mesos:

  1. Запуск мастера приложений Flink
2019-10-24 06:36:05,191 INFO  ....entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2019-10-24 06:36:05,195 INFO  ....entrypoint.ClusterEntrypoint         -  Starting MesosSessionClusterEntrypoint (Version: 1.9.1, Rev:4d56de8, Date:30.09.2019 @ 11:32:19 CST)
2019-10-24 06:36:05,195 INFO  ....entrypoint.ClusterEntrypoint         -  OS current user: flink
2019-10-24 06:36:07,045 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.222-b10
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  Maximum heap size: 2169 MiBytes
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  JAVA_HOME: (not set)
2019-10-24 06:36:07,519 INFO  ....entrypoint.ClusterEntrypoint         -  Hadoop version: 2.8.3
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -  JVM Options:
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlog.file=/usr/local/lib/flink-1.9.1/log/flink--mesos-appmaster-deb-node1.log
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/usr/local/lib/flink-1.9.1/conf/log4j.properties
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/usr/local/lib/flink-1.9.1/conf/logback.xml
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -  Program Arguments: (none)
(...)
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2019-10-24 06:36:07,526 INFO  ....entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-10-24 06:36:07,574 INFO  ....configuration.GlobalConfiguration            - Loading configuration property
----
properties summarized :
----
jobmanager.rpc.address, 192.168.10.11
jobmanager.rpc.port, 6123
jobmanager.heap.size, 1024m
taskmanager.heap.size, 1024m
taskmanager.numberOfTaskSlots, 2
parallelism.default, 2
state.checkpoints.dir, file:///var/lib/flink/data/ckeckpoints
state.savepoints.dir, file:///var/lib/flink/data/savepoints
jobmanager.execution.failover-strategy, region
rest.port, 8081
rest.bind-address, 0.0.0.0
jobmanager.archive.fs.dir, file:///var/lib/flink/completed_jobs/
historyserver.archive.fs.dir, file:///var/lib/flink/completed_jobs/
mesos.resourcemanager.tasks.container.type, docker
mesos.resourcemanager.tasks.container.image.name, eskimo:flink-worker
mesos.resourcemanager.tasks.bootstrap-cmd, export FLINK_HOME=/usr/local/lib/flink/
mesos.resourcemanager.tasks.container.volumes, /var/log/flink:/var/log/flink:RW,/var/lib/flink:/var/lib/flink:RW,/etc:/host_etc:RO
mesos.resourcemanager.tasks.cpus, 2
mesos.resourcemanager.tasks.mem, 1024
mesos.master, zk://192.168.10.11:2181/mesos
----
2019-10-24 06:36:07,753 INFO  ....entrypoint.ClusterEntrypoint         - Starting MesosSessionClusterEntrypoint.
(...)
2019-10-24 06:36:09,393 INFO  ....runtime.rpc.akka.AkkaRpcServiceUtils         - Trying to start actor system at 192.168.10.11:6123
(...)
(bla bla bla akka starting and all the rest)
(...)
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  - Registering as new framework.
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  - --------------------------------------------------------------------------------
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -  Mesos Info:
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Master URL: zk://192.168.10.11:2181/mesos
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -  Framework Info:
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     ID: (none)
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Name: Flink
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Failover Timeout (secs): 604800.0
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Role: *
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Capabilities: (none)
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Principal: (none)
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Host: 192.168.10.11
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Web UI: http://192.168.10.11:8081
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  - --------------------------------------------------------------------------------
2019-10-24 06:36:14,877 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@192.168.10.11:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
(...)
2019-10-24 06:36:15,714 INFO  ....mesos.scheduler.ConnectionMonitor            - Connecting to Mesos...
2019-10-24 06:36:15,764 INFO  ....MesosResourceManager  - Mesos resource manager started.
(...)
2019-10-24 06:36:46,033 INFO  ....mesos.scheduler.ConnectionMonitor            - Connected to Mesos as framework ID e1432c43-f540-4d53-9cc7-b06e0f2a6525-0000.

Логи в mesos-master сразу после запуска Flink App Master
I1024 06:36:52.096225    19 master.cpp:2920] Received SUBSCRIBE call for framework 'Flink' at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.097496    25 master.cpp:3001] Subscribing framework Flink with checkpointing enabled and capabilities [  ]
I1024 06:36:52.098603    25 master.cpp:10463] Adding framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with roles {  } suppressed
I1024 06:36:52.099602    18 hierarchical.cpp:373] Added framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:36:52.101519    18 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O1, 8977e508-0698-4c40-9317-cdc3cc8982f4-O2 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.185779    20 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.186064    25 hierarchical.cpp:1346] Suppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
Записывает в Flink App Master при первой подаче работы (которая работает нормально)
2019-10-24 06:37:18,800 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Received JobGraph submission 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,804 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Submitting job 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,889 INFO  ....runtime.rpc.akka.AkkaRpcService     - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_0 .
2019-10-24 06:37:18,933 INFO  ....JobMaster                           - Initializing job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305).
(...)
2019-10-24 06:37:19,160 INFO  ....jobmaster.JobManagerRunner          - JobManager runner for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0.
2019-10-24 06:37:19,175 INFO  ....JobMaster                           - Starting execution of job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) under job master id 00000000000000000000000000000000.
2019-10-24 06:37:19,177 INFO  ....ExecutionGraph                      - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state CREATED to RUNNING.
2019-10-24 06:37:19,191 INFO  ....ExecutionGraph                      - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from CREATED to SCHEDULED.
2019-10-24 06:37:19,229 INFO  ....SlotPoolImpl                        - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}]
2019-10-24 06:37:19,265 INFO  ....JobMaster                           - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:37:19,272 INFO  ....JobMaster                           - Resolved ResourceManager address, beginning registration
2019-10-24 06:37:19,273 INFO  ....JobMaster                           - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:37:19,284 INFO  ....MesosResourceManager                - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:19,296 INFO  (...)
2019-10-24 06:37:19,305 INFO  ....JobMaster                           - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:37:19,306 INFO  ....SlotPoolImpl                        - Requesting new slot [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:37:19,311 INFO  ....MesosResourceManager                - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 50934ae259b85a449e9a789f82287305 with allocation id 907586ba9ece54f20c23341276e6bac8.
2019-10-24 06:37:19,313 INFO  ....MesosResourceManager                - Starting a new worker.
2019-10-24 06:37:19,458 INFO  ....MesosResourceManager                - Scheduling Mesos task taskmanager-00001 with (1024.0 MB, 2.0 cpus).
2019-10-24 06:37:19,473 INFO  ....mesos.scheduler.LaunchCoordinator   - Now gathering offers for at least 1 task(s).
2019-10-24 06:37:19,495 INFO  ....mesos.scheduler.LaunchCoordinator   - Received offer(s) of 9432.0 MB, 4.0 cpus, 0.0 gpus:
2019-10-24 06:37:19,499 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:19,500 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O4 from 192.168.10.14 of 4793.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:20,517 INFO  ....mesos.scheduler.LaunchCoordinator   - Processing 1 task(s) against 2 new offer(s) plus outstanding offers.
2019-10-24 06:37:20,578 INFO  ....mesos.scheduler.LaunchCoordinator   - Resources considered: (note: expired offers not deducted from below)
2019-10-24 06:37:20,587 INFO  ....mesos.scheduler.LaunchCoordinator   -   192.168.10.13 has 4639.0 MB, 2.0 cpus, null gpus
2019-10-24 06:37:20,841 INFO  ....MesosResourceManager                - Launching Mesos task taskmanager-00001 on host 192.168.10.14.
2019-10-24 06:37:20,841 INFO  ....mesos.scheduler.LaunchCoordinator   - Launched 1 task(s) on 192.168.10.14 using 1 offer(s):
2019-10-24 06:37:20,842 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O4
2019-10-24 06:37:20,843 INFO  ....mesos.scheduler.LaunchCoordinator   - No longer gathering offers; all requests fulfilled.
2019-10-24 06:37:20,844 INFO  com.netflix.fenzo.TaskScheduler         - Expiring all leases
2019-10-24 06:37:20,846 INFO  ....mesos.scheduler.LaunchCoordinator   - Declined offer 8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus.
2019-10-24 06:37:26,251 INFO  ....mesos.scheduler.TaskMonitor         - Mesos task taskmanager-00001 is running.
2019-10-24 06:37:29,575 INFO  ....MesosResourceManager                - Registering TaskManager with ResourceID taskmanager-00001 (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) at ResourceManager
2019-10-24 06:37:29,693 INFO  ....ExecutionGraph                      - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from SCHEDULED to DEPLOYING.
(...)
(everything going on fine)
(...)
2019-10-24 06:37:30,041 INFO  ....ExecutionGraph             - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state RUNNING to FINISHED.
2019-10-24 06:37:30,076 INFO  ....StandaloneDispatche r      - Job 50934ae259b85a449e9a789f82287305 reached globally terminal state FINISHED.
2019-10-24 06:37:30,205 INFO  ....FsJobArchivist             - Job 50934ae259b85a449e9a789f82287305 has been archived at file:/var/lib/flink/completed_jobs/50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:30,206 INFO  ....JobMaster                  - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019(50934ae259b85a449e9a789f82287305).
2019-10-24 06:37:30,226 INFO  ....SlotPoolImpl               - Suspending SlotPool.
2019-10-24 06:37:30,227 INFO  ....JobMaster                  - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:37:30,227 INFO  ....SlotPoolImpl               - Stopping SlotPool.
2019-10-24 06:37:30,232 INFO  ....MesosResourceManager  - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305 from the resource manager.
2019-10-24 06:37:30,232 INFO  ....jobmaster.JobManagerRunner  - JobManagerRunner already shutdown.
Соответствующее время в мезо:
I1024 06:37:19.475060    23 hierarchical.cpp:1387] Unsuppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.475097    23 hierarchical.cpp:1409] Revived roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.476991    19 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O3, 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.845321    19 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
(...)
I1024 06:37:20.856938    23 master.cpp:4590] Processing ACCEPT call for offers: [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.860610    20 master.cpp:4151] Adding task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:20.861378    20 master.cpp:5569] Launching task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"ports","ranges":{"range":[{"begin":31000,"end":31001}]},"type":"RANGES"}] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) on  new executor
I1024 06:37:22.788302    24 master.cpp:8703] Status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:22.788444    24 master.cpp:8760] Forwarding status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:22.790689    24 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_STARTING, status update state: TASK_STARTING)
(...)
I1024 06:37:26.241962    23 master.cpp:8703] Status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:26.242094    23 master.cpp:8760] Forwarding status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:26.243048    23 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_RUNNING, status update state: TASK_RUNNING)

Через несколько минут простаивает TaskManager, и TaskManager уничтожается
2019-10-24 06:38:21,649 INFO  ....MesosResourceManager                - Stopping worker taskmanager-00001.
2019-10-24 06:38:21,672 INFO  ....MesosResourceManager                - Closing TaskExecutor connection taskmanager-00001 because: TaskExecutor exceeded the idle timeout.
2019-10-24 06:38:21,760 WARN  ....MesosResourceManager                - Discard registration from TaskExecutor taskmanager-00001 at (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) because the framework did not recognize it
2019-10-24 06:38:26,674 INFO  ....mesos.scheduler.TaskMonitor         - Re-attempting to kill Mesos task taskmanager-00001.
2019-10-24 06:38:26,891 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@192.168.10.14:31001] has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2019-10-24 06:38:27,318 INFO  ....mesos.scheduler.TaskMonitor         - Mesos task taskmanager-00001 exited as planned.
2019-10-24 06:38:27,326 INFO  ....MesosResourceManager                - Worker taskmanager-00001 finished successfully with message: Container exited with status 137
Соответствующие журналы в мезо:
I1024 06:38:26.678206    20 master.cpp:6261] Processing KILL call for task 'taskmanager-00001' of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:26.678450    20 master.cpp:6339] Telling agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) to kill task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:27.304287    20 master.cpp:8703] Status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:38:27.304426    20 master.cpp:8760] Forwarding status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
(...)
I1024 06:38:27.319533    19 master.cpp:11770] Removing task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
Вторая отправка задания в App Master (тот, который остается зависает и останавливается после тайм-аута без какой-либо обработки)
2019-10-24 06:39:25,220 INFO  ....StandaloneDispatcher    - Received JobGraph submission 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,220 INFO  ....StandaloneDispatcher    - Submitting job 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,221 INFO  ....AkkaRpcService          - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_1 .
2019-10-24 06:39:25,225 INFO  ....JobMaster               - Initializing job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,228 INFO  ....JobMaster               - Using restart strategy NoRestartStrategy for Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,230 INFO  ....ExecutionGraph          - Job recovers via failover strategy: New Pipelined Region Failover
2019-10-24 06:39:25,231 INFO  ....JobMaster               - Running initialization on master for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,232 INFO  ....JobMaster               - Successfully ran initialization on master in 1 ms.
2019-10-24 06:39:25,234 INFO  ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Start building failover regions.
2019-10-24 06:39:25,234 INFO  ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Created 1 failover regions.
2019-10-24 06:39:25,235 INFO  ....jobmaster.JobManagerRunner           - JobManager runner for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1.
2019-10-24 06:39:25,236 INFO  ....JobMaster               - Starting execution of job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) under job master id 00000000000000000000000000000000.
2019-10-24 06:39:25,236 INFO  ....ExecutionGraph          - Job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) switched from state CREATED to RUNNING.
2019-10-24 06:39:25,236 INFO  ....ExecutionGraph          - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from CREATED to SCHEDULED.
2019-10-24 06:39:25,237 INFO  ....SlotPoolImpl            - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}]
2019-10-24 06:39:25,238 INFO  ....JobMaster               - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:39:25,239 INFO  ....JobMaster               - Resolved ResourceManager address, beginning registration
2019-10-24 06:39:25,239 INFO  ....JobMaster               - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:39:25,240 INFO  ....MesosResourceManager  - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:39:25,245 INFO  ....JobMaster               - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:39:25,246 INFO  ....SlotPoolImpl            - Requesting new slot [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:39:25,246 INFO  ....MesosResourceManager  - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 4888c165c1a9eae4dd69cd533c66a5c1 with allocation id 04a6f120eed690c9c7cdc8c7c0f67ed7.

На этом этапе журналы сохраняются в течение нескольких минут. В Мезосе нет никакой активности.

Мне кажется, что проблема явно проявляется в Flink App Master / Resource Manager, поскольку опять-таки отмечается, что это происходит в мезо. В мезо нет никакой активностиосновные журналы в этот момент (и консоль Mesos ничего не показывает, новая задача не отправляется или не выполняется, оба агента ничего не делают).

И через несколько минут происходит тайм-аут, и работа останавливается в состоянии FAILED.

Второе задание не выполнено по истечении времени ожидания
2019-10-24 06:44:25,245 INFO  ....ExecutionGraph        - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) org.apache.flink.api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from SCHEDULED to FAILED.
.... NoResourceAvailableException: Could not allocate enough slots to run the job. Please make sure that the cluster has enough resources.
    at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:460)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    (stack traces goes on)
2019-10-24 06:44:25,261 INFO  ....executiongraph...AdaptedRestartPipelinedRegionStrategyNG  - Fail to pass the restart strategy validation in region failover. Fallback to fail global.
(...)
(a few more stack traces)
(...)
2019-10-24 06:44:25,296 INFO  ....ExecutionGraph                   - Discarding the results produced by task execution d7d19ea639a4adf9246f905f1765b338.
2019-10-24 06:44:25,296 INFO  ....SlotPoolImpl                     - Pending slot request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] timed out.
2019-10-24 06:44:25,301 INFO  ....dispatcher.StandaloneDispatcher  - Job 4888c165c1a9eae4dd69cd533c66a5c1 reached globally terminal state FAILED.
2019-10-24 06:44:25,340 INFO  ....history.FsJobArchivist           - Job 4888c165c1a9eae4dd69cd533c66a5c1 has been archived at file:/var/lib/flink/completed_jobs/4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:44:25,341 INFO  ....JobMaster                        - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019(4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:44:25,344 INFO  ....SlotPoolImpl                     - Suspending SlotPool.
2019-10-24 06:44:25,344 INFO  ....JobMaster                        - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:44:25,344 INFO  ....SlotPoolImpl                     - Stopping SlotPool.
2019-10-24 06:44:25,344 INFO  ....jobmaster.JobManagerRunner       - JobManagerRunner already shutdown.
2019-10-24 06:44:25,344 INFO  ....mesos....MesosResourceManager    - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1 from the resource manager.

Для записи отправляемое мной задание представляет собой простой пакетный Hello World от http://diegoreico.com/environments/runningflinkgclusterwithzeppelin/

ОБНОВЛЕНИЕ 2019-10-24 - 13:00

Доступны следующие выдержки из журнала с уровнем DEBUG:

Обновление 2019-10-24 - 17:30

Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...