Question

I am using Apache Flink 1.9.1 on Apache Mesos 1.8.1, with more or less standard settings for both, on a cluster of 3 nodes. The first node (192.168.10.11) runs the Mesos master and the Flink App Master. The other two nodes (192.168.10.13, 192.168.10.14) are Mesos slave nodes that run Mesos agents and have access to the Flink Worker Docker container. Flink is launched from Docker, but the Mesos master and agents run on the host.

Everything works fine at first. The first batch job I submit to Flink is properly taken in charge by the App Master, the Resource Manager successfully negotiates with Mesos, and the Resource Manager framework gets a new worker task dispatched to one of the slave nodes. The job completes successfully and the client program gets its results.

Some time later (about a minute after the job finishes, according to the logs below), the ResourceManager idle timeout occurs and the Flink worker task (Docker container) is killed as a consequence. So far so good.

But after that point, if I submit the very same job again in exactly the same way as the first time, the Resource Manager does nothing and does not attempt to start any new worker. The job hangs for 5 minutes and fails after a timeout.

If someone could point me to what might be going wrong, I would be very grateful.

Below is an extract of the relevant logs from the Flink App Master and the Mesos master:

Flink App Master startup

2019-10-24 06:36:05,191 INFO  ....entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2019-10-24 06:36:05,195 INFO  ....entrypoint.ClusterEntrypoint         -  Starting MesosSessionClusterEntrypoint (Version: 1.9.1, Rev:4d56de8, Date:30.09.2019 @ 11:32:19 CST)
2019-10-24 06:36:05,195 INFO  ....entrypoint.ClusterEntrypoint         -  OS current user: flink
2019-10-24 06:36:07,045 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.222-b10
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  Maximum heap size: 2169 MiBytes
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  JAVA_HOME: (not set)
2019-10-24 06:36:07,519 INFO  ....entrypoint.ClusterEntrypoint         -  Hadoop version: 2.8.3
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -  JVM Options:
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlog.file=/usr/local/lib/flink-1.9.1/log/flink--mesos-appmaster-deb-node1.log
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/usr/local/lib/flink-1.9.1/conf/log4j.properties
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/usr/local/lib/flink-1.9.1/conf/logback.xml
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -  Program Arguments: (none)
(...)
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2019-10-24 06:36:07,526 INFO  ....entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-10-24 06:36:07,574 INFO  ....configuration.GlobalConfiguration            - Loading configuration property
----
properties summarized :
----
jobmanager.rpc.address, 192.168.10.11
jobmanager.rpc.port, 6123
jobmanager.heap.size, 1024m
taskmanager.heap.size, 1024m
taskmanager.numberOfTaskSlots, 2
parallelism.default, 2
state.checkpoints.dir, file:///var/lib/flink/data/ckeckpoints
state.savepoints.dir, file:///var/lib/flink/data/savepoints
jobmanager.execution.failover-strategy, region
rest.port, 8081
rest.bind-address, 0.0.0.0
jobmanager.archive.fs.dir, file:///var/lib/flink/completed_jobs/
historyserver.archive.fs.dir, file:///var/lib/flink/completed_jobs/
mesos.resourcemanager.tasks.container.type, docker
mesos.resourcemanager.tasks.container.image.name, eskimo:flink-worker
mesos.resourcemanager.tasks.bootstrap-cmd, export FLINK_HOME=/usr/local/lib/flink/
mesos.resourcemanager.tasks.container.volumes, /var/log/flink:/var/log/flink:RW,/var/lib/flink:/var/lib/flink:RW,/etc:/host_etc:RO
mesos.resourcemanager.tasks.cpus, 2
mesos.resourcemanager.tasks.mem, 1024
mesos.master, zk://192.168.10.11:2181/mesos
----
2019-10-24 06:36:07,753 INFO  ....entrypoint.ClusterEntrypoint         - Starting MesosSessionClusterEntrypoint.
(...)
2019-10-24 06:36:09,393 INFO  ....runtime.rpc.akka.AkkaRpcServiceUtils         - Trying to start actor system at 192.168.10.11:6123
(...)
(bla bla bla akka starting and all the rest)
(...)
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  - Registering as new framework.
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  - --------------------------------------------------------------------------------
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -  Mesos Info:
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Master URL: zk://192.168.10.11:2181/mesos
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -  Framework Info:
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     ID: (none)
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Name: Flink
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Failover Timeout (secs): 604800.0
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Role: *
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Capabilities: (none)
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Principal: (none)
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Host: 192.168.10.11
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Web UI: http://192.168.10.11:8081
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  - --------------------------------------------------------------------------------
2019-10-24 06:36:14,877 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@192.168.10.11:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
(...)
2019-10-24 06:36:15,714 INFO  ....mesos.scheduler.ConnectionMonitor            - Connecting to Mesos...
2019-10-24 06:36:15,764 INFO  ....MesosResourceManager  - Mesos resource manager started.
(...)
2019-10-24 06:36:46,033 INFO  ....mesos.scheduler.ConnectionMonitor            - Connected to Mesos as framework ID e1432c43-f540-4d53-9cc7-b06e0f2a6525-0000.

Logs in mesos-master right after the Flink App Master starts

I1024 06:36:52.096225    19 master.cpp:2920] Received SUBSCRIBE call for framework 'Flink' at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.097496    25 master.cpp:3001] Subscribing framework Flink with checkpointing enabled and capabilities [  ]
I1024 06:36:52.098603    25 master.cpp:10463] Adding framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with roles {  } suppressed
I1024 06:36:52.099602    18 hierarchical.cpp:373] Added framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:36:52.101519    18 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O1, 8977e508-0698-4c40-9317-cdc3cc8982f4-O2 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.185779    20 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.186064    25 hierarchical.cpp:1346] Suppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000

Logs in the Flink App Master on the first job submission (which works fine)

2019-10-24 06:37:18,800 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Received JobGraph submission 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,804 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Submitting job 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,889 INFO  ....runtime.rpc.akka.AkkaRpcService     - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_0 .
2019-10-24 06:37:18,933 INFO  ....JobMaster                           - Initializing job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305).
(...)
2019-10-24 06:37:19,160 INFO  ....jobmaster.JobManagerRunner          - JobManager runner for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0.
2019-10-24 06:37:19,175 INFO  ....JobMaster                           - Starting execution of job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) under job master id 00000000000000000000000000000000.
2019-10-24 06:37:19,177 INFO  ....ExecutionGraph                      - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state CREATED to RUNNING.
2019-10-24 06:37:19,191 INFO  ....ExecutionGraph                      - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from CREATED to SCHEDULED.
2019-10-24 06:37:19,229 INFO  ....SlotPoolImpl                        - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}]
2019-10-24 06:37:19,265 INFO  ....JobMaster                           - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:37:19,272 INFO  ....JobMaster                           - Resolved ResourceManager address, beginning registration
2019-10-24 06:37:19,273 INFO  ....JobMaster                           - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:37:19,284 INFO  ....MesosResourceManager                - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:19,296 INFO  (...)
2019-10-24 06:37:19,305 INFO  ....JobMaster                           - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:37:19,306 INFO  ....SlotPoolImpl                        - Requesting new slot [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:37:19,311 INFO  ....MesosResourceManager                - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 50934ae259b85a449e9a789f82287305 with allocation id 907586ba9ece54f20c23341276e6bac8.
2019-10-24 06:37:19,313 INFO  ....MesosResourceManager                - Starting a new worker.
2019-10-24 06:37:19,458 INFO  ....MesosResourceManager                - Scheduling Mesos task taskmanager-00001 with (1024.0 MB, 2.0 cpus).
2019-10-24 06:37:19,473 INFO  ....mesos.scheduler.LaunchCoordinator   - Now gathering offers for at least 1 task(s).
2019-10-24 06:37:19,495 INFO  ....mesos.scheduler.LaunchCoordinator   - Received offer(s) of 9432.0 MB, 4.0 cpus, 0.0 gpus:
2019-10-24 06:37:19,499 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:19,500 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O4 from 192.168.10.14 of 4793.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:20,517 INFO  ....mesos.scheduler.LaunchCoordinator   - Processing 1 task(s) against 2 new offer(s) plus outstanding offers.
2019-10-24 06:37:20,578 INFO  ....mesos.scheduler.LaunchCoordinator   - Resources considered: (note: expired offers not deducted from below)
2019-10-24 06:37:20,587 INFO  ....mesos.scheduler.LaunchCoordinator   -   192.168.10.13 has 4639.0 MB, 2.0 cpus, null gpus
2019-10-24 06:37:20,841 INFO  ....MesosResourceManager                - Launching Mesos task taskmanager-00001 on host 192.168.10.14.
2019-10-24 06:37:20,841 INFO  ....mesos.scheduler.LaunchCoordinator   - Launched 1 task(s) on 192.168.10.14 using 1 offer(s):
2019-10-24 06:37:20,842 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O4
2019-10-24 06:37:20,843 INFO  ....mesos.scheduler.LaunchCoordinator   - No longer gathering offers; all requests fulfilled.
2019-10-24 06:37:20,844 INFO  com.netflix.fenzo.TaskScheduler         - Expiring all leases
2019-10-24 06:37:20,846 INFO  ....mesos.scheduler.LaunchCoordinator   - Declined offer 8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus.
2019-10-24 06:37:26,251 INFO  ....mesos.scheduler.TaskMonitor         - Mesos task taskmanager-00001 is running.
2019-10-24 06:37:29,575 INFO  ....MesosResourceManager                - Registering TaskManager with ResourceID taskmanager-00001 (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) at ResourceManager
2019-10-24 06:37:29,693 INFO  ....ExecutionGraph                      - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from SCHEDULED to DEPLOYING.
(...)
(everything going on fine)
(...)
2019-10-24 06:37:30,041 INFO  ....ExecutionGraph             - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state RUNNING to FINISHED.
2019-10-24 06:37:30,076 INFO  ....StandaloneDispatcher       - Job 50934ae259b85a449e9a789f82287305 reached globally terminal state FINISHED.
2019-10-24 06:37:30,205 INFO  ....FsJobArchivist             - Job 50934ae259b85a449e9a789f82287305 has been archived at file:/var/lib/flink/completed_jobs/50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:30,206 INFO  ....JobMaster                  - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019(50934ae259b85a449e9a789f82287305).
2019-10-24 06:37:30,226 INFO  ....SlotPoolImpl               - Suspending SlotPool.
2019-10-24 06:37:30,227 INFO  ....JobMaster                  - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:37:30,227 INFO  ....SlotPoolImpl               - Stopping SlotPool.
2019-10-24 06:37:30,232 INFO  ....MesosResourceManager  - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305 from the resource manager.
2019-10-24 06:37:30,232 INFO  ....jobmaster.JobManagerRunner  - JobManagerRunner already shutdown.

Corresponding logs in Mesos for the same period:

I1024 06:37:19.475060    23 hierarchical.cpp:1387] Unsuppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.475097    23 hierarchical.cpp:1409] Revived roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.476991    19 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O3, 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.845321    19 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
(...)
I1024 06:37:20.856938    23 master.cpp:4590] Processing ACCEPT call for offers: [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.860610    20 master.cpp:4151] Adding task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:20.861378    20 master.cpp:5569] Launching task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"ports","ranges":{"range":[{"begin":31000,"end":31001}]},"type":"RANGES"}] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) on  new executor
I1024 06:37:22.788302    24 master.cpp:8703] Status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:22.788444    24 master.cpp:8760] Forwarding status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:22.790689    24 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_STARTING, status update state: TASK_STARTING)
(...)
I1024 06:37:26.241962    23 master.cpp:8703] Status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:26.242094    23 master.cpp:8760] Forwarding status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:26.243048    23 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_RUNNING, status update state: TASK_RUNNING)

Shortly afterwards, the idle TaskManager times out and is killed

2019-10-24 06:38:21,649 INFO  ....MesosResourceManager                - Stopping worker taskmanager-00001.
2019-10-24 06:38:21,672 INFO  ....MesosResourceManager                - Closing TaskExecutor connection taskmanager-00001 because: TaskExecutor exceeded the idle timeout.
2019-10-24 06:38:21,760 WARN  ....MesosResourceManager                - Discard registration from TaskExecutor taskmanager-00001 at (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) because the framework did not recognize it
2019-10-24 06:38:26,674 INFO  ....mesos.scheduler.TaskMonitor         - Re-attempting to kill Mesos task taskmanager-00001.
2019-10-24 06:38:26,891 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@192.168.10.14:31001] has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2019-10-24 06:38:27,318 INFO  ....mesos.scheduler.TaskMonitor         - Mesos task taskmanager-00001 exited as planned.
2019-10-24 06:38:27,326 INFO  ....MesosResourceManager                - Worker taskmanager-00001 finished successfully with message: Container exited with status 137

Corresponding logs in Mesos:

I1024 06:38:26.678206    20 master.cpp:6261] Processing KILL call for task 'taskmanager-00001' of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:26.678450    20 master.cpp:6339] Telling agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) to kill task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:27.304287    20 master.cpp:8703] Status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:38:27.304426    20 master.cpp:8760] Forwarding status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
(...)
I1024 06:38:27.319533    19 master.cpp:11770] Removing task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
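
One thing that can be checked mechanically in the Mesos master log is whether the framework's offers are currently suppressed. Below is a minimal sketch, assuming the log line formats shown in the excerpts above; the function name `suppression_state` is hypothetical, and the sample lines are copied from those excerpts (with the scheduler address shortened). It only tracks the last SUPPRESS or "Unsuppressed offers" line seen per framework, so treat it as a rough aid and not a full model of the Mesos allocator.

```python
import re

# Patterns based on the Mesos master log lines quoted in this question.
SUPPRESS_RE = re.compile(r"Processing SUPPRESS call for framework (\S+)")
UNSUPPRESS_RE = re.compile(r"Unsuppressed offers for roles .* of framework (\S+)")


def suppression_state(master_log_lines):
    """Map framework id -> True if the last matching line suppressed its offers."""
    state = {}
    for line in master_log_lines:
        match = SUPPRESS_RE.search(line)
        if match:
            state[match.group(1)] = True
            continue
        match = UNSUPPRESS_RE.search(line)
        if match:
            state[match.group(1)] = False
    return state


sample = [
    "I1024 06:36:52.185779    20 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-...",
    "I1024 06:37:19.475060    23 hierarchical.cpp:1387] Unsuppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000",
    "I1024 06:37:20.845321    19 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-...",
]
print(suppression_state(sample))
```

In the excerpts up to this point, the last suppression-related line is the SUPPRESS call at 06:37:20, so the sketch reports the Flink framework as suppressed.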

Second job submission to the App Master (the one that hangs and fails after the timeout without any processing)

2019-10-24 06:39:25,220 INFO  ....StandaloneDispatcher    - Received JobGraph submission 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,220 INFO  ....StandaloneDispatcher    - Submitting job 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,221 INFO  ....AkkaRpcService          - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_1 .
2019-10-24 06:39:25,225 INFO  ....JobMaster               - Initializing job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,228 INFO  ....JobMaster               - Using restart strategy NoRestartStrategy for Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,230 INFO  ....ExecutionGraph          - Job recovers via failover strategy: New Pipelined Region Failover
2019-10-24 06:39:25,231 INFO  ....JobMaster               - Running initialization on master for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,232 INFO  ....JobMaster               - Successfully ran initialization on master in 1 ms.
2019-10-24 06:39:25,234 INFO  ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Start building failover regions.
2019-10-24 06:39:25,234 INFO  ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Created 1 failover regions.
2019-10-24 06:39:25,235 INFO  ....jobmaster.JobManagerRunner           - JobManager runner for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1.
2019-10-24 06:39:25,236 INFO  ....JobMaster               - Starting execution of job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) under job master id 00000000000000000000000000000000.
2019-10-24 06:39:25,236 INFO  ....ExecutionGraph          - Job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) switched from state CREATED to RUNNING.
2019-10-24 06:39:25,236 INFO  ....ExecutionGraph          - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from CREATED to SCHEDULED.
2019-10-24 06:39:25,237 INFO  ....SlotPoolImpl            - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}]
2019-10-24 06:39:25,238 INFO  ....JobMaster               - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:39:25,239 INFO  ....JobMaster               - Resolved ResourceManager address, beginning registration
2019-10-24 06:39:25,239 INFO  ....JobMaster               - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:39:25,240 INFO  ....MesosResourceManager  - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:39:25,245 INFO  ....JobMaster               - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:39:25,246 INFO  ....SlotPoolImpl            - Requesting new slot [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:39:25,246 INFO  ....MesosResourceManager  - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 4888c165c1a9eae4dd69cd533c66a5c1 with allocation id 04a6f120eed690c9c7cdc8c7c0f67ed7.

At this point the logs stay silent for several minutes. There is no activity in Mesos.

It seems to me that the problem is clearly on the Flink App Master / Resource Manager side since, again, nothing happens in Mesos: there is no activity in the Mesos master logs at that moment (the Mesos console shows nothing, no new task is dispatched or running, and both agents are idle).
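
The difference between the two runs is visible in the App Master log itself: in the first run the "Request slot with profile" line is followed by "Starting a new worker.", in the second run it is not. Below is a minimal sketch of a helper that spots this pattern, assuming the log line formats shown above. The function name `unanswered_slot_requests` is hypothetical, and the sample lines are copied from the excerpts with the ResourceProfile shortened to `{...}`. It is only a heuristic, since a slot request can also be served by an already registered TaskManager without starting a new worker.

```python
import re

# Captures the allocation id at the end of a "Request slot with profile" line.
REQUEST_RE = re.compile(r"Request slot with profile .* with allocation id (\w+)\.")


def unanswered_slot_requests(log_lines):
    """Return allocation ids whose slot request was not followed by
    'Starting a new worker.' before the next request or the end of the log."""
    unanswered = []
    pending = None  # allocation id still waiting for a worker start
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            if pending is not None:
                unanswered.append(pending)
            pending = match.group(1)
        elif "Starting a new worker." in line:
            pending = None
    if pending is not None:
        unanswered.append(pending)
    return unanswered


sample = [
    "2019-10-24 06:37:19,311 INFO  ....MesosResourceManager - Request slot with profile ResourceProfile{...} for job 50934ae259b85a449e9a789f82287305 with allocation id 907586ba9ece54f20c23341276e6bac8.",
    "2019-10-24 06:37:19,313 INFO  ....MesosResourceManager - Starting a new worker.",
    "2019-10-24 06:39:25,246 INFO  ....MesosResourceManager - Request slot with profile ResourceProfile{...} for job 4888c165c1a9eae4dd69cd533c66a5c1 with allocation id 04a6f120eed690c9c7cdc8c7c0f67ed7.",
]
print(unanswered_slot_requests(sample))
```

On these sample lines only the allocation id of the second job is reported.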

Five minutes later the timeout occurs and the job ends in the FAILED state.
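
The length of that wait can be read off the log timestamps. Below is a minimal sketch, assuming the timestamp format used in the Flink logs in this question. The helper name `seconds_between` is hypothetical, and the two timestamps are those of the second slot request and of the "Pending slot request ... timed out" line. As far as I know, 300 seconds corresponds to the default of Flink's `slot.request.timeout` option (300000 ms), but that is worth verifying against the 1.9 configuration documentation.

```python
from datetime import datetime

# Timestamp layout of the Flink log lines, e.g. "2019-10-24 06:39:25,246".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(first_ts, second_ts):
    """Return the number of seconds elapsed between two Flink log timestamps."""
    start = datetime.strptime(first_ts, LOG_TS_FORMAT)
    end = datetime.strptime(second_ts, LOG_TS_FORMAT)
    return (end - start).total_seconds()


requested = "2019-10-24 06:39:25,246"  # "Requesting new slot" for the second job
timed_out = "2019-10-24 06:44:25,296"  # "Pending slot request ... timed out."
print(seconds_between(requested, timed_out))
```

The result is almost exactly 300 seconds, which suggests the job waited for the whole slot request timeout without any worker being requested from Mesos.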

Second job fails after the timeout

2019-10-24 06:44:25,245 INFO  ....ExecutionGraph        - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (org.apache.flink.api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from SCHEDULED to FAILED.
.... NoResourceAvailableException: Could not allocate enough slots to run the job. Please make sure that the cluster has enough resources.
    at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:460)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    (stack trace goes on)
2019-10-24 06:44:25,261 INFO  ....executiongraph...AdaptedRestartPipelinedRegionStrategyNG  - Fail to pass the restart strategy validation in region failover. Fallback to fail global.
(...)
(a few more stack traces)
(...)
2019-10-24 06:44:25,296 INFO  ....ExecutionGraph                   - Discarding the results produced by task execution d7d19ea639a4adf9246f905f1765b338.
2019-10-24 06:44:25,296 INFO  ....SlotPoolImpl                     - Pending slot request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] timed out.
2019-10-24 06:44:25,301 INFO  ....dispatcher.StandaloneDispatcher  - Job 4888c165c1a9eae4dd69cd533c66a5c1 reached globally terminal state FAILED.
2019-10-24 06:44:25,340 INFO  ....history.FsJobArchivist           - Job 4888c165c1a9eae4dd69cd533c66a5c1 has been archived at file:/var/lib/flink/completed_jobs/4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:44:25,341 INFO  ....JobMaster                        - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019(4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:44:25,344 INFO  ....SlotPoolImpl                     - Suspending SlotPool.
2019-10-24 06:44:25,344 INFO  ....JobMaster                        - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:44:25,344 INFO  ....SlotPoolImpl                     - Stopping SlotPool.
2019-10-24 06:44:25,344 INFO  ....jobmaster.JobManagerRunner       - JobManagerRunner already shutdown.
2019-10-24 06:44:25,344 INFO  ....mesos....MesosResourceManager    - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1 from the resource manager.

For the record, the job I submit is the simple batch Hello World from http://diegoreico.com/environments/runningflinkgclusterwithzeppelin/

UPDATE 2019-10-24 - 13:00

The following log extracts at DEBUG level are available:

Flink App Master startup: http://www.niceideas.ch/flink/1_flink_app_master_startup.txt
First job execution (successful): https://www.niceideas.ch/flink/2_first_job_execution_successful.txt
Idle timeout kill of the TaskManager: https://www.niceideas.ch/flink/3_task_manager_idle_kill.txt
Second job execution (failed): https://www.niceideas.ch/flink/4_second_job_execution_failed.txt

UPDATE 2019-10-24 - 17:30

The TaskManager (Flink worker) logs are available here: https://www.niceideas.ch/flink/5_task_manger_logs.txt

Apache Flink Resource Manager (App Master) fails to allocate new Task Managers after the first one has been killed after an idle timeout
