Flink 1.9: автономный кластер: не удалось передать файл с идентификатором TaskExecutor: `AskTimeoutException` - PullRequest
0 голосов
/ 05 февраля 2020

Фон

Пытается создать Apache Автономный кластер Flink.

Среда: AWS
Задание Диспетчер: 1
Диспетчер задач: 2
Конфиг:

FLINK_PLUGINS_DIR                       :   /usr/local/flink-1.9.1/plugins
io.tmp.dirs                             :   /tmp/flink
jobmanager.execution.failover-strategy  :   region
jobmanager.heap.size                    :   1024m
jobmanager.rpc.address                  :   job manager ip
jobmanager.rpc.port                     :   6123
jobstore.cache-size                     :   52428800
jobstore.expiration-time                :   3600
parallelism.default                     :   4
slot.idle.timeout                       :   50000
slot.request.timeout                    :   300000
task.cancellation.interval              :   30000
task.cancellation.timeout               :   180000
task.cancellation.timers.timeout        :   7500
taskmanager.exit-on-fatal-akka-error    :   false
taskmanager.heap.size                   :   1024m
taskmanager.network.bind-policy         :   "ip"
taskmanager.numberOfTaskSlots           :   2
taskmanager.registration.initial-backoff:   500ms
taskmanager.registration.timeout        :   5min
taskmanager.rpc.port                    :   50100-50200
web.tmpdir                              :   /tmp/flink-web-74cce811-17c0-411e-9d11-6d91edd2e9b0

Тип экземпляра: t2 medium (2 ЦП, 4 ГБ памяти)
Открыты порты группы безопасности: 6123, 8081, 50100 - 50200

ОС: CentOS Linux выпуск 7.6.1810 (ядро)

Java:

openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode) 
  • Кластер работает и работает правильно
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://ip:8081 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http:/ip:8081.
org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@ip:6123/user/resourcemanager was granted leadership with fencing token 00000000000000000000000000000000
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager.
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@ip:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering TaskManager with ResourceID f2c7f664378b40ce44463713ae98e1c4 (akka.tcp://flink@TaskManager1Ip:38566/user/taskmanager_0) at ResourceManager
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering TaskManager with ResourceID 354a785f637751fb3b034618a47480ed (akka.tcp://flink@TaskManager2Ip:34400/user/taskmanager_0) at ResourceManager
  • Интерфейс пользователя отображает все детали кластера enter image description here

enter image description here

Проблема

Отправка задачи не работает

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/resourcemanager#-1545644127]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn'
t send a reply.
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
        at akka.dispatch.OnComplete.internal(Future.scala:263)
        at akka.dispatch.OnComplete.internal(Future.scala:261)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
        at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
        at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
        at java.lang.Thread.run(Thread.java:748)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/resourcemanager#-1545644127]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
        at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
        ... 9 more
2020-02-04 23:25:16,125 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerLogFileHandler  - Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/resourcemanager#-1545644127]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
        at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
        at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
        at java.lang.Thread.run(Thread.java:748)

Может Кто-нибудь пролить свет на это? Это проблема, связанная с портами / брандмауэром, или какая-то настройка не работает?

1 Ответ

0 голосов
/ 20 февраля 2020

Проблема была с разрешениями порта группы безопасности. Когда открыли весь диапазон от 0 - 65565, все заработало. Тем не менее, этого недостаточно для производственной системы, поэтому в конечном итоге для рабочих в конфигурационном файле flink-conf.yaml ключу taskmanager.data.port был назначен определенный порт, и это помогло. Таким образом, диспетчеры задач могут быть настроены на прослушивание определенного порта в пределах диапазона.

...