Пакетное задание Flink всегда завершается неудачей из-за AskTimeoutException или Heartbeat TaskManager с тайм-аутом контейнера id - PullRequest
0 голосов
/ 25 января 2020

Я запускаю пакетное задание Flink на YARN. Задание считывает данные из нескольких путей HDFS и объединяет их в один набор данных, и на основе этого набора данных вычисляет groupBy и уменьшает функцию. Но задание не может успешно завершиться sh.

Имеет oop Версия YARN: 2.6.0 Версия Flink: 1.9.1

Отправить оболочку сеанса пряжи:

./bin/yarn-session.sh -jm 20480 -tm 51200 -s 1 -nm 'Batch Job' \
    -Djobmanager.rpc.port=6123 \
    -Djobmanager.web.address=0.0.0.0 \
    -Drest.port=8081 \
    -Dtaskmanager.network.memory.fraction=0.2 \
    -Dparallelism.default=6 \
    -Dhigh-availability=zookeeper \
    -Dhigh-availability.zookeeper.quorum=hadoop-infra-1.com:2181,hadoop-infra-2.com:2181,hadoop-infra-0.com:2181 \
    -Dhigh-availability.zookeeper.path.root=/flink-1.9.1-yarn \
    -Dhigh-availability.storageDir=hdfs://hadoop-infra-0.com:8020/flink/flink-1.9.1/ha/ \
    -Dstate.backend=rocksdb \
    -Dstate.backend.incremental=true \
    -Dstate.checkpoints.dir=hdfs://hadoop-infra-0.com:8020/flink/flink-1.9.1/flink-checkpoints \
    -Dstate.savepoints.dir=hdfs://hadoop-infra-0.com:8020/flink/flink-1.9.1/flink-savepoints \
    -Djobmanager.web.submit.enable=true \
    -Djobmanager.archive.fs.dir=hdfs://hadoop-infra-0.com:8020/flink/flink-1.9.1/completed-jobs/ \
    -Dhistoryserver.archive.fs.dir=hdfs://hadoop-infra-0.com:8020/flink/flink-1.9.1/completed-jobs/ \
    -Dhistoryserver.archive.fs.refresh-interval=20000 \
    -Dyarn.application-attempts=1000 \
    -Dfs.hdfs.hadoopconf=/etc/hadoop/conf \
    -Dakka.lookup.timeout='100 s' \
    -Dakka.ask.timeout='1000 s' \
    -Dweb.timeout=10000 \
    -Dtaskmanager.registration.timeout='10 min' \
    -Dmetrics.reporters=flink_jmx_reporter,flink_prometheus_ops_reporter \
    -Dmetrics.scope.delimiter=: \
    -Dmetrics.reporter.flink_jmx_reporter.class=org.apache.flink.metrics.jmx.JMXReporter \
    -Dmetrics.reporter.flink_jmx_reporter.port=9020-9040 \
    -Dmetrics.reporter.flink_prometheus_ops_reporter.class=org.apache.flink.metrics.prometheus.PrometheusReporter \
    -Dmetrics.reporter.flink_prometheus_ops_reporter.port=9150-9200

Код задания пакетной обработки прилагается здесь:

 public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(128);
        List<DataSet<String>> sources = new ArrayList<DataSet<String>>();
        for(int i = 0; i < 100; i++){
             DataSet<String> inhouseDateSet = env.readTextFile("hdfs://namenode/path" + i);
             sources.add(inhouseDateSet);
        }
        DataSet<String> dataSet = sources.get(0);
        for(int i = 1; i < 100; i++){
             dataSet = dataSet.union(sources.get(i));
        }

        GroupReduceOperator<User, User> userResult = dataSet
                .flatMap(new EventMapOperator())
                .groupBy(1)
                .reduceGroup(new SelfGroupReducer());

        userResult.map(User::toString).writeAsText("hdfs://namenode/out_path");
        env.execute();
    }

Однако задание всегда выдает исключение:

java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_e05_1578886899625_23144_01_000159 timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
    at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Или выбрасывает это исключение:

java.lang.IllegalStateException: Update to task [CHAIN FlatMap (FlatMap at main(RcmdPerformanceETLBatchJob.java:35)) -> Map (Key Extractor) (111/128) - execution #4] on TaskManager container_e05_1578886899625_18703_01_000094 @ hadoop-data-234.recommend.shopeemobile.com (dataPort=34531) failed
    at org.apache.flink.runtime.executiongraph.Execution.lambda$sendUpdatePartitionInfoRpcCall$14(Execution.java:1395)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@hadoop-data-234.recommend.shopeemobile.com:29851/user/taskmanager_0#1997707422]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
    at akka.dispatch.OnComplete.internal(Future.scala:263)
    at akka.dispatch.OnComplete.internal(Future.scala:261)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
    at java.lang.Thread.run(Thread.java:748)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@hadoop-data-234.recommend.shopeemobile.com:29851/user/taskmanager_0#1997707422]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
    at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
    ... 9 more
...