I am using Azure HDInsight. My Oozie workflows were written to use MapReduce, and they ran fine for a long time. Recently, however, the jobs started failing with the log below:
WARN HiveActionExecutor:523 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@cmd-first-run-preparation] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.HiveMain], main() threw exception, org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1577348158861_0003 failed 2 times (global limit =5; local limit is =2) due to AM Container for appattempt_1577348158861_0003_000002 exited with exitCode: 1
For more detailed output, check the application tracking page: http://hn0-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net:8088/cluster/app/application_1577348158861_0003 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e01_1577348158861_0003_02_000001
Exit code: 1
Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Failing this attempt. Failing the application.
2019-12-26 08:47:41,519 WARN HiveActionExecutor:523 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@cmd-first-run-preparation] Launcher exception: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1577348158861_0003 failed 2 times (global limit =5; local limit is =2) due to AM Container for appattempt_1577348158861_0003_000002 exited with exitCode: 1
For more detailed output, check the application tracking page: http://hn0-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net:8088/cluster/app/application_1577348158861_0003 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e01_1577348158861_0003_02_000001
Exit code: 1
Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Failing this attempt. Failing the application.
java.lang.RuntimeException: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1577348158861_0003 failed 2 times (global limit =5; local limit is =2) due to AM Container for appattempt_1577348158861_0003_000002 exited with exitCode: 1
For more detailed output, check the application tracking page: http://hn0-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net:8088/cluster/app/application_1577348158861_0003 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e01_1577348158861_0003_02_000001
Exit code: 1
Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Failing this attempt. Failing the application.
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:582)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at org.apache.oozie.action.hadoop.HiveMain.runHive(HiveMain.java:310)
at org.apache.oozie.action.hadoop.HiveMain.run(HiveMain.java:287)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:75)
at org.apache.oozie.action.hadoop.HiveMain.main(HiveMain.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:231)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1577348158861_0003 failed 2 times (global limit =5; local limit is =2) due to AM Container for appattempt_1577348158861_0003_000002 exited with exitCode: 1
For more detailed output, check the application tracking page: http://hn0-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net:8088/cluster/app/application_1577348158861_0003 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e01_1577348158861_0003_02_000001
Exit code: 1
Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Failing this attempt. Failing the application.
at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:699)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:218)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:116)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:579)
... 19 more
2019-12-26 08:47:41,600 INFO HiveActionExecutor:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@cmd-first-run-preparation] Action ended with external status [FAILED/KILLED]
2019-12-26 08:47:41,677 INFO ActionEndXCommand:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@cmd-first-run-preparation] ERROR is considered as FAILED for SLA
2019-12-26 08:47:41,906 INFO ActionStartXCommand:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] Start action [0000000-191226081559399-oozie-oozi-W@general-fail] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2019-12-26 08:47:41,906 INFO KillActionExecutor:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] Starting action
2019-12-26 08:47:41,906 INFO ActionStartXCommand:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] [***0000000-191226081559399-oozie-oozi-W@general-fail***]Action status=DONE
2019-12-26 08:47:41,907 INFO ActionStartXCommand:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] [***0000000-191226081559399-oozie-oozi-W@general-fail***]Action updated in DB!
2019-12-26 08:47:41,970 INFO KillActionExecutor:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] Action ended with external status [OK]
A snippet from my Oozie workflow is shown below:
<?xml version="1.0" encoding="utf-8"?>
<!--This is a dynamically generated file, do not edit directly.-->
<workflow-app name="cmd-etl-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="cmd-first-run-preparation" />
  <kill name="general-fail">
    <message>Workflow general failure. Error message: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <action name="rollback-fail">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.compress.map.output</name>
          <value>true</value>
        </property>
        <property>
          <name>oozie.launcher.mapred.job.queue.name</name>
          <value>joblauncher</value>
        </property>
        <property>
          <name>mapred.job.queue.name</name>
          <value>default</value>
        </property>
      </configuration>
      <script>hive/cmd-rollback-after-failure.hql</script>
      <param>environmentKey=prod</param>
      <param>azureStorageAccount=psclasprodlinux</param>
    </hive>
    <ok to="general-fail" />
    <error to="general-fail" />
  </action>
  <action name="cmd-first-run-preparation">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.compress.map.output</name>
          <value>true</value>
        </property>
        <property>
          <name>oozie.launcher.mapred.job.queue.name</name>
          <value>joblauncher</value>
        </property>
        <property>
          <name>mapred.job.queue.name</name>
          <value>default</value>
        </property>
      </configuration>
      <script>hive/cmd-first-run-preparation.hql</script>
      <param>environmentKey=prod</param>
      <param>azureStorageAccount=prodlinux</param>
    </hive>
    <ok to="cmd-roll-shared-tables" />
    <error to="general-fail" />
  </action>
The Hive script starts as shown below:
SET hivevar:tablePrefix=current;
SET hivevar:previousTablePrefix=previous;
CREATE DATABASE IF NOT EXISTS ${environmentKey}_env
COMMENT 'Database for CMD ETL environment ${environmentKey}'
LOCATION 'wasbs://hadoop@${azureStorageAccount}.blob.core.windows.net/hive/warehouse/${environmentKey}_env';
USE ${environmentKey}_env;
--<<Regular Hive script follows>>
Since the error points to the Tez session, I changed Hive's default execution engine from Tez to MapReduce in the Ambari portal and reran the jobs. The jobs then took longer to run and still ended with an error.
Since the jobs were working until the last few weeks, I suspect the problem is caused by a version change in one or more components of the HDInsight cluster. Please suggest how to resolve this.
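For reference, the engine can also be selected per script instead of cluster-wide through Ambari. The lines below are only a sketch of that alternative, not something I have confirmed fixes the failure; they assume the cluster allows `hive.execution.engine` to be overridden at session level, and they would go at the top of the HQL script before the existing `SET hivevar:` lines.

```sql
-- Assumption: the cluster permits overriding this property per session.
-- Forces this script to run on MapReduce instead of Tez.
SET hive.execution.engine=mr;

-- Existing script header continues unchanged from here.
SET hivevar:tablePrefix=current;
SET hivevar:previousTablePrefix=previous;
```

Scoping the override to one script would leave other Hive workloads on the cluster running on Tez.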