Tez session shutdown error when running a Hive query from an Oozie workflow
06 January 2020

I am using Azure HDInsight. My Oozie workflows were written to use MapReduce, and they ran fine for a long time. Recently, however, the jobs started failing with the log below:

WARN HiveActionExecutor:523 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@cmd-first-run-preparation] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.HiveMain], main() threw exception, org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1577348158861_0003 failed 2 times (global limit =5; local limit is =2) due to AM Container for appattempt_1577348158861_0003_000002 exited with  exitCode: 1
For more detailed output, check the application tracking page: http://hn0-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net:8088/cluster/app/application_1577348158861_0003 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e01_1577348158861_0003_02_000001
Exit code: 1

Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :

Failing this attempt. Failing the application.
2019-12-26 08:47:41,519  WARN HiveActionExecutor:523 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@cmd-first-run-preparation] Launcher exception: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1577348158861_0003 failed 2 times (global limit =5; local limit is =2) due to AM Container for appattempt_1577348158861_0003_000002 exited with  exitCode: 1
For more detailed output, check the application tracking page: http://hn0-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net:8088/cluster/app/application_1577348158861_0003 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e01_1577348158861_0003_02_000001
Exit code: 1

Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :

Failing this attempt. Failing the application.
java.lang.RuntimeException: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1577348158861_0003 failed 2 times (global limit =5; local limit is =2) due to AM Container for appattempt_1577348158861_0003_000002 exited with  exitCode: 1
For more detailed output, check the application tracking page: http://hn0-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net:8088/cluster/app/application_1577348158861_0003 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e01_1577348158861_0003_02_000001
Exit code: 1

Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :

Failing this attempt. Failing the application.
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:582)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
    at org.apache.oozie.action.hadoop.HiveMain.runHive(HiveMain.java:310)
    at org.apache.oozie.action.hadoop.HiveMain.run(HiveMain.java:287)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:75)
    at org.apache.oozie.action.hadoop.HiveMain.main(HiveMain.java:65)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:231)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1577348158861_0003 failed 2 times (global limit =5; local limit is =2) due to AM Container for appattempt_1577348158861_0003_000002 exited with  exitCode: 1
For more detailed output, check the application tracking page: http://hn0-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net:8088/cluster/app/application_1577348158861_0003 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e01_1577348158861_0003_02_000001
Exit code: 1

Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :

Failing this attempt. Failing the application.
    at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:699)
    at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:218)
    at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:116)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:579)
    ... 19 more

2019-12-26 08:47:41,600  INFO HiveActionExecutor:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@cmd-first-run-preparation] Action ended with external status [FAILED/KILLED]
2019-12-26 08:47:41,677  INFO ActionEndXCommand:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@cmd-first-run-preparation] ERROR is considered as FAILED for SLA
2019-12-26 08:47:41,906  INFO ActionStartXCommand:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] Start action [0000000-191226081559399-oozie-oozi-W@general-fail] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2019-12-26 08:47:41,906  INFO KillActionExecutor:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] Starting action
2019-12-26 08:47:41,906  INFO ActionStartXCommand:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] [***0000000-191226081559399-oozie-oozi-W@general-fail***]Action status=DONE
2019-12-26 08:47:41,907  INFO ActionStartXCommand:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] [***0000000-191226081559399-oozie-oozi-W@general-fail***]Action updated in DB!
2019-12-26 08:47:41,970  INFO KillActionExecutor:520 - SERVER[hn1-xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx.dx.internal.cloudapp.net] USER[admin] GROUP[-] TOKEN[] APP[cmd-etl-wf] JOB[0000000-191226081559399-oozie-oozi-W] ACTION[0000000-191226081559399-oozie-oozi-W@general-fail] Action ended with external status [OK]

A fragment of my Oozie workflow is shown below:

<?xml version="1.0" encoding="utf-8"?>
<!--This is a dynamically generated file, do not edit directly.-->
<workflow-app name="cmd-etl-wf" xmlns="uri:oozie:workflow:0.2">
 <start to="cmd-first-run-preparation" />
 <kill name="general-fail">
  <message>Workflow general failure. Error message: ${wf:errorMessage(wf:lastErrorNode())}</message>
 </kill>
 <action name="rollback-fail">
  <hive xmlns="uri:oozie:hive-action:0.2">
   <job-tracker>${jobTracker}</job-tracker>
   <name-node>${nameNode}</name-node>
   <configuration>
    <property>
     <name>mapred.compress.map.output</name>
     <value>true</value>
    </property>
    <property>
     <name>oozie.launcher.mapred.job.queue.name</name>
     <value>joblauncher</value>
    </property>
    <property>
     <name>mapred.job.queue.name</name>
     <value>default</value>
    </property>
   </configuration>
   <script>hive/cmd-rollback-after-failure.hql</script>
   <param>environmentKey=prod</param>
   <param>azureStorageAccount=psclasprodlinux</param>
  </hive>
  <ok to="general-fail" />
  <error to="general-fail" />
 </action>
 <action name="cmd-first-run-preparation">
  <hive xmlns="uri:oozie:hive-action:0.2">
   <job-tracker>${jobTracker}</job-tracker>
   <name-node>${nameNode}</name-node>
   <configuration>
    <property>
     <name>mapred.compress.map.output</name>
     <value>true</value>
    </property>
    <property>
     <name>oozie.launcher.mapred.job.queue.name</name>
     <value>joblauncher</value>
    </property>
    <property>
     <name>mapred.job.queue.name</name>
     <value>default</value>
    </property>
   </configuration>
   <script>hive/cmd-first-run-preparation.hql</script>
   <param>environmentKey=prod</param>
   <param>azureStorageAccount=prodlinux</param>
  </hive>
  <ok to="cmd-roll-shared-tables" />
  <error to="general-fail" />
 </action>

The Hive script begins as shown below:

SET hivevar:tablePrefix=current;
SET hivevar:previousTablePrefix=previous;

CREATE DATABASE IF NOT EXISTS ${environmentKey}_env
    COMMENT 'Database for CMD ETL environment ${environmentKey}'
    LOCATION 'wasbs://hadoop@${azureStorageAccount}.blob.core.windows.net/hive/warehouse/${environmentKey}_env';

USE ${environmentKey}_env;

--<<Regular Hive script follows>>

Since the error points to the Tez session, I changed Hive's default execution engine from Tez to MapReduce in the Ambari portal and reran the jobs. The jobs took longer to run and still failed.
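For anyone reproducing this, the engine can also be overridden for a single action instead of cluster-wide in Ambari. The fragment below is a minimal sketch, assuming the Hive action passes its <configuration> properties through to Hive before the session starts. hive.execution.engine is a standard Hive property, but this placement is not part of the original workflow, and I have not verified that it avoids the Tez session failure on this cluster.

```xml
<!-- Hypothetical addition inside the existing <configuration> element
     of a Hive action such as cmd-first-run-preparation -->
<property>
 <name>hive.execution.engine</name>
 <!-- mr selects MapReduce; tez is the value that opens a Tez session -->
 <value>mr</value>
</property>
```

Setting it per action keeps the cluster default untouched, so other workloads that rely on Tez are not affected while the failure is being investigated.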

Since the jobs worked until a few weeks ago, I suspect the problem is caused by version changes in various components of the HDInsight cluster. Please suggest how to resolve this.
