Apache Hadoop hiveserver2 often exits in a Hadoop cluster
02 August 2020

We have a Hadoop cluster (v2.9.2, 100 nodes, Ubuntu 18) running alongside a HiveServer2 cluster (v2.3.3, 10 nodes, Ubuntu 18), and lately we have noticed that every once in a while the Hive service exits on its own. I do not know when this started, or whether there was ever a time when these crashes did not happen, because our systems are set up to run Chef every half hour, and Chef takes care of starting the service again.

Error from systemctl:

```
● hive-server2.service - Apache Hadoop hiveserver2
   Loaded: loaded (/lib/systemd/system/hive-server2.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2020-08-02 06:39:34 EDT; 1h 14min ago
  Process: 68691 ExecStart=/opt/hive/apache-hive-2.3.3-bin/bin/start-hiveserver2.sh (code=exited, status=127)
 Main PID: 68691 (code=exited, status=127)

Aug 02 06:39:33 hive13 java[68691]: pam_unix(sshd:auth): authentication failure; logname= uid=70003 euid=70003 tty= ruser= rhost=  user=hdp13
Aug 02 06:39:33 hive13 java[68691]: pam_unix(login:auth): authentication failure; logname= uid=70003 euid=70003 tty= ruser= rhost=  user=hdp13
Aug 02 06:39:33 hive13 start-hiveserver2.sh[68691]: OK
Aug 02 06:39:33 hive13 java[68691]: pam_unix(sshd:auth): authentication failure; logname= uid=70003 euid=70003 tty= ruser= rhost=  user=hdp13
Aug 02 06:39:33 hive13 java[68691]: pam_unix(login:auth): authentication failure; logname= uid=70003 euid=70003 tty= ruser= rhost=  user=hdp13
Aug 02 06:39:33 hive13 start-hiveserver2.sh[68691]: OK
Aug 02 06:39:33 hive13 java[68691]: pam_unix(sshd:auth): authentication failure; logname= uid=70003 euid=70003 tty= ruser= rhost=  user=hdp13
Aug 02 06:39:33 hive13 start-hiveserver2.sh[68691]: Inconsistency detected by ld.so: ../elf/dl-tls.c: 481: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].ge
Aug 02 06:39:34 hive13 systemd[1]: hive-server2.service: Main process exited, code=exited, status=127/n/a
Aug 02 06:39:34 hive13 systemd[1]: hive-server2.service: Failed with result 'exit-code'.
```

And in /var/log/syslog:
```
<30>Aug  2 06:39:33 hive13 start-hiveserver2.sh[68691]: Inconsistency detected by ld.so: ../elf/dl-tls.c: 481: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
<29>Aug  2 06:39:34 hive13 systemd[1]: hive-server2.service: Main process exited, code=exited, status=127/n/a
<28>Aug  2 06:39:34 hive13 systemd[1]: hive-server2.service: Failed with result 'exit-code'.
<30>Aug  2 06:41:27 hive13 systemd[1]: Starting Cleanup of Temporary Directories...
<30>Aug  2 06:41:27 hive13 systemd[1]: Starting Daily apt upgrade and clean activities...
<30>Aug  2 06:41:27 hive13 systemd[1]: Started Cleanup of Temporary Directories.
<30>Aug  2 06:41:28 hive13 systemd[1]: Started Daily apt upgrade and clean activities.
<30>Aug  2 06:50:21 hive13 start-metastore.sh[139225]: 2020-08-02T06:50:21.905-0400: 1015612.169: [GC (Allocation Failure) 2020-08-02T06:50:21.905-0400: 1015612.169: [ParNew: 1245408K->10744K(1380160K), 0.0546688 secs] 1369822K->135643K(25012480K), 0.0549767 secs] [Times: user=0.81 sys=0.00, real=0.06 secs]
<30>Aug  2 06:52:35 hive13 start-metastore.sh[139225]: 2020-08-02T06:52:35.896-0400: 1015746.160: [GC (Allocation Failure) 2020-08-02T06:52:35.896-0400: 1015746.160: [ParNew: 1237560K->4893K(1380160K), 0.0484039 secs] 1362459K->129793K(25012480K), 0.0487111 secs] [Times: user=0.71 sys=0.00, real=0.04 secs]
<26>Aug  2 06:56:11 hive13 smartd[17646]: Device: /dev/sdb [SAT], FAILED SMART self-check. BACK UP DATA NOW!
<26>Aug  2 06:56:11 hive13 smartd[17646]: Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors
<26>Aug  2 06:56:11 hive13 smartd[17646]: Device: /dev/sdb [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
<30>Aug  2 07:03:57 hive13 dbus-daemon[1120]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.24561' (uid=0 pid=3417 comm="/usr/bin/hostnamectl " label="unconfined")
```
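
Since Chef restarts the service every half hour, I cannot easily tell how often it really dies. Something along these lines could be used to count the exits recorded in syslog. This is only a rough sketch: the regular expression and the function names `find_exits` and `exits_per_day` are my own, and it assumes the log lines look like the excerpts above.

```python
import re
from collections import Counter

# Matches the systemd line from the excerpt, with or without the
# "<NN>" priority prefix, e.g.
# <29>Aug  2 06:39:34 hive13 systemd[1]: hive-server2.service: Main process exited, code=exited, status=127/n/a
EXIT_RE = re.compile(
    r"^(?:<\d+>)?(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+systemd\[\d+\]:\s+"
    r"hive-server2\.service: Main process exited, "
    r"code=\w+, status=(?P<status>\S+)"
)


def find_exits(lines):
    """Return (month, day, time, host, status) for each hive-server2 exit line."""
    exits = []
    for line in lines:
        m = EXIT_RE.match(line)
        if m:
            exits.append(
                (m["month"], int(m["day"]), m["time"], m["host"], m["status"])
            )
    return exits


def exits_per_day(exits):
    """Count exits per (month, day), in the order the days first appear."""
    return Counter((month, day) for month, day, *_ in exits)
```

It could be run over a log with `find_exits(open("/var/log/syslog", errors="replace"))`, and over the rotated syslog files as well to look further back.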

Please advise.