Intermittent loss of Kubernetes cluster
0 votes
/ December 27, 2018

I have been trying to diagnose a problem that started a few days ago.

I am running kubelet and kubeadm version 1.13.1. The cluster has 5 nodes and had been working for several months until the end of last week. It runs on RHEL 7.x with plenty of free resources.

A strange problem has appeared in which the cluster resources (API, scheduler, etc.) become unavailable. It eventually corrects itself, and the cluster comes back for a while.

If I run sudo systemctl restart kubelet, everything in the cluster works fine again until the intermittent weirdness recurs.

I have been watching the journalctl logs to see what happens when this occurs, and the excerpt that stands out is:

Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.762004 I | etcdserver: skipped leadership transfer for single member cluster
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763648       1 reflector.go:270] k8s.io/client-go/informers/factory.go:132: watch of *v1beta1.Event ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.762788       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.762910       1 reflector.go:270] storage/cacher.go:/podsecuritypolicy: watch of *policy.PodSecurityPolicy ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763149       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763232       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763439       1 reflector.go:270] storage/cacher.go:/apiregistration.k8s.io/apiservices: watch of *apiregistration.APIService ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763719       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763786       1 reflector.go:270] storage/cacher.go:/daemonsets: watch of *apps.DaemonSet ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763937       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764016       1 reflector.go:270] storage/cacher.go:/cronjobs: watch of *batch.CronJob ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.764250       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.764324       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764386       1 reflector.go:270] storage/cacher.go:/services/endpoints: watch of *core.Endpoints ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764440       1 reflector.go:270] storage/cacher.go:/deployments: watch of *apps.Deployment ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: WARNING: 2018/12/26 21:28:06 grpc: addrConn.transportMonitor exits due to: context canceled
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.765201 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.765384 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")

...

Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.784805       1 reflector.go:270] storage/cacher.go:/controllerrevisions: watch of *apps.ControllerRevision ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.784871       1 reflector.go:270] storage/cacher.go:/pods: watch of *core.Pod ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.786587       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.786700       1 reflector.go:270] storage/cacher.go:/horizontalpodautoscalers: watch of *autoscaling.HorizontalPodAutoscaler ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.788274       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.788385       1 reflector.go:270] storage/cacher.go:/crd.projectcalico.org/clusterinformations: watch of *unstructured.Unstructured ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain oci-systemd-hook[9353]: systemdhook <debug>: 02cb55687848: Skipping as container command is etcd, not init or systemd
Dec 26 15:28:06 thalia0.domain oci-umount[9355]: umounthook <debug>: 02cb55687848: only runs in prestart stage, ignoring
Dec 26 15:28:07 thalia0.domain dockerd-current[1609]: time="2018-12-26T15:28:07.003175741-06:00" level=warning msg="02cb556878485b24e4705dd0efe1051c02f3e3bbbe7b8a7ab23ea71bd6d82b2f cleanup: failed to unmount secrets: invalid argument"
Dec 26 15:28:07 thalia0.domain kubelet[24604]: E1226 15:28:07.006714   24604 pod_workers.go:190] Error syncing pod 0264932236d6afef396f466fc3bd3181 ("etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=etcd pod=etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"
Dec 26 15:28:07 thalia0.domain kubelet[24604]: E1226 15:28:07.040361   24604 pod_workers.go:190] Error syncing pod 0264932236d6afef396f466fc3bd3181 ("etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=etcd pod=etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"
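
When an excerpt like the one above is this noisy, it can help to collapse the repeated reflector warnings into a count per resource kind. The following is only a minimal sketch: the function name count_ended_watches and the SAMPLE list are made up for illustration, and the pattern assumes the "watch of *Kind ended with" wording seen in these particular log lines.

```python
import re
from collections import Counter

# Matches the reflector warnings from the excerpt, e.g.
# "... watch of *core.Pod ended with: Internal error occurred ..."
WATCH_ENDED = re.compile(r"watch of \*(?P<kind>[\w.]+) ended with")


def count_ended_watches(lines):
    """Return a Counter mapping resource kind -> number of ended watches."""
    counts = Counter()
    for line in lines:
        match = WATCH_ENDED.search(line)
        if match:
            counts[match.group("kind")] += 1
    return counts


# Two message bodies copied from the excerpt (journal prefix omitted).
SAMPLE = [
    "W1226 21:28:06.784871       1 reflector.go:270] storage/cacher.go:/pods: "
    "watch of *core.Pod ended with: Internal error occurred: rpc error: "
    "code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL",
    "E1226 21:28:06.786587       1 watcher.go:208] watch chan error: rpc error: "
    "code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL",
]

for kind, n in count_ended_watches(SAMPLE).most_common():
    print(n, kind)
```

If every watched kind shows up at the same timestamp, as it does here, that points at the shared backend (etcd) going away rather than at any single resource.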

To reduce the noise in the logs, I disabled the other nodes.

As noted, if I restart the kubelet service, everything is fine for a while, and then the intermittent behavior returns.

Any suggestions would be welcome. I am working with our system administrator, and he said it looks like etcd is restarting frequently. I think the trouble starts when the CrashLoopBackOff begins.
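
To check the "etcd is restarting frequently" theory against the journal, one option is to pull the container, pod, and back-off delay out of the kubelet CrashLoopBackOff messages. This is a rough sketch, not a definitive tool: parse_backoff and delay_seconds are hypothetical helper names, and the regular expression assumes the exact message wording from the kubelet lines quoted in the question.

```python
import re

# Pattern based on the kubelet message in the question:
# CrashLoopBackOff: "Back-off 2m40s restarting failed container=etcd pod=..."
BACKOFF = re.compile(
    r'CrashLoopBackOff: "Back-off (?P<delay>\S+) restarting failed '
    r'container=(?P<container>\S+) pod=(?P<pod>[^"\s]+)"'
)

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def delay_seconds(text):
    """Convert a duration such as '2m40s' to whole seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([hms])", text))


def parse_backoff(line):
    """Return container, pod and delay for a CrashLoopBackOff line, else None."""
    match = BACKOFF.search(line)
    if match is None:
        return None
    return {
        "container": match.group("container"),
        "pod": match.group("pod"),
        "delay_s": delay_seconds(match.group("delay")),
    }


# Message body copied from the question (journal prefix omitted).
SAMPLE = (
    'E1226 15:28:07.006714   24604 pod_workers.go:190] Error syncing pod '
    '0264932236d6afef396f466fc3bd3181 '
    '("etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"), '
    'skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: '
    '"Back-off 2m40s restarting failed container=etcd '
    'pod=etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"'
)

print(parse_backoff(SAMPLE))
```

The kubelet back-off starts at 10s and doubles up to a 5 minute cap, so a delay of 2m40s (160 seconds) suggests the etcd container had already failed several times in a row by the time this line was logged.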

1 Answer

0 votes
/ December 27, 2018

This is actually a RHEL / Docker bug. See Bug 1655214 - docker exec does not work with registry.access.redhat.com/rhel7:7.3

We applied the fix, and so far everything seems stable.

...