Я пытался диагностировать проблему, которая началась несколько дней назад.
Запуск kubelet
, kubeadm
версия 1.13.1.Кластер состоит из 5 узлов и работал в течение нескольких месяцев до конца прошлой недели.Выполнение этого на RHEL 7.x с достаточными свободными ресурсами.
Возникла странная проблема, из-за которой ресурсы кластера (api, планировщик и т. Д.) Становятся недоступными.Это в конечном счете исправляет себя, и кластер возвращается на некоторое время снова.
Если я делаю sudo systemctl restart kubelet
, то все в кластере снова работает нормально, пока не возникает прерывистая странность.
Я наблюдаюjournactl
регистрирует, чтобы увидеть, что происходит, когда это происходит, и выделяющийся фрагмент:
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.762004 I | etcdserver: skipped leadership transfer for single member cluster
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763648 1 reflector.go:270] k8s.io/client-go/informers/factory.go:132: watch of *v1beta1.Event ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.762788 1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.762910 1 reflector.go:270] storage/cacher.go:/podsecuritypolicy: watch of *policy.PodSecurityPolicy ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763149 1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763232 1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763439 1 reflector.go:270] storage/cacher.go:/apiregistration.k8s.io/apiservices: watch of *apiregistration.APIService ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763719 1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763786 1 reflector.go:270] storage/cacher.go:/daemonsets: watch of *apps.DaemonSet ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763937 1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764016 1 reflector.go:270] storage/cacher.go:/cronjobs: watch of *batch.CronJob ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.764250 1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.764324 1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764386 1 reflector.go:270] storage/cacher.go:/services/endpoints: watch of *core.Endpoints ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764440 1 reflector.go:270] storage/cacher.go:/deployments: watch of *apps.Deployment ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: WARNING: 2018/12/26 21:28:06 grpc: addrConn.transportMonitor exits due to: context canceled
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.765201 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.765384 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
...
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.784805 1 reflector.go:270] storage/cacher.go:/controllerrevisions: watch of *apps.ControllerRevision ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.784871 1 reflector.go:270] storage/cacher.go:/pods: watch of *core.Pod ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.786587 1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.786700 1 reflector.go:270] storage/cacher.go:/horizontalpodautoscalers: watch of *autoscaling.HorizontalPodAutoscaler ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.788274 1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.788385 1 reflector.go:270] storage/cacher.go:/crd.projectcalico.org/clusterinformations: watch of *unstructured.Unstructured ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain oci-systemd-hook[9353]: systemdhook <debug>: 02cb55687848: Skipping as container command is etcd, not init or systemd
Dec 26 15:28:06 thalia0.domain oci-umount[9355]: umounthook <debug>: 02cb55687848: only runs in prestart stage, ignoring
Dec 26 15:28:07 thalia0.domain dockerd-current[1609]: time="2018-12-26T15:28:07.003175741-06:00" level=warning msg="02cb556878485b24e4705dd0efe1051c02f3e3bbbe7b8a7ab23ea71bd6d82b2f cleanup: failed to unmount secrets: invalid argument"
Dec 26 15:28:07 thalia0.domain kubelet[24604]: E1226 15:28:07.006714 24604 pod_workers.go:190] Error syncing pod 0264932236d6afef396f466fc3bd3181 ("etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=etcd pod=etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"
Dec 26 15:28:07 thalia0.domain kubelet[24604]: E1226 15:28:07.040361 24604 pod_workers.go:190] Error syncing pod 0264932236d6afef396f466fc3bd3181 ("etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=etcd pod=etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"
Для того, чтобы сократитьшум в журналах, я отключил другие узлы.
Как уже отмечалось, если я делаю перезапуск службы kubelet
, на некоторое время все в порядке, а затем происходит прерывистое поведение.
Любые предложения будут приветствоваться.Я работаю с нашим системным администратором, и он сказал, что, похоже, etcd
делает частые перезапуски.Я думаю, что неприятности начинаются, когда CrashLoopBackOff
начинает происходить.