I followed the steps described at https://docs.aws.amazon.com/eks/latest/userguide/prometheus.html, modified the values.yaml file, and ran
kubectl --namespace=prometheus port-forward deploy/prometheus-server 9090
The original values.yaml was changed as follows:
scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - localhost:9090
          - "service-A.kube-system.svc.cluster.local:8080"
        labels:
          - service-A
Everything worked fine until last night, but this morning the pod went into CrashLoopBackOff.
Looking at the logs, I found
level=error ts=2020-08-04T16:08:38.367Z caller=main.go:758 err="error loading config from \"/etc/config/prometheus.yml\": couldn't load configuration (--config.file=\"/etc/config/prometheus.yml\"): parsing YAML file /etc/config/prometheus.yml: yaml: unmarshal errors:\n line 16: cannot unmarshal !!seq into model.LabelSet"
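For comparison, the Prometheus configuration reference describes `labels` under a `static_configs` entry as a mapping of label names to values, which appears to be what `model.LabelSet` in the error refers to. A minimal sketch of that shape follows; the label name `service` is hypothetical (it is not in my original config), and I have not confirmed that this is the only problem in the file:

```yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
        # labels is a mapping (name: value), not a list;
        # "service" is a hypothetical label name chosen for illustration
        labels:
          service: service-A
```

Note that the line number in the error refers to the rendered /etc/config/prometheus.yml inside the pod, not to values.yaml.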
The pod description is as follows:
Name:           prometheus-server-7965f5cbcb-h9jn6
Namespace:      prometheus
Priority:       0
Node:           ip-10-116-170-224.us-west-2.compute.internal/10.116.170.224
Start Time:     Tue, 04 Aug 2020 04:34:43 -0700
Labels:         app=prometheus
                chart=prometheus-11.11.1
                component=server
                heritage=Helm
                pod-template-hash=7965f5cbcb
                release=prometheus
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Running
IP:             10.116.170.227
Controlled By:  ReplicaSet/prometheus-server-7965f5cbcb
Containers:
  prometheus-server-configmap-reload:
    Container ID:  docker://f9067bee2632a05a5040a3c18ee4bd683756a633adeaf6af4cbfd4c1e7868257
    Image:         jimmidyson/configmap-reload:v0.3.0
    Image ID:      docker-pullable://jimmidyson/configmap-reload@sha256:d107c7a235c266273b1c3502a391fec374430e5625539403d0de797fa9c556a2
    Port:          <none>
    Host Port:     <none>
    Args:
      --volume-dir=/etc/config
      --webhook-url=http://127.0.0.1:9090/-/reload
    State:          Running
      Started:      Tue, 04 Aug 2020 04:35:04 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/config from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-fq9gj (ro)
  prometheus-server:
    Container ID:  docker://f022ba0bb23fc03c99fb0aac638506eebdb03d5e9af48df23fef053b66440a83
    Image:         prom/prometheus:v2.19.2
    Image ID:      docker-pullable://prom/prometheus@sha256:cd134bd4fca0f60ff8b4c679cebe5c5c5cf5e2da5f4886b2ae933da821915f92
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --storage.tsdb.retention.time=15d
      --config.file=/etc/config/prometheus.yml
      --storage.tsdb.path=/data
      --web.console.libraries=/etc/prometheus/console_libraries
      --web.console.templates=/etc/prometheus/consoles
      --web.enable-lifecycle
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 04 Aug 2020 09:29:08 -0700
      Finished:     Tue, 04 Aug 2020 09:29:12 -0700
    Ready:          False
    Restart Count:  62
    Liveness:       http-get http://:9090/-/healthy delay=30s timeout=30s period=15s #success=1 #failure=3
    Readiness:      http-get http://:9090/-/ready delay=30s timeout=30s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data from storage-volume (rw)
      /etc/config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-fq9gj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-server
    Optional:  false
  storage-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prometheus-server
    ReadOnly:   false
  prometheus-server-token-fq9gj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-server-token-fq9gj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                      From                                                   Message
  ----     ------   ----                     ----                                                   -------
  Warning  BackOff  2m4s (x1377 over 4h57m)  kubelet, ip-10-116-170-224.us-west-2.compute.internal  Back-off restarting failed container
I can't figure out what suddenly went wrong.