Prometheus server crash

August 4, 2020

I followed the steps described at https://docs.aws.amazon.com/eks/latest/userguide/prometheus.html, modified the values.yaml file, and ran

kubectl --namespace=prometheus port-forward deploy/prometheus-server 9090

The original values in values.yaml were changed as follows:

scrape_configs:
      - job_name: prometheus
        metrics_path: /metrics
        scheme: https
        tls_config:
          insecure_skip_verify: true
        static_configs:
          - targets:
              - localhost:9090
              - "service-A.kube-system.svc.cluster.local:8080"
            labels:
              - service-A

Everything worked fine until last night, but this morning the pod went into CrashLoopBackOff.

Looking at the logs, I found

level=error ts=2020-08-04T16:08:38.367Z caller=main.go:758 err="error loading config from \"/etc/config/prometheus.yml\": couldn't load configuration (--config.file=\"/etc/config/prometheus.yml\"): parsing YAML file /etc/config/prometheus.yml: yaml: unmarshal errors:\n  line 16: cannot unmarshal !!seq into model.LabelSet"

The pod description is as follows:

Name:           prometheus-server-7965f5cbcb-h9jn6
Namespace:      prometheus
Priority:       0
Node:           ip-10-116-170-224.us-west-2.compute.internal/10.116.170.224
Start Time:     Tue, 04 Aug 2020 04:34:43 -0700
Labels:         app=prometheus
                chart=prometheus-11.11.1
                component=server
                heritage=Helm
                pod-template-hash=7965f5cbcb
                release=prometheus
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Running
IP:             10.116.170.227
Controlled By:  ReplicaSet/prometheus-server-7965f5cbcb
Containers:
  prometheus-server-configmap-reload:
    Container ID:  docker://f9067bee2632a05a5040a3c18ee4bd683756a633adeaf6af4cbfd4c1e7868257
    Image:         jimmidyson/configmap-reload:v0.3.0
    Image ID:      docker-pullable://jimmidyson/configmap-reload@sha256:d107c7a235c266273b1c3502a391fec374430e5625539403d0de797fa9c556a2
    Port:          <none>
    Host Port:     <none>
    Args:
      --volume-dir=/etc/config
      --webhook-url=http://127.0.0.1:9090/-/reload
    State:          Running
      Started:      Tue, 04 Aug 2020 04:35:04 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/config from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-fq9gj (ro)
  prometheus-server:
    Container ID:  docker://f022ba0bb23fc03c99fb0aac638506eebdb03d5e9af48df23fef053b66440a83
    Image:         prom/prometheus:v2.19.2
    Image ID:      docker-pullable://prom/prometheus@sha256:cd134bd4fca0f60ff8b4c679cebe5c5c5cf5e2da5f4886b2ae933da821915f92
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --storage.tsdb.retention.time=15d
      --config.file=/etc/config/prometheus.yml
      --storage.tsdb.path=/data
      --web.console.libraries=/etc/prometheus/console_libraries
      --web.console.templates=/etc/prometheus/consoles
      --web.enable-lifecycle
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 04 Aug 2020 09:29:08 -0700
      Finished:     Tue, 04 Aug 2020 09:29:12 -0700
    Ready:          False
    Restart Count:  62
    Liveness:       http-get http://:9090/-/healthy delay=30s timeout=30s period=15s #success=1 #failure=3
    Readiness:      http-get http://:9090/-/ready delay=30s timeout=30s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data from storage-volume (rw)
      /etc/config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-fq9gj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-server
    Optional:  false
  storage-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prometheus-server
    ReadOnly:   false
  prometheus-server-token-fq9gj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-server-token-fq9gj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                      From                                                   Message
  ----     ------   ----                     ----                                                   -------
  Warning  BackOff  2m4s (x1377 over 4h57m)  kubelet, ip-10-116-170-224.us-west-2.compute.internal  Back-off restarting failed container

I can't figure out what suddenly went wrong.
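
For reference, the log message "cannot unmarshal !!seq into model.LabelSet" suggests that a `labels` field was given a YAML sequence where Prometheus expects a mapping of label names to values. The `labels` entry under `static_configs` in the values.yaml change above is written as a list item (`- service-A`), so it is a likely candidate. A minimal sketch of the mapping form, assuming a hypothetical label name `service` (the name is not from the original config):

```yaml
# Sketch only: `service` is a hypothetical label name chosen for illustration.
static_configs:
  - targets:
      - localhost:9090
      - "service-A.kube-system.svc.cluster.local:8080"
    labels:
      # A mapping (name: value) instead of a sequence item (- service-A).
      service: service-A
```

Whether this is the only problem in the generated /etc/config/prometheus.yml cannot be confirmed from the log alone; it reports a single unmarshal error, at line 16 of that file.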

...