Prometheus pod crashlooping
A Prometheus Kubernetes pod is crashlooping.
Common symptoms
Out of memory
Increase the memory: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/blob/master/releases/30-gitlab-monitoring/gprd.yaml.gotmpl
Ensure that we have enough cluster-level headroom to accommodate this. As of today, there is no simple, single procedure to ensure this.
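Before raising the limit, it can help to confirm that the restarts really are OOM kills. The following is a minimal sketch, not an established procedure: the helper name `last_termination_reasons` is hypothetical, and the namespace and pod name in the usage comment are the ones used elsewhere in this runbook.

```shell
# Hypothetical helper: print each container in a pod together with the reason
# its previous run terminated (for example "OOMKilled").
# Arguments: $1 = namespace, $2 = pod name.
last_termination_reasons() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Usage (assumes the gprd cluster is already selected):
#   last_termination_reasons monitoring prometheus-gitlab-monitoring-promethe-prometheus-1
```

A container that has not been restarted shows an empty reason column. For the headroom question, `kubectl describe node <node>` lists the allocated requests and limits per node, which is one input but not a complete answer.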
Persistent disk full
This is actually often a symptom of OOM kills: the crashlooping process will begin to write out some WAL on each boot, until the disk is full.
Mitigations:
glsh kube use-cluster gprd
kubectl -n monitoring exec -it prometheus-gitlab-monitoring-promethe-prometheus-1 -c thanos-sidecar -- sh
df -h
rm -rf /prometheus/*.tmp
df -h
rm -rf /prometheus/wal/*
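The `df -h` checks above can also be run from outside the pod, so that the more destructive WAL removal is only reached when the disk is still nearly full after the `.tmp` cleanup. This is a sketch under assumptions: the helper name `usage_percent` is hypothetical, and it assumes the `df` in the `thanos-sidecar` container accepts the POSIX `-P` flag.

```shell
# Hypothetical helper: read `df -P <path>` output on stdin and print the
# usage of the single filesystem row as a bare integer (no "%" sign).
usage_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage (assumes the gprd cluster is already selected):
#   kubectl -n monitoring exec prometheus-gitlab-monitoring-promethe-prometheus-1 \
#     -c thanos-sidecar -- df -P /prometheus | usage_percent
```

Comparing the value before and after removing the `.tmp` files shows whether that step alone freed enough space.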
Remove corrupted WAL files
On occasion the WAL files will become corrupted, as they did in incident 6148 and incident 5998.
There are a few things to check to determine if the WAL files are corrupted.
- Run du -h and confirm that wal is large and chunks_head is also huge
- Tail the logs on the pod to confirm it was recovering WALs when it was killed
- Search the logs for “iterate on on-disk chunks” and look for errors (example)
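The checks above can be sketched as two small helpers. The function names are hypothetical, and the container name `prometheus` for the log query is an assumption; the `thanos-sidecar` container and the `/prometheus` path come from the commands in this runbook.

```shell
# Check 1: sizes of the WAL and head-chunk directories.
# Arguments: $1 = namespace, $2 = pod name.
wal_sizes() {
  kubectl -n "$1" exec "$2" -c thanos-sidecar -- \
    du -sh /prometheus/wal /prometheus/chunks_head
}

# Checks 2 and 3: log lines from the previous (killed) container run that
# mention iterating on on-disk chunks. The container name is an assumption.
chunk_iteration_logs() {
  kubectl -n "$1" logs "$2" -c prometheus --previous \
    | grep -F 'iterate on on-disk chunks'
}

# Usage (assumes the gprd cluster is already selected):
#   wal_sizes monitoring prometheus-gitlab-monitoring-promethe-prometheus-0
#   chunk_iteration_logs monitoring prometheus-gitlab-monitoring-promethe-prometheus-0
```

`--previous` reads the logs of the last terminated run, which is the run that was killed during WAL recovery. Dropping the `grep` and adding `--tail=50` shows what the process was doing just before it died.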
The resolution to this was to delete the WAL files as below:
~ kubectl -n monitoring exec -it prometheus-gitlab-monitoring-promethe-prometheus-0 -c thanos-sidecar -- sh
/ $ cd /prometheus/
/prometheus $ rm -rf wal/