Prometheus pod crashlooping

A Prometheus Kubernetes pod is crashlooping.

Common symptoms

Out of memory

Increase the memory: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/blob/master/releases/30-gitlab-monitoring/gprd.yaml.gotmpl

Ensure that we have enough cluster-level headroom to accommodate this. As of today, there is no simple, single procedure to ensure this.

Persistent disk full

This is actually often a symptom of OOM kills: the crashlooping process will begin to write out some WAL on each boot, until the disk is full.

Mitigations:

glsh kube use-cluster gprd



kubectl -n monitoring exec -it prometheus-gitlab-monitoring-promethe-prometheus-1 -c thanos-sidecar sh


df -h



rm -rf /prometheus/*.tmp


df -h





rm -rf /prometheus/wal/*

Remove corrupted WAL files

On occasion the WAL files will become corrupted as they did in incident 6148 and incident 5998.

There are a few things to check to determine if the WAL files are corrupted.

Run du -h and confirm that wal is large and chunks_head is also huge
Tail the logs on the pod to confirm it was recovering WALs when it was killed
Search the logs for “iterate on on-disk chunks” and look for errors (example)

The resolution to this was to delete the WAL files as below:

 ~ kubectl -n monitoring exec -it prometheus-gitlab-monitoring-promethe-prometheus-0 -c thanos-sidecar -- sh
/ $ cd /prometheus/
/prometheus $ rm -rf wal/