
Prometheus pod crashlooping

A Prometheus Kubernetes pod is crashlooping.

Increase the memory: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/blob/master/releases/30-gitlab-monitoring/gprd.yaml.gotmpl

Before increasing it, ensure that the cluster has enough headroom to accommodate the change. As of today, there is no simple, single procedure for checking this.

A full disk is often a symptom of OOM kills: on each boot the crashlooping process writes out more WAL, until the disk is full.
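To confirm that OOM kills are behind the restarts, the last termination reason of each container can be read from the pod status. The following is a minimal sketch rather than an established procedure: the helper names `last_termination_reasons` and `was_oom_killed` are hypothetical, and the namespace and pod name in the usage comment are taken from the commands in this runbook.

```shell
# Hypothetical helper: print "<container><TAB><last termination reason>"
# for each container in a pod. $1 = namespace, $2 = pod name.
last_termination_reasons() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Hypothetical helper: exit 0 if any container in the pod was last
# terminated with reason OOMKilled, non-zero otherwise.
was_oom_killed() {
  last_termination_reasons "$1" "$2" | grep -q 'OOMKilled'
}

# Usage (after selecting the gprd cluster):
#   was_oom_killed monitoring prometheus-gitlab-monitoring-promethe-prometheus-1 \
#     && echo "last termination was an OOM kill"
```

A container that has never been terminated prints an empty reason, so an empty second column does not by itself rule out memory pressure.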

Mitigations:

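# Select the gprd cluster.
glsh kube use-cluster gprd
# Open a shell in the thanos-sidecar container of the affected pod.
kubectl -n monitoring exec -it prometheus-gitlab-monitoring-promethe-prometheus-1 -c thanos-sidecar -- sh
# Check disk usage, remove temporary files, then check disk usage again.
df -h
rm -rf /prometheus/*.tmp
df -h
# Delete the contents of the WAL directory.
rm -rf /prometheus/wal/*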
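To compare disk usage before and after the cleanup without reading the `df -h` output by eye, the usage percentage can be extracted from `df -P`. This is only a sketch under assumptions: `mount_use_percent` is a hypothetical helper, and it assumes that `/prometheus` is a separate mount point inside the container and that the container's `df` supports `-P`.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the usage
# percentage (without the % sign) for the mount point given as $1.
# In `df -P` output, column 5 is the capacity and column 6 the mount point.
mount_use_percent() {
  awk -v m="$1" 'NR > 1 && $6 == m { sub(/%/, "", $5); print $5 }'
}

# Usage (from a workstation, after selecting the gprd cluster):
#   kubectl -n monitoring exec prometheus-gitlab-monitoring-promethe-prometheus-1 \
#     -c thanos-sidecar -- df -P | mount_use_percent /prometheus
```

If the helper prints nothing, `/prometheus` is not listed as its own mount point and the plain `df -h` output should be used instead.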

On occasion the WAL files become corrupted, as they did in incident 6148 and incident 5998.

To determine whether the WAL files are corrupted, check the following:

  • Run du -h and confirm that both wal and chunks_head are unusually large
  • Tail the logs on the pod to confirm that it was recovering the WAL when it was killed
  • Search the logs for “iterate on on-disk chunks” and look for errors
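The log search in the list above can be sketched as a filter over the pod's logs. `chunk_iteration_errors` is a hypothetical helper, and the container name `prometheus` in the usage comment is an assumption that may need adjusting for this deployment.

```shell
# Hypothetical helper: read log lines on stdin and print only those that
# mention "iterate on on-disk chunks" (case-insensitive, fixed string).
chunk_iteration_errors() {
  grep -iF 'iterate on on-disk chunks'
}

# Usage (the container name "prometheus" is an assumption; --previous reads
# the logs of the last terminated instance of the container):
#   kubectl -n monitoring logs prometheus-gitlab-monitoring-promethe-prometheus-0 \
#     -c prometheus --previous | chunk_iteration_errors
#
# Directory sizes can be checked from a shell inside the pod:
#   du -sh /prometheus/wal /prometheus/chunks_head
```

The helper only narrows the logs down to the relevant lines; whether they indicate corruption still has to be judged by reading them.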

In those incidents, the resolution was to delete the WAL files, as below:

~ kubectl -n monitoring exec -it prometheus-gitlab-monitoring-promethe-prometheus-0 -c thanos-sidecar -- sh
/ $ cd /prometheus/
/prometheus $ rm -rf wal/