Diagnose Before Delete: Single-Pod 5xx Incidents
When to use this runbook
Use this runbook when dashboards or logs show that elevated 5xx errors are
isolated to a single pod (e.g. one Puma worker in the web or api
fleet). The instinct is to delete the pod or trigger a redeploy — that clears
the alert but destroys the evidence needed to find root cause. Follow the steps
below to capture diagnostics first, then remove the pod.
1. Identify the pod
Confirm that the 5xxs are coming from a single pod and not spread across the fleet. Check per-pod error rates in the web service dashboard and in logs.
- Dashboards: web service overview — check the per-pod error rate panels to confirm the spike is isolated to a single pod.
- Kibana: filter json.hostname: <pod-name> and json.status: [500 TO 599] in the pubsub-rails-inf-gprd-* index. See INC-7175 for an example.
- Sentry: group errors by server_name to confirm the dominant error class is concentrated on one host.
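The two Kibana filters above can be combined into a single query string. The helper below is a minimal sketch, not part of the runbook's tooling: the function name is hypothetical, and it assumes the Kibana query bar is set to Lucene syntax, which the [500 TO 599] range form suggests.

```shell
#!/usr/bin/env bash
# Build the step 1 filter for a given pod name.
# Assumption: Lucene query syntax, where [500 TO 599] is an inclusive range
# and AND joins the two clauses.
kibana_5xx_query() {
  local pod_name="$1"
  printf 'json.hostname: %s AND json.status: [500 TO 599]\n' "$pod_name"
}

# Example (hypothetical pod name):
#   kibana_5xx_query gitlab-webservice-web-abc123
```

Paste the printed string into the query bar while the pubsub-rails-inf-gprd-* index is selected.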
Once you have the pod name, proceed immediately to step 2. Do not delete the pod yet.
2. Isolate the pod
Remove the pod from the load-balancer rotation so it stops receiving new traffic, but keep it running for inspection.
Follow docs/kube/k8s-isolate-pod.md.
Do NOT delete the pod yet. Deletion destroys all in-process state, logs, and memory that you need for diagnosis.
3. Capture diagnostics
Run the following while the pod is isolated and still alive.
a. Recent application logs
Pull the last few hundred lines from the pod:
kubectl -n gitlab logs <pod-name> --tail=500
# If the container has restarted, fetch logs from the previous instance:
kubectl -n gitlab logs <pod-name> --previous --tail=500
For a multi-container pod (e.g. with a sidecar), specify the container:
kubectl -n gitlab logs <pod-name> -c <container-name> --tail=500
Cross-reference with Kibana using json.hostname: <pod-name> for the full structured log stream.
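To attach the logs to the incident issue later, it helps to write them to files rather than reading them in the terminal. The function below is a sketch under stated assumptions: the function name, output directory, and file names are illustrative, and only the kubectl invocations come from this runbook.

```shell
#!/usr/bin/env bash
# Sketch: save current and previous container logs for one pod to files.
# Assumptions: the "gitlab" namespace used in this runbook; the file names
# logs-current.txt and logs-previous.txt are hypothetical.
capture_pod_logs() {
  local pod_name="$1"
  local out_dir="$2"
  mkdir -p "$out_dir" || return 1

  # Logs from the currently running container instance.
  kubectl -n gitlab logs "$pod_name" --tail=500 \
    > "$out_dir/logs-current.txt" || return 1

  # Logs from the previous instance exist only if the container has
  # restarted, so a failure here is expected and not treated as an error.
  if ! kubectl -n gitlab logs "$pod_name" --previous --tail=500 \
      > "$out_dir/logs-previous.txt" 2>/dev/null; then
    rm -f "$out_dir/logs-previous.txt"
  fi
}

# Example (hypothetical pod name and directory):
#   capture_pod_logs gitlab-webservice-web-abc123 ./evidence
```

For a multi-container pod, add -c <container-name> to both kubectl calls.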
b. Stack trace / exception sample
In Sentry, filter by server_name:<pod-name> and note:
- The dominant exception class (e.g. ArgumentError)
- The full backtrace of the most recent occurrence
- The request path and parameters if available
Copy the Sentry event URL into the incident issue.
c. Ruby process snapshot / CPU flamegraph
See docs/uncategorized/ruby-profiling.md for the full procedure (stackprof, flamegraphs via perf).
Quick path using the GKE helper script (run on the GKE node — use toolbox on COS nodes since perf is not available in the default COS environment):
scripts/gke/perf_flamegraph_for_pod_id.sh <pod-name>
Save the resulting SVG and attach it to the incident issue.
d. Pod environment data
# Full pod description: image SHA, node, restart count, events
kubectl -n gitlab describe pod <pod-name>
# Node the pod is running on
kubectl -n gitlab get pod <pod-name> -o jsonpath='{.spec.nodeName}'
# Image digest
kubectl -n gitlab get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[*].imageID}'
Note the restart count and any recent Warning events in the describe output. These often reveal OOM kills or liveness probe failures that preceded the 5xx spike.
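The restart count and the reason the previous container instance terminated can also be read directly from the pod status, which is quicker than scanning the describe output. The sketch below assumes the same gitlab namespace; the function name is hypothetical. The restartCount and lastState.terminated.reason fields are standard parts of the Kubernetes container status.

```shell
#!/usr/bin/env bash
# Sketch: print restart counts and last termination reasons for a pod.
# Assumption: the "gitlab" namespace used elsewhere in this runbook.
pod_restart_summary() {
  local pod_name="$1"
  # One line per container: name, restart count, and the reason the
  # previous instance terminated (for example OOMKilled). The reason is
  # empty when the container has never been terminated.
  kubectl -n gitlab get pod "$pod_name" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Example (hypothetical pod name):
#   pod_restart_summary gitlab-webservice-web-abc123
```

A non-zero restart count together with an OOMKilled reason points at memory pressure rather than an application exception.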
e. Network capture (optional)
If a network-level issue is suspected (e.g. upstream timeouts, TLS errors), capture traffic on the pod’s network namespace or interface:
# Using pod network namespace
scripts/gke/tcpdump_on_gke_node_for_pod_id.using_pod_netns.sh <pod-name>
# Using pod network interface
scripts/gke/tcpdump_on_gke_node_for_pod_id.using_pod_iface.sh <pod-name>
See also scripts/gke/container_inspection_library.sh for lower-level container inspection helpers.
4. Preserve evidence
Before removing the pod, attach all captured artifacts to the incident issue:
- kubectl logs output (paste or attach as a file)
- kubectl describe pod output
- Flamegraph SVG (if generated)
- pcap file (if a network capture was taken)
- Sentry event URL(s)
- Image SHA / digest
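If the artifacts were saved as files, one way to attach them is as a single archive with a checksum list, so later readers can confirm nothing was altered. This is a sketch, not an established step of this runbook: it assumes a flat directory of artifact files, that sha256sum (GNU coreutils) is available, and that the archive path is outside the evidence directory. The function and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: bundle captured artifacts into one archive with a checksum list.
# Assumptions: a flat directory of files; sha256sum is installed; the
# archive is written outside the evidence directory.
bundle_evidence() {
  local evidence_dir="$1"
  local archive="$2"
  local sums

  # Remove a stale checksum list so it is not hashed as an artifact.
  rm -f "$evidence_dir/SHA256SUMS"

  # Hash every file; this fails (and aborts) if the directory is empty.
  sums="$(cd "$evidence_dir" && sha256sum -- *)" || return 1
  printf '%s\n' "$sums" > "$evidence_dir/SHA256SUMS"

  # Pack the directory contents, including SHA256SUMS, into the archive.
  tar -czf "$archive" -C "$evidence_dir" .
}

# Example (hypothetical paths):
#   bundle_evidence ./evidence ./incident-evidence.tar.gz
```

Sentry event URLs and the image digest are text, so they can go straight into the incident issue rather than the archive.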
5. Remove the pod
Once diagnostics are captured, delete the pod to restore normal capacity:
kubectl -n gitlab delete pod <pod-name>
The deployment controller will schedule a replacement automatically. Verify the new pod comes up healthy and that the 5xx rate returns to baseline.
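To make the diagnose-before-delete order harder to skip under pressure, the delete can be wrapped in a guard that checks for captured evidence first. The wrapper below is a hypothetical sketch: the function name, the evidence directory, and the logs-current.txt file name are assumptions, and only the kubectl delete command comes from this runbook.

```shell
#!/usr/bin/env bash
# Sketch: refuse to delete the pod until a log capture exists on disk.
# Assumptions: the "gitlab" namespace from this runbook; the evidence
# directory and the logs-current.txt file name are hypothetical.
delete_pod_after_capture() {
  local pod_name="$1"
  local evidence_dir="$2"

  # Guard: require a non-empty log capture before destroying the pod.
  if [ ! -s "$evidence_dir/logs-current.txt" ]; then
    echo "No captured logs in $evidence_dir; not deleting $pod_name" >&2
    return 1
  fi

  # Same command as in step 5; the controller schedules a replacement.
  kubectl -n gitlab delete pod "$pod_name"
}

# Example (hypothetical pod name and directory):
#   delete_pod_after_capture gitlab-webservice-web-abc123 ./evidence
```

The guard only checks that a log file exists. It does not replace the manual check that the replacement pod is healthy and that the 5xx rate has returned to baseline.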
6. Follow up
If root cause is not yet identified:
- Open an issue against the responsible service team with the captured diagnostics attached.
- Reference the originating incident in the issue.
- Link the new issue from the incident timeline.
Related runbooks
- docs/kube/k8s-isolate-pod.md — isolate a pod without deleting it
- docs/kube/kubernetes.md — general Kubernetes ops, pod restart inspection
- docs/uncategorized/ruby-profiling.md — Ruby profiling on k8s pods (stackprof, flamegraphs)
- scripts/gke/perf_flamegraph_for_pod_id.sh — CPU flamegraph for a pod
- scripts/gke/tcpdump_on_gke_node_for_pod_id.using_pod_netns.sh — network capture via pod netns
- scripts/gke/tcpdump_on_gke_node_for_pod_id.using_pod_iface.sh — network capture via pod interface
- scripts/gke/container_inspection_library.sh — container inspection helpers