KubernetesClusterZombieProcesses
Overview
Zombie (or defunct) processes occur when a parent process spawns a child and fails to reap it (for example by calling wait()) after it finishes executing. When processes are regularly left in this state, the build-up can lead to PID and file handle exhaustion, thread contention, and several other problematic states. It is usually the result of bugs in code that leave processes unreaped.
If this alert is firing, we should check the graphs to determine when processes started being left in this state. If the onset of leaked processes correlates with a recent deployment, we may consider rolling back the code. We may also want to locate the workload responsible for the leaked processes and pre-emptively restart its pods to temporarily alleviate some of the symptoms associated with this state.
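For reference, this state is easy to reproduce locally: a parent that never reaps an exited child leaves that child as a zombie. A minimal shell sketch, for illustration only (not part of this alert's tooling):

```sh
# Reproduce a zombie for testing: the subshell forks a short-lived child, then
# exec's a long-running process that never calls wait(), so the exited child
# lingers in the Z (zombie) state until the parent itself exits.
(sleep 1 & exec sleep 60) &

# After a couple of seconds the child shows up with state Z / <defunct>:
sleep 2
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
```

Because a zombie is cleared as soon as its parent exits or reaps it, restarting the offending pods (as noted above) removes the accumulated entries, but only treats the symptom.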
Services
- This alert can apply to any workload running in Kubernetes.
- See the Troubleshooting section below for hints on locating the workloads that are contributing to zombie process creation.
- Refer to the service catalog to locate the appropriate service owner once the workload is identified.
Metrics
- Metric in Grafana Explore
- Some zombie/defunct process churn is normal during day-to-day operations. The alert requires that the number of zombie processes on a cluster be greater than 25 for 15 minutes or longer before it fires.
- We should use this metric to detect when these processes are being created but not reaped automatically.
- An example of a problematic state is a zombie process count that climbs steadily and does not return to baseline.
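The linked Grafana panel is the source of truth. For illustration only, assuming the counts come from node_exporter's processes collector (a `node_processes_state{state="Z"}`-style series, which is an assumption and may not match the recording rule this alert actually uses), the per-cluster totals could also be pulled directly from the Prometheus API:

```sh
# Hypothetical sketch: fetch the current zombie-process count per cluster.
# PROM_URL, the metric name, and the cluster label are assumptions; substitute
# whatever the alerting rule actually queries.
PROM_URL="https://prometheus.example.com"

curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=sum by (cluster) (node_processes_state{state="Z"})' \
  | jq '.data.result[] | {cluster: .metric.cluster, zombies: .value[1]}'
```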
Alert Behavior
- This alert is intended to capture problems that exist across entire deployments in a given cluster, as opposed to individual workloads. Alerts are aggregated by cluster for this reason. Any silence has the potential to mask additional, new causes of the alert for as long as it exists, so silences should only be created for short durations, and with care.
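If a short silence is unavoidable, scope it narrowly and keep it brief. A hedged sketch using amtool (the Alertmanager URL and the cluster label are placeholders; creating the silence through the Alertmanager or Grafana UI is equally valid):

```sh
# Sketch only: a short, narrowly scoped silence for this alert on one cluster.
# The Alertmanager URL and label values are placeholders.
amtool silence add \
  --alertmanager.url="https://alertmanager.example.com" \
  --author="$(whoami)" \
  --comment="Zombie process investigation, link the incident issue here" \
  --duration="2h" \
  alertname="KubernetesClusterZombieProcesses" cluster="example-cluster"
```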
Severities
- This alert captures symptomatic states of different issues and doesn't represent an immediate problem on its own. Assigning an S3 severity may be appropriate if no additional alerts are firing.
- There is a high likelihood that the cause of zombie/defunct processes being spawned will also result in Apdex violations that lead to S2 incidents, so this should not be ignored.
Verification
- Refer to the metric in Grafana Explore and verify that the zombie process counts are rising, and not simply the result of a spike that has subsided.
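To tell a sustained rise apart from a spike that has already subsided, the same (assumed) query from the Metrics section can be run over a range, for example the last hour:

```sh
# Sketch: pull the last hour of zombie counts at 1-minute resolution so an
# upward trend can be distinguished from a spike that has already subsided.
# PROM_URL and the metric/label names are the same assumptions as above.
PROM_URL="https://prometheus.example.com"
NOW=$(date +%s)

curl -sG "${PROM_URL}/api/v1/query_range" \
  --data-urlencode 'query=sum by (cluster) (node_processes_state{state="Z"})' \
  --data-urlencode "start=$((NOW - 3600))" \
  --data-urlencode "end=${NOW}" \
  --data-urlencode "step=60" \
  | jq '.data.result[] | {cluster: .metric.cluster, samples: [.values[][1]]}'
```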
Recent changes
- Look for recent deployments to the GPRD environment to determine whether recent code changes have been rolled out. If so, a rollback may need to be considered.
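Depending on how deployments are managed for the affected service, a rollback may go through the normal deployment pipeline rather than kubectl. Purely as an illustration, with placeholder namespace and deployment names:

```sh
# Sketch: inspect recent rollouts of a suspect workload and, if a rollout
# lines up with the start of the leak, roll it back.
# "example-namespace" and "example-deployment" are placeholders.
kubectl -n example-namespace rollout history deployment/example-deployment

# Revert to the previous revision (coordinate with the owning team first).
kubectl -n example-namespace rollout undo deployment/example-deployment
```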
Troubleshooting
- Attempt to determine the workload that is responsible for spawning zombie/defunct processes (a sketch of one possible approach follows this list).
  - Locate a node in the GCP cluster mentioned in the alert that has zombie processes. This can be done by removing the `sum()` aggregator from the metric query so that counts are broken out per node.
  - From the process list (`ps -ef`) on that node, find any zombie processes, indicated by the `<defunct>` string.
  - Use the process names and paths identified to correlate back to the likely workload.
- TODO: We need to document a better way to consistently identify the workload spawning these processes. As of writing, I haven't found any running systems against which to work out a succinct set of actions to take. The above are just suggestions.
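One possible way to string the steps above together, assuming SSH access to the affected node and crictl being available there (both assumptions about this environment):

```sh
# 1. On a node that reports zombies, list them together with their parent PIDs.
#    The zombie itself holds nothing but a PID entry; the parent that fails to
#    reap it is the process of interest.
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'

# 2. Inspect the parent's cgroup; on Kubernetes nodes the cgroup path usually
#    embeds the pod UID and container ID.
PARENT_PID=12345   # placeholder: a PPID value taken from step 1
cat "/proc/${PARENT_PID}/cgroup"

# 3. Map the container ID found in the cgroup path back to a pod.
#    (crictl inspect output includes io.kubernetes.pod.* labels.)
CONTAINER_ID=abcdef123456   # placeholder: ID taken from the cgroup path
crictl inspect "${CONTAINER_ID}" | grep -E 'io.kubernetes.pod.(name|namespace)'

# 4. Alternatively, from a workstation, list the pods scheduled on that node
#    and narrow down by the process names seen in step 1.
NODE_NAME=example-node   # placeholder: the node identified from the metric
kubectl get pods --all-namespaces --field-selector "spec.nodeName=${NODE_NAME}"
```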
Possible Resolutions
- A recent issue in which zombie processes were related to the cause
- Escalate to the team responsible for the service likely to be spawning the zombie processes.
- If unsure about which service is resulting in the leaked PIDs, escalate to #g_production_engineering.