KubeContainersWaitingInError
Overview
More than 50% of the containers waiting to start for a deployment are waiting due to reasons that we consider to be an error state (that is, any reason other than ContainerCreating).
There are many reasons why containers may fail to start; some include:
- GCP Quota Limits: we are unable to increase the capacity of a node pool.
- A configuration error has been pushed to the application, resulting in a termination during startup and a CrashLoopBackOff.
- Kubernetes is unable to pull the required image from the registry.
- An increase in the number of containers that need to be created during a deployment.
- Calico-typha pods have undergone a recent migration/failure (see below)
When this alert fires, it means that new containers are not spinning up correctly. If existing containers are still running, it does not necessarily indicate an outage, but could lead to one if the existing containers are removed while in this state.
When this alert fires, the recipient should determine why the containers are failing to start. The most efficient way to do this is to connect to the cluster and look at the state of the failed pods, then check the events and logs to see what is causing them to be in that state.
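As a minimal first pass (pod and namespace values are placeholders, and the commands are the same ones listed under Troubleshooting below):

```
# Connect to the affected cluster (gprd shown as an example)
glsh kube use-cluster gprd

# List pods that are not Running, along with their status
kubectl get pods -A | grep -v Running

# The pod's events usually name the blocking reason (image pull, config, volume, ...)
kubectl describe pod <podname> -n <namespace>

# For CrashLoopBackOff, the previous container's logs show why startup failed
kubectl logs <podname> -n <namespace> --previous
```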
Services
Metrics
- Container Waiting Reasons shows reasons why containers are waiting to start
- This chart shows how many containers are stuck in a state other than Running, measured against a saturation threshold. The absolute count is not particularly meaningful on its own; it mainly indicates the source of this alert.
- This was added as a precaution to catch problems before they become larger ones. Ideally our clusters and Pod configurations are solid enough that this problem does not occur.
- Under normal circumstances, we'll see spikes of Pods cycling through Pending and PodInitializing. These spikes are normal when major changes happen, such as an upgrade or config change, or anything else that would cause a whole set of Pods to be rotated out. The length of time a Pod spends in either of these states should be short, as we've optimized how our Pods start. This may not be true for workloads for which we do not own the source code.
- Dashboard example of the alert firing
- It is normal for there to be some indicators in the dashboard of Pods cycling out of the desired state. When this alert fires, you will notice that the indicators stay above 50% on the graph. Once things recover, those metrics will come back below 50% (and preferably to 0 for the containers in question). A per-reason breakdown query is sketched below.
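A rough sketch of that breakdown, reusing the labeled recording rule and label filters from the alert expression shown further down (the grouping by reason and cluster is illustrative):

```
sum by (reason, cluster) (
  kube_pod_container_status_waiting_reason:labeled{reason!="ContainerCreating", stage!="", type!=""}
) > 0
```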
Alert Behavior
- Unless something catastrophic or a known issue is happening, I would shy away from alert silencing. If we do need to silence, make an attempt to target the smallest object possible, such as the deployment name, or, if the problem is cluster wide, the target cluster (an example silence is sketched below).
- This alert should be rare, and indicates a problem which will likely need manual attention
- Dashboard example of the alert firing
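If silencing is unavoidable, one narrowly-scoped option is an amtool silence; this is a sketch only, the Alertmanager URL and matcher values are placeholders, and the labels actually carried by the firing alert should be checked first:

```
amtool silence add \
  --alertmanager.url=https://<alertmanager-host> \
  --comment="<link to the incident or known issue>" \
  --duration=2h \
  alertname=KubeContainersWaitingInError \
  cluster=<cluster> \
  type=<service-type>
```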
Severities
- Start with a low severity until we determine what impact this has on the target service that is showing disruption. For example, if Consul can't start, this is probably outage inducing; but if Sidekiq can't start, our PodDisruptionBudget (PDB) ensures the old pods stick around while we wait for the blocker to be repaired, so Sidekiq would still be functional (a quick PDB check is sketched after this list). Additional observation of the impacted service would be required.
- This alert will impact different sets of users depending on which pods are causing it. We will need to determine that before we know whether the impact is internal or external.
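To confirm whether a PodDisruptionBudget is what is keeping the old pods around for the affected workload, a quick check might look like this (namespace and PDB name are placeholders):

```
# List PDBs in the workload's namespace and their currently allowed disruptions
kubectl get pdb -n <namespace>

# Details, including the selector and the pods the budget currently covers
kubectl describe pdb <pdb-name> -n <namespace>
```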
Verification
- Prometheus query that triggered the alert
- Alerts: Containers Waiting Dashboard
- We ingest GKE events into Kibana; these form our GKE events log. We do not filter them out, so the same information can also be found in Stackdriver. If using Stackdriver, it's easier to first look for the impacted workload/cluster in GCP's console and follow the link to its logs from there; this helps build a filter query when troubleshooting from this direction.
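While connected to the cluster, the same events can also be read directly, which is often quicker than building a Kibana or Stackdriver filter from scratch; for example:

```
# Recent warning events across all namespaces, newest last
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -n 30
```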
Recent changes
- Recent related production change requests
- Recent Helm MRs
- To roll back a change, find the MR which introduced it. The MR is likely to be in the Kubernetes Workloads namespace. Revert that MR and make sure the pipeline completes.
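Once the revert pipeline has finished, recovery can be confirmed directly against the cluster; the deployment name and namespace below are placeholders:

```
# Wait for the reverted deployment to roll out cleanly
kubectl rollout status deployment/<deployment-name> -n <namespace> --timeout=5m

# Confirm no pods are left in a waiting or error state
kubectl get pods -n <namespace> | grep -v Running
```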
Troubleshooting
- Basic troubleshooting order
- Connect to the cluster in question
- Identify the failing containers
- Determine why they are failing
- That should lead to what needs to be fixed to get the containers into a Running state
- On the Alerts: Containers Waiting Dashboard, select the environment, type, and cluster in question and see what the metrics look like there
- Useful scripts or commands
- Set up the cluster connection: glsh kube use-cluster gprd
- View the pods that are not running: kubectl get pods -o jsonpath='{range .items[?(@.status.containerStatuses[-1:].state.waiting)]}{.metadata.name}: {@.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' -A
- Same as above, but less precise and with more information: kubectl get pods -A | grep -v "Running"
- View the logs of a container identified with the previous commands: kubectl logs -n (namespace) (podname)
- View the events and other information for a container identified with the previous commands: kubectl describe pod -n (namespace) (podname)
This PromQL query will show which deployments are out of their desired states:
sum by (type, env, tier, stage, cluster) (kube_pod_container_status_waiting_reason:labeled{reason!="ContainerCreating",stage!="",type!=""}) > 0 >= on (type, env, tier, stage, cluster) (topk by (type, env, tier, stage, cluster) (1, kube_deployment_spec_strategy_rollingupdate_max_surge:labeled{stage!="",tier!="",type!=""}) * 0.5)
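To narrow the firing series down to individual pods, a drill-down along these lines can help; it assumes the labeled recording rule still carries the namespace and pod labels from the underlying kube-state-metrics series:

```
topk(20,
  sum by (cluster, namespace, pod, reason) (
    kube_pod_container_status_waiting_reason:labeled{reason!="ContainerCreating", stage!="", type!=""}
  )
) > 0
```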
Possible Resolutions
- Previous incidents for this alert
- 2023-09-25: KubeContainersWaitingInError for canary services
- 2024-05-23: Containers are unable to start
- 2024-03-20: KubeContainersWaitingInError external-dns
Dependencies
- Kubernetes configuration or secret changes have historically caused the most alerts
- Secret configuration has been the primary violator
- The alert can fire if a container image can't be pulled
- PVC mounting failures (an event check for these cases is sketched after this list)
- It should be fairly straightforward to identify the deployment causing the problem. Once identified, it will be clearer where to escalate. The Foundations or Delivery team are most likely to be able to help, but that will become clearer once we know the source of the alert.
- Slack channels where help is likely to be found: #g_foundations
- The only tunable parameter in the alert is the percentage of errored containers that we tolerate
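When checking these dependencies, the waiting reason itself usually points at the culprit. A sketch for pulling the relevant events (pod and namespace are placeholders; the reason-to-cause mapping reflects general Kubernetes behaviour rather than anything specific to this alert):

```
# Events for the affected pod, oldest first
kubectl get events -n <namespace> --field-selector involvedObject.name=<podname> --sort-by=.lastTimestamp

# Common waiting reasons and their usual causes:
#   ErrImagePull / ImagePullBackOff -> the image can't be pulled from the registry
#   CreateContainerConfigError      -> a referenced ConfigMap/Secret (or key) is missing
#   CrashLoopBackOff                -> the application is terminating during startup
#   Stuck in ContainerCreating      -> often PVC attach/mount or networking problems
```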