GKENodeCountCritical
Overview
A GKE node pool has reached, or is about to reach, its configured maximum number of nodes (max_node_count in Terraform). Cluster Autoscaler cannot add more nodes to that pool until the cap is raised.
The cap applies per zone. In a zonal cluster the pool's maximum node count equals the cap; in a regional cluster it is the cap multiplied by the number of zones the cluster spans.
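The per-zone arithmetic can be sketched as a small shell helper. This is an illustration only: effective_max_nodes is a hypothetical name, not part of any existing tooling, and the zone count has to be supplied by the operator (use 1 for a zonal cluster).

```shell
# Hypothetical helper: pool-wide node ceiling implied by a per-zone cap.
# $1 = per-zone cap (max_node_count)
# $2 = number of zones the cluster spans (1 for a zonal cluster)
effective_max_nodes() {
  echo $(( $1 * $2 ))
}
```

For example, `effective_max_nodes 50 3` prints 150: a per-zone cap of 50 on a cluster spanning three zones allows up to 150 nodes in the pool.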
Services
Metrics
- SLI: gitlab_component_saturation:ratio{component="kube_pool_max_nodes"}
- Soft SLO: 90%. Hard SLO: 95% (alert fires above hard).
- Dashboard: Kube pool max nodes saturation
- Node pool caps live in config-mgmt: see the gke_nodepool_max_nodes variable in each environment’s variables.tf (for example environments/gprd/variables.tf) and its use in gke-regional.tf / gke-zonal.tf. The caps are exported to Prometheus as terraform_report_google_cluster_node_pool_max_node_count via a CI job on Terraform runs.
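To make the threshold semantics concrete, the following sketch classifies a saturation ratio against the soft and hard SLOs. The function name classify_saturation and its output labels are hypothetical; the defaults 0.90 and 0.95 come from this playbook, and the ratio itself is assumed to have been read from the dashboard or from a query on the SLI above.

```shell
# Hypothetical helper: compare a saturation ratio with the soft/hard SLOs.
# $1 = observed ratio (0..1)
# $2 = soft SLO (defaults to 0.90), $3 = hard SLO (defaults to 0.95)
classify_saturation() {
  awk -v r="$1" -v soft="${2:-0.90}" -v hard="${3:-0.95}" 'BEGIN {
    # "+ 0" forces numeric comparison of the command-line values.
    if (r + 0 > hard + 0)      print "above-hard"
    else if (r + 0 > soft + 0) print "above-soft"
    else                       print "ok"
  }'
}
```

A ratio exactly equal to the hard SLO is reported as above-soft, matching the statement that the alert fires only above the hard threshold.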
Alert Behavior
- Severity s3.
- Scope silences to (cluster, label_pool).
Severities
- S2 when workload pods are pending due to lack of capacity (see KubeSchedulingFailures firing on the same cluster).
- S3 when at cap but scheduling is not currently failing.
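The severity rule above can be expressed as a one-line decision. This is a sketch, not an existing tool: pick_severity is a hypothetical name, and the count of capacity-starved pods still has to be judged by the operator, since the kubectl command in the comment also counts pods that are pending for unrelated reasons.

```shell
# Hypothetical helper: map a count of capacity-starved pending pods to a severity.
# $1 = number of pods pending for lack of capacity on the affected cluster
pick_severity() {
  if [ "$1" -gt 0 ]; then
    echo "S2"
  else
    echo "S3"
  fi
}

# A rough upper bound for $1 (includes pods pending for other reasons):
#   kubectl get pods --all-namespaces --field-selector=status.phase=Pending --no-headers | wc -l
```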
Verification
- Identify the affected pool. The alert labels include cluster, label_pool, and shard:
    gcloud --project <PROJECT> container node-pools list \
      --cluster <CLUSTER> --location <LOCATION>
- Cross-check the Cluster Autoscaler status:
    kubectl describe configmap cluster-autoscaler-status -n kube-system
  Look for a node group where cloudProviderTarget == maxSize.
- Confirm the Terraform-reported cap matches the GKE-side cap. Divergence means Terraform hasn’t run since the last change.
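The last two checks can be partly scripted. The sketch below makes two assumptions that should be verified against the cluster in question: that the autoscaler status is in the older plain-text layout (a "Name:" line followed by a "Health:" line containing cloudProviderTarget=N and maxSize=M; newer autoscaler releases may emit a different layout), and that the helper names pools_at_cap and gke_pool_cap are hypothetical. The gcloud call reads the autoscaling.maxNodeCount field of the node pool resource.

```shell
# Hypothetical helper: read cluster-autoscaler-status text on stdin and print
# the name of every node group whose cloudProviderTarget equals its maxSize.
pools_at_cap() {
  awk '
    /^[ \t]*Name:/ { name = $2 }
    /cloudProviderTarget=/ {
      target = ""; max = ""
      # 20 = length of "cloudProviderTarget=", 8 = length of "maxSize="
      if (match($0, /cloudProviderTarget=[0-9]+/)) target = substr($0, RSTART + 20, RLENGTH - 20)
      if (match($0, /maxSize=[0-9]+/))             max    = substr($0, RSTART + 8,  RLENGTH - 8)
      if (target != "" && target == max) print name
    }
  '
}

# Hypothetical helper: print the GKE-side per-zone cap for one pool.
# Requires gcloud and credentials; not called automatically.
# $1 = project, $2 = cluster, $3 = location, $4 = pool
gke_pool_cap() {
  gcloud --project "$1" container node-pools describe "$4" \
    --cluster "$2" --location "$3" \
    --format 'value(autoscaling.maxNodeCount)'
}
```

One possible usage is to pipe the output of the kubectl describe command above into pools_at_cap, then compare the value printed by gke_pool_cap with the Terraform-reported metric for the same pool.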
Recent changes
- Recent config-mgmt MRs: node pool caps live here.
Troubleshooting
- Check pending pods and their reasons (see KubeSchedulingFailures).
- If the pool has been at cap and the workload is healthy, raise gke_nodepool_max_nodes for the relevant pool in config-mgmt. Apply via Atlantis.
- Before raising, verify:
  - The cluster subnet has enough IPs. maxSize cannot exceed the number of IPs in the pod CIDR range. Follow k8s-operations.md (Add a secondary pod IP range) if the subnet is the binding limit.
  - GCP quota (CPUs, in-use IPs, disk types, instance group size) allows more nodes.
- If the cap is high enough but new nodes fail to come up, follow the scale.up.error.* diagnostics in KubeSchedulingFailures.
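For the subnet check, a quick estimate of how many nodes a pod range can serve may help. The sketch assumes each node is assigned a fixed-size slice of the pod range, a /24 per node when the GKE default of 110 max pods per node is in use; clusters with a different max-pods-per-node setting use a different slice size, so confirm the value before relying on the result. The name max_nodes_for_pod_range is hypothetical.

```shell
# Hypothetical helper: number of nodes a pod range can serve when each node
# receives one fixed-size slice of it.
# $1 = pod range prefix length (e.g. 14 for a /14)
# $2 = per-node prefix length (assumed 24 if omitted)
max_nodes_for_pod_range() {
  if [ "${2:-24}" -lt "$1" ]; then
    # The pod range is smaller than a single per-node slice.
    echo 0
    return
  fi
  echo $(( 1 << (${2:-24} - $1) ))
}

# The pod range itself can be inspected with (placeholders as elsewhere in this playbook):
#   gcloud --project <PROJECT> compute networks subnets describe <SUBNET> \
#     --region <REGION> --format 'json(secondaryIpRanges)'
```

For example, `max_nodes_for_pod_range 14` prints 1024, so a cap above 1024 nodes across all pools sharing a /14 pod range could not be honoured under these assumptions.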
Possible resolutions
- Raise gke_nodepool_max_nodes for the pool via a config-mgmt MR + Atlantis apply.
- Rebalance workloads onto a different pool with headroom.
- Right-size the workload’s resource requests if they are inflated.
Dependencies
- GCP quota.
- Cluster subnet IP space.
- Terraform-exported cap metric (terraform_report_google_cluster_node_pool_max_node_count).
Escalation
- #g_fleet_management for anything cluster-wide.
Definitions
- Saturation resource definition: libsonnet/saturation-monitoring/kube_pool_max_nodes.libsonnet
- Generated rule: mimir-rules/gitlab-gprd/kube/autogenerated-gitlab-gprd-kube-saturation-alerts.yml
- Tunable parameters: slos.soft (0.90) and slos.hard (0.95).
Related Links
- Related alerts
  - GKENodeCountHigh: soft-threshold interpretation.
  - KubeSchedulingFailures: pods stuck Unschedulable.
- k8s-operations.md: Create a new node pool