GKENodeCountCritical

Overview

A GKE node pool has reached or is about to reach its configured maximum number of nodes (max_node_count in Terraform). Cluster Autoscaler cannot add more nodes to that pool until the cap is raised.

The cap applies per zone. For a zonal cluster, the maximum number of nodes in the pool equals the cap; for a regional cluster, the maximum is the cap multiplied by the number of zones the cluster spans.
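
As a rough illustration of that arithmetic, the sketch below computes the effective ceiling for a pool. The function name effective_max_nodes and the example numbers are hypothetical and are not taken from config-mgmt.

```shell
# Effective node ceiling for a pool: the per-zone cap times the number of zones.
# Arguments: <per-zone cap> <zone count>
effective_max_nodes() {
  echo $(( $1 * $2 ))
}

effective_max_nodes 10 1   # zonal cluster: prints 10
effective_max_nodes 10 3   # regional cluster across 3 zones: prints 30
```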

Services

Kubernetes Service Overview
Owner: Fleet Management

Metrics

SLI: gitlab_component_saturation:ratio{component="kube_pool_max_nodes"}
Soft SLO: 90%. Hard SLO: 95% (this alert fires when saturation exceeds the hard SLO).
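
To make the thresholds concrete, here is a minimal sketch that classifies a pool by its node count. It assumes the saturation ratio is the current node count divided by the cap, which this page does not state explicitly. The function name pool_saturation_state and the example numbers are hypothetical.

```shell
# Classify a pool against the soft (90%) and hard (95%) SLOs.
# Arguments: <current nodes> <max nodes>
# Uses cross-multiplication to avoid integer-division rounding.
pool_saturation_state() {
  current=$1
  max=$2
  if [ $(( current * 100 )) -gt $(( max * 95 )) ]; then
    echo "above-hard"
  elif [ $(( current * 100 )) -gt $(( max * 90 )) ]; then
    echo "above-soft"
  else
    echo "ok"
  fi
}

pool_saturation_state 48 50   # 96%: prints above-hard
pool_saturation_state 46 50   # 92%: prints above-soft
pool_saturation_state 40 50   # 80%: prints ok
```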
Dashboard: Kube pool max nodes saturation
Node pool caps live in config-mgmt — see the gke_nodepool_max_nodes variable in each environment’s variables.tf (for example environments/gprd/variables.tf) and its use in gke-regional.tf / gke-zonal.tf. The caps are exported to Prometheus as terraform_report_google_cluster_node_pool_max_node_count via a CI job on Terraform runs.

Alert Behavior

Severity s3.
Scope silences to (cluster, label_pool).

Severities

S2 when workload pods are pending due to lack of capacity — see KubeSchedulingFailures firing on the same cluster.
S3 when at cap but scheduling is not currently failing.

Verification

Identify the affected pool. The alert labels include cluster, label_pool, and shard:

```
gcloud --project <PROJECT> container node-pools list \
  --cluster <CLUSTER> --location <LOCATION>
```

Cross-check the Cluster Autoscaler status:
```
kubectl describe configmap cluster-autoscaler-status -n kube-system
```
Look for a node group where cloudProviderTarget == maxSize.
Confirm the Terraform-reported cap matches the GKE-side cap. Divergence means Terraform hasn’t run since the last change.
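
One way to run that comparison from a shell is sketched below. The gcloud call reads the autoscaling.maxNodeCount field of the node pool. The helper names gke_side_cap and caps_match are hypothetical, and the Terraform-side value is assumed to be read by hand from the terraform_report_google_cluster_node_pool_max_node_count metric.

```shell
# Print the autoscaler cap that GKE currently holds for a pool.
# Arguments: <project> <cluster> <location> <pool>
gke_side_cap() {
  gcloud --project "$1" container node-pools describe "$4" \
    --cluster "$2" --location "$3" \
    --format='value(autoscaling.maxNodeCount)'
}

# Compare the Terraform-reported cap with the GKE-side cap.
# Arguments: <terraform cap> <gke cap>
caps_match() {
  if [ "$1" -eq "$2" ]; then
    echo "match"
  else
    echo "diverged: terraform=$1 gke=$2"
  fi
}

# Example, with placeholders to fill in:
#   caps_match 30 "$(gke_side_cap <PROJECT> <CLUSTER> <LOCATION> <POOL>)"
```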

Recent changes

Recent config-mgmt MRs — node pool caps live here.

Troubleshooting

Check pending pods and their reasons — see KubeSchedulingFailures.
If the pool has been at cap and the workload is healthy, raise gke_nodepool_max_nodes for the relevant pool in config-mgmt. Apply via Atlantis.
Before raising, verify:
- The cluster subnet has enough IPs. Each node reserves its own block of pod IPs, so maxSize cannot exceed the number of such blocks that fit in the pod CIDR range. If the subnet is the binding limit, follow the "Add a secondary pod IP range" section of k8s-operations.md.
- GCP quota (CPUs, in-use IPs, disk types, instance group size) allows more nodes.
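
For the subnet check, the arithmetic can be sketched as follows. It assumes each node reserves a /24 block from the pod range, which is the GKE default when the maximum pods per node is 110; pools with a different pods-per-node setting reserve a different block size. The function name max_nodes_for_pod_range is hypothetical.

```shell
# Number of nodes a pod range can serve.
# Arguments: <pod range prefix length> [per-node prefix length, default 24]
max_nodes_for_pod_range() {
  node_prefix=${2:-24}
  echo $(( 1 << (node_prefix - $1) ))
}

max_nodes_for_pod_range 14      # /14 pod range, /24 per node: prints 1024
max_nodes_for_pod_range 17 25   # /17 pod range, /25 per node: prints 256
```

Compare the result with the pool's effective maximum (the per-zone cap times the zone count for a regional cluster) before raising the cap.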
If the cap is high enough but new nodes fail to come up, follow the scale.up.error.* diagnostics in KubeSchedulingFailures.

Possible resolutions

Raise gke_nodepool_max_nodes for the pool via a config-mgmt MR + Atlantis apply.
Rebalance workloads onto a different pool with headroom.
Right-size the workload’s resource requests if they are inflated.

Dependencies

GCP quota.
Cluster subnet IP space.
Terraform-exported cap metric (terraform_report_google_cluster_node_pool_max_node_count).

Escalation

#g_fleet_management for anything cluster-wide.

Definitions

Saturation resource definition: libsonnet/saturation-monitoring/kube_pool_max_nodes.libsonnet
Generated rule: mimir-rules/gitlab-gprd/kube/autogenerated-gitlab-gprd-kube-saturation-alerts.yml
Tunable parameters: slos.soft (0.90) and slos.hard (0.95).

Related alerts
GKENodeCountHigh — soft-threshold interpretation.
KubeSchedulingFailures — pods stuck Unschedulable.
k8s-operations.md — Create a new node pool