
GKENodeCountCritical

A GKE node pool has reached or is about to reach its configured maximum number of nodes (max_node_count in Terraform). Cluster Autoscaler cannot add more nodes to that pool until the cap is raised.

The cap applies per zone. On a zonal cluster the pool can grow to exactly the cap; on a regional cluster it can grow to the cap multiplied by the number of zones the cluster spans.
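As a rough illustration of the per-zone arithmetic, the sketch below multiplies a cap by a zone count. The function name `effective_max_nodes` and the numbers are hypothetical and are not taken from any environment's configuration.

```shell
# effective_max_nodes CAP ZONES
# Prints the largest node count the pool can reach: the per-zone cap
# multiplied by the number of zones (pass 1 for a zonal cluster).
effective_max_nodes() {
  cap="$1"
  zones="$2"
  echo $((cap * zones))
}

# Hypothetical cap of 20: zonal cluster first, then a regional
# cluster spanning three zones.
effective_max_nodes 20 1   # prints 20
effective_max_nodes 20 3   # prints 60
```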

  • SLI: gitlab_component_saturation:ratio{component="kube_pool_max_nodes"}
  • Soft SLO: 90%. Hard SLO: 95% (alert fires above hard).
  • Dashboard: Kube pool max nodes saturation
  • Node pool caps live in config-mgmt: see the gke_nodepool_max_nodes variable in each environment’s variables.tf (for example environments/gprd/variables.tf) and its use in gke-regional.tf / gke-zonal.tf. A CI job exports the caps to Prometheus as terraform_report_google_cluster_node_pool_max_node_count on Terraform runs.
  • Severity s3.
  • Scope silences to (cluster, label_pool).
  • S2 when workload pods are pending due to lack of capacity. Check whether KubeSchedulingFailures is firing on the same cluster.
  • S3 when at cap but scheduling is not currently failing.
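To make the 90% and 95% thresholds concrete, here is a minimal sketch that classifies a pool from two numbers. It assumes the saturation ratio is the current node count divided by the maximum node count; the function name `saturation_state` and the sample values are hypothetical.

```shell
# saturation_state CURRENT MAX
# Classifies a pool against the 90% soft and 95% hard thresholds.
# Cross-multiplies instead of dividing so integer math stays exact.
saturation_state() {
  current="$1"
  max="$2"
  if [ $((current * 100)) -gt $((max * 95)) ]; then
    echo "above hard SLO"
  elif [ $((current * 100)) -gt $((max * 90)) ]; then
    echo "above soft SLO"
  else
    echo "ok"
  fi
}

saturation_state 58 60   # about 96.7%, prints "above hard SLO"
saturation_state 55 60   # about 91.7%, prints "above soft SLO"
saturation_state 30 60   # 50%, prints "ok"
```

A pool sitting at exactly 95% is reported as above the soft threshold only, matching the note that the alert fires above the hard SLO.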
  1. Identify the affected pool. The alert labels include cluster, label_pool, and shard:

    gcloud --project <PROJECT> container node-pools list \
    --cluster <CLUSTER> --location <LOCATION>
  2. Cross-check the Cluster Autoscaler status:

    kubectl describe configmap cluster-autoscaler-status -n kube-system

    Look for a node group where cloudProviderTarget == maxSize.

  3. Confirm the Terraform-reported cap matches the GKE-side cap. Divergence means Terraform hasn’t run since the last change.
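The two checks above can be sketched as small helpers. This is an illustration under assumptions, not a supported tool: `at_cap_groups` assumes the older plain-text layout of the cluster-autoscaler-status ConfigMap, where each node group has a `Name:` line followed by a `Health:` line containing `cloudProviderTarget=N (minSize=A, maxSize=B)`. Newer Cluster Autoscaler releases may emit a different layout, in which case the parsing needs adjusting. Both function names are hypothetical.

```shell
# at_cap_groups: reads cluster-autoscaler-status text on stdin and
# prints each node group whose cloudProviderTarget equals its maxSize.
# Assumes the legacy plain-text status layout (see the note above).
at_cap_groups() {
  awk '
    /Name:/ { name = $2 }
    /cloudProviderTarget=/ && /maxSize=/ {
      target = $0
      sub(/.*cloudProviderTarget=/, "", target)
      sub(/[^0-9].*/, "", target)
      max = $0
      sub(/.*maxSize=/, "", max)
      sub(/[^0-9].*/, "", max)
      if (target + 0 == max + 0) print name, "target=" target, "max=" max
    }
  '
}

# caps_match TERRAFORM_CAP GKE_CAP
# Compares the Terraform-reported cap with the cap GKE reports.
caps_match() {
  if [ "$1" -eq "$2" ]; then
    echo "match"
  else
    echo "diverged: terraform=$1 gke=$2"
  fi
}

# Example usage (needs cluster access, so left as comments):
#   kubectl get configmap cluster-autoscaler-status -n kube-system \
#     -o jsonpath='{.data.status}' | at_cap_groups
#   gke_cap="$(gcloud --project <PROJECT> container node-pools describe <POOL> \
#     --cluster <CLUSTER> --location <LOCATION> \
#     --format='value(autoscaling.maxNodeCount)')"
#   caps_match <TERRAFORM_CAP> "$gke_cap"
```

`<TERRAFORM_CAP>` stands for the value of terraform_report_google_cluster_node_pool_max_node_count read from Prometheus for the same pool.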

  1. Check pending pods and their reasons — see KubeSchedulingFailures.
  2. If the pool has been at cap and the workload is healthy, raise gke_nodepool_max_nodes for the relevant pool in config-mgmt. Apply via Atlantis.
  3. Before raising, verify:
    • The cluster subnet has enough IPs. Each node reserves a block of addresses from the pod CIDR range, so maxSize cannot exceed the number of node blocks that range can hold. If the subnet is the binding limit, follow k8s-operations.md (Add a secondary pod IP range).
    • GCP quota (CPUs, in-use IPs, disk types, instance group size) allows more nodes.
  4. If the cap is high enough but new nodes fail to come up, follow the scale.up.error.* diagnostics in KubeSchedulingFailures.
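For the subnet check in step 3, a back-of-the-envelope estimate of how many nodes a pod range can hold may help. The sketch assumes a VPC-native cluster in which every node reserves one fixed-size block from the pod range (a /24 per node is the GKE default at the default maximum pods per node). The function name `max_nodes_for_pod_range` and the prefix lengths are hypothetical examples; confirm the actual ranges in the cluster's configuration.

```shell
# max_nodes_for_pod_range RANGE_PREFIX NODE_PREFIX
# Prints how many per-node blocks of size /NODE_PREFIX fit in a pod
# range of size /RANGE_PREFIX: 2^(NODE_PREFIX - RANGE_PREFIX).
max_nodes_for_pod_range() {
  range_prefix="$1"
  node_prefix="$2"
  echo $((1 << (node_prefix - range_prefix)))
}

max_nodes_for_pod_range 14 24   # prints 1024
max_nodes_for_pod_range 21 24   # prints 8
```

If the total node count across all pools sharing the range would exceed this figure after the cap is raised, the pod range is the binding limit rather than the cap.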
  • Raise gke_nodepool_max_nodes for the pool via a config-mgmt MR + Atlantis apply.
  • Rebalance workloads onto a different pool with headroom.
  • Right-size the workload’s resource requests if they are inflated.
  • GCP quota.
  • Cluster subnet IP space.
  • Terraform-exported cap metric (terraform_report_google_cluster_node_pool_max_node_count).
  • #g_fleet_management for anything cluster-wide.