GKENodeCountHigh

A GKE node pool is running at over 90% of its max_node_count. Not urgent, but worth investigating whether the pool needs its cap raised or workloads redistributed before we hit the hard threshold.

  • SLI: gitlab_component_saturation:ratio{component="kube_pool_max_nodes"} — same SLI as GKENodeCountCritical.

  • Soft SLO: 90% (this playbook). Hard SLO: 95% (GKENodeCountCritical).

  • Longer-term trend (number of gprd GKE instances reporting uptime):

    count(stackdriver_gce_instance_compute_googleapis_com_instance_uptime{instance_name=~"gke-gprd.*"})

  • Node pool caps live in config-mgmt: see the gke_nodepool_max_nodes variable in each environment’s variables.tf (for example environments/gprd/variables.tf) and its use in gke-regional.tf / gke-zonal.tf.
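To make the two thresholds above concrete, here is a minimal sketch that classifies a pool from its current node count and its cap. The function name `classify_saturation` is hypothetical and not part of any existing tooling, and it assumes both thresholds are strict ("over 90%", "over 95%"); the actual alert rules may compare differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any existing tooling.
# Classifies a pool as ok / soft / hard from its node count and its cap.
# Assumption: both thresholds are strict (over 90%, over 95%).
classify_saturation() {
  local current="$1" max="$2"
  if [ "$max" -le 0 ]; then
    echo "invalid"
    return 1
  fi
  # Compare cross-multiplied integers to avoid floating point and rounding.
  if [ $(( current * 100 )) -gt $(( max * 95 )) ]; then
    echo "hard"
  elif [ $(( current * 100 )) -gt $(( max * 90 )) ]; then
    echo "soft"
  else
    echo "ok"
  fi
}

classify_saturation 46 50   # prints "soft" (46/50 = 92%)
```

A pool at exactly 45 of 50 nodes (90%) is reported as "ok" under the strict-comparison assumption.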

  • Non-paging. Only the hard threshold generates an Alertmanager alert; the soft threshold is a dashboard signal.
  • Silences do not apply — nothing fires from the soft threshold.
  • Investigatory only. If the ratio trends toward the hard threshold, switch to the GKENodeCountCritical playbook and act on it.

On the Kube pool max nodes saturation dashboard, break down saturation by pool to spot which pools sit consistently above the soft threshold.

gcloud --project <PROJECT> container node-pools list \
--cluster <CLUSTER> --location <LOCATION>
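
The listing above does not show autoscaling caps in its default columns. As a sketch, the same command can be wrapped with a `--format` projection to print each pool's bounds. The wrapper name `list_pool_caps` and its argument order are hypothetical; the `autoscaling.*` keys are assumed to match the fields of the GKE NodePool resource, and for regional clusters the reported maximum is understood to apply per zone, so check it against the Terraform value before comparing.

```shell
# Hypothetical wrapper; the function name and argument order are illustrative.
# Prints each node pool with its autoscaling bounds.
list_pool_caps() {
  local project="$1" cluster="$2" location="$3"
  gcloud --project "$project" container node-pools list \
    --cluster "$cluster" --location "$location" \
    --format="table(name, autoscaling.enabled, autoscaling.minNodeCount, autoscaling.maxNodeCount)"
}

# Usage: list_pool_caps <PROJECT> <CLUSTER> <LOCATION>
```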
  1. Investigate which workloads are running on the pool and whether their footprint has grown organically:

    kubectl get nodes -l cloud.google.com/gke-nodepool=<pool>
    kubectl top nodes -l cloud.google.com/gke-nodepool=<pool>
  2. Options:

    • Raise gke_nodepool_max_nodes for the pool in config-mgmt (small MR, low risk).
    • Rebalance workloads onto a different pool with headroom.
    • Right-size the workload’s resource requests if they are inflated.
  3. If subnet IPs are the binding constraint rather than the pool cap, follow k8s-operations.md — Add a secondary pod IP range.
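
When weighing the option of raising the cap, it can help to know the smallest cap at which the pool would no longer be over the 90% soft threshold. The sketch below computes that lower bound from the current node count; `min_cap_below_soft` is a hypothetical helper, and a real change to gke_nodepool_max_nodes would normally add headroom for growth on top of this figure.

```shell
# Hypothetical helper: smallest cap at which <current_nodes> is at or
# below 90% saturation, i.e. ceil(current * 100 / 90) in integer arithmetic.
min_cap_below_soft() {
  local current="$1"
  echo $(( (current * 100 + 89) / 90 ))
}

min_cap_below_soft 46   # prints 52 (46/52 is about 88.5%)
```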

  • Raise the pool cap proactively.
  • Rebalance / right-size workloads.

Same as GKENodeCountCritical.

#g_fleet_management.