GKENodeCountHigh
Overview
A GKE node pool is running at over 90% of its `max_node_count`. This is not urgent, but it is worth investigating whether the pool needs its cap raised or its workloads redistributed before it reaches the 95% hard threshold.
Services
Metrics
- SLI: `gitlab_component_saturation:ratio{component="kube_pool_max_nodes"}`, the same SLI as `GKENodeCountCritical`.
- Soft SLO: 90% (this playbook). Hard SLO: 95% (`GKENodeCountCritical`).
- Longer-term trend: `count(stackdriver_gce_instance_compute_googleapis_com_instance_uptime{instance_name=~"gke-gprd.*"})`
- Node pool caps live in `config-mgmt`: see the `gke_nodepool_max_nodes` variable in each environment's `variables.tf` (for example `environments/gprd/variables.tf`) and its use in `gke-regional.tf` / `gke-zonal.tf`.
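To illustrate how the soft and hard SLOs relate, the sketch below classifies a pool from its current node count and its cap. The function name `classify_pool` is hypothetical, and treating both thresholds as strictly "over" the percentage is an assumption drawn from the wording of this playbook. The authoritative evaluation is the saturation SLI itself, not this script.

```shell
# Hypothetical helper: classify a pool against the 90% soft and 95% hard SLOs.
# Usage: classify_pool <current_nodes> <max_nodes>
classify_pool() {
  current=$1
  max=$2
  # Cross-multiply instead of dividing, so no rounding and no divide-by-zero.
  if [ $(( current * 100 )) -gt $(( max * 95 )) ]; then
    echo "hard"   # above the hard SLO (assumed strict comparison)
  elif [ $(( current * 100 )) -gt $(( max * 90 )) ]; then
    echo "soft"   # above the soft SLO only
  else
    echo "ok"
  fi
}
```

For example, a pool with 46 nodes and a cap of 50 sits at 92% and would be reported as `soft`.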
Alert Behavior
- Non-paging. Only the hard threshold generates an Alertmanager alert; the soft threshold is a dashboard signal.
- Silences do not apply — nothing fires from the soft threshold.
Severities
- Investigatory only. If the ratio trends toward the hard threshold, escalate by following the `GKENodeCountCritical` playbook.
Verification
Break down saturation by pool from the Kube pool max nodes saturation dashboard to spot which pools sit consistently above the soft threshold.

```shell
gcloud --project <PROJECT> container node-pools list \
  --cluster <CLUSTER> --location <LOCATION>
```

Recent changes
- Recent `config-mgmt` MRs: node pool caps live here.
Troubleshooting
- Investigate which workloads are running on the pool and whether their footprint has grown organically:

  ```shell
  kubectl get nodes -l cloud.google.com/gke-nodepool=<pool>
  kubectl top nodes -l cloud.google.com/gke-nodepool=<pool>
  ```

- Options:
  - Raise `gke_nodepool_max_nodes` for the pool in `config-mgmt` (small MR, low risk).
  - Rebalance workloads onto a different pool with headroom.
  - Right-size the workload's resource requests if they are inflated.
- If subnet IPs are the binding constraint rather than the pool cap, follow the "Add a secondary pod IP range" section of k8s-operations.md.
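When it is not obvious which pool is closest to its cap, a per-pool node count can help. The following is a minimal sketch, not an established tool: `count_by_pool` is a hypothetical helper that tallies pool names read from standard input, and the `kubectl` invocation in the comment assumes every node carries the `cloud.google.com/gke-nodepool` label.

```shell
# Hypothetical helper: read one pool name per line on stdin and print
# "<pool> <node_count>" lines, largest pool first.
count_by_pool() {
  sort | uniq -c | awk '{print $2, $1}' | sort -k2,2nr
}

# Assumed usage against a live cluster (prints each node's pool label,
# one per line, then tallies them):
#   kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.metadata.labels.cloud\.google\.com/gke-nodepool}{"\n"}{end}' \
#     | count_by_pool
```

The counts can then be compared against each pool's `gke_nodepool_max_nodes` value in `config-mgmt`.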
Possible resolutions
- Raise the pool cap proactively.
- Rebalance / right-size workloads.
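When raising the cap, it can help to work out the smallest value that brings the pool back to or below a chosen utilization. The sketch below does that arithmetic; `min_cap_for` is a hypothetical name, and the target percentage is an input you choose, not a value this playbook prescribes.

```shell
# Hypothetical helper: smallest cap such that current/cap <= target_pct percent.
# Usage: min_cap_for <current_nodes> <target_pct>
min_cap_for() {
  current=$1
  pct=$2
  # Integer ceiling of (current * 100) / pct.
  echo $(( (current * 100 + pct - 1) / pct ))
}
```

For example, a pool running 46 nodes needs a cap of at least 52 to sit at or below 90% (46/51 is still just over 90%). In practice you would likely add further headroom for expected growth.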
Dependencies
Same as `GKENodeCountCritical`.
Escalation
#g_fleet_management.
Definitions
- Saturation resource definition: `libsonnet/saturation-monitoring/kube_pool_max_nodes.libsonnet`
- Same SLI as `GKENodeCountCritical`; this playbook covers the soft-threshold interpretation only.
Related Links
- Related alerts:
  - `GKENodeCountCritical`: the paging counterpart at 95%.
  - `KubeSchedulingFailures`.