# Runway NAT Gateway Port Allocation
## Overview

- What does this alert mean? A Cloud Run NAT gateway serving egress traffic for Runway services is running low on available TCP/UDP source ports. Each NAT IP address provides 64,512 TCP and 64,512 UDP source ports. When these are exhausted, new outbound connections from Runway services will fail. Note: this alert only covers Cloud Run NAT gateways — GKE NAT gateways use automatic IP allocation (AUTO NAT) and are not monitored by this rule.
- What factors can contribute? High volume of concurrent outbound connections from Runway services, insufficient NAT IP addresses allocated for the region, or connection leaks in services not properly closing connections.
- What parts of the service are affected? All egress traffic from Runway services in the affected region. This includes outbound HTTP calls, external API requests, and connections to downstream dependencies outside the VPC.
- What action is the recipient expected to take? Identify which gateway and region are saturated, determine whether it is a traffic spike or a connection leak, and either add more NAT IPs via Terraform or investigate the offending service.
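To see at a glance which gateway is closest to exhaustion, the saturation metric used by this alert (documented under Metrics below) can be ranked directly in Prometheus. A sketch:

```promql
# Highest-saturation Runway NAT gateways, worst first
topk(5,
  gitlab_component_saturation:ratio{component="runway_nat_gateway_port_allocation"}
)
```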
## Services

- Service: Runway NAT Gateway (Cloud NAT, GCP)
- Team: Runway
## Metrics

- Metric Explanation:
  - `stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports`: Number of ports currently allocated per NAT IP, exported via the Stackdriver exporter.
  - `gcp_cloud_nat_ports_capacity_total`: Total port capacity per gateway — derived from the number of NAT IPs assigned to each gateway multiplied by 64,512. Used as the denominator in the saturation ratio.
  - `gitlab_component_saturation:ratio{component="runway_nat_gateway_port_allocation"}`: Derived saturation ratio — allocated ports divided by total capacity per gateway, joined on `router_id`, `gateway_name`, `region`, and `cloud_provider`.
- Threshold Reasoning:
  - Soft SLO (80%): Capacity planning warning. Tamland will raise a capacity issue when this is breached.
  - Hard SLO (90%): Alert fires and incident.io is notified. At this point services are at risk of connection failures.
- Expected Behavior: Under normal conditions the ratio should be well below 0.5. A sustained climb toward 0.8+ indicates either traffic growth requiring more NAT IPs or a connection leak in a Runway service.
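As a rough sketch of how the derived ratio is put together (the exact recording rules live elsewhere; the label join shown here is an assumption based on the labels listed above):

```promql
# Allocated ports per gateway (numerator)
sum by (router_id, gateway_name, region, cloud_provider) (
  stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports{cloud_provider="gcp"}
)
/
# Total capacity per gateway: NAT IPs x 64,512 per protocol (denominator)
sum by (router_id, gateway_name, region, cloud_provider) (
  gcp_cloud_nat_ports_capacity_total
)
```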
## Alert Behavior

- Silencing: Only silence during planned capacity changes (e.g. adding NAT IPs via Terraform) where a brief spike is expected.
- Expected Frequency: Rare. If firing repeatedly, investigate a connection leak or plan a NAT IP increase.
## Severities

- Severity Assignment: s2 (incident.io) — port exhaustion causes immediate connection failures for all egress traffic in the affected region.
- Impact: All Runway services making outbound connections in the affected region.
- Scope: Cloud Run NAT gateways only (production and staging). GKE NAT gateways are excluded.
- Things to Check:
  - Which `region` and `gateway_name` labels are on the firing alert?
  - Is this a sudden spike or a gradual trend?
  - Which Runway service is generating the most outbound connections?
  - Are connections being properly closed, or is there a leak?
## Verification

- Saturation ratio by region:

  ```promql
  gitlab_component_saturation:ratio{component="runway_nat_gateway_port_allocation"}
  ```

- Raw allocated ports per NAT IP:

  ```promql
  stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports{cloud_provider="gcp"}
  ```

- Total port capacity per gateway:

  ```promql
  gcp_cloud_nat_ports_capacity_total
  ```

- Ports dropped (connection failures already occurring):

  ```promql
  stackdriver_nat_gateway_router_googleapis_com_nat_dropped_received_packets_count{cloud_provider="gcp"}
  ```
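To confirm whether exhaustion is already causing failures, rather than merely approaching the threshold, check that the drop counter is actually increasing. A sketch:

```promql
# A non-zero rate here means connections are being dropped right now
rate(
  stackdriver_nat_gateway_router_googleapis_com_nat_dropped_received_packets_count{cloud_provider="gcp"}[5m]
) > 0
```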
## Troubleshooting

- Identify the saturated region from the alert labels (`region`, `gateway_name`).
- Check the current port allocation trend in the NAT Gateway Port Allocation dashboard.
- Identify the top consumers — which Runway services are making the most outbound connections in the affected region:

  ```promql
  sum by(service_name) (stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports{env="production", region="<affected-region>", cloud_provider="gcp"})
  ```

- Check for connection leaks — look for services with an abnormally high and growing port allocation that isn’t correlated with request rate.
- Check Stackdriver logs for NAT allocation failures:
  - Go to GCP Logs Explorer
  - Filter: `resource.type="nat_gateway"` and `jsonPayload.allocation_status != "OK"`
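For the leak check in the steps above, one way to separate a leak from a spike is to look at the slope of per-service allocation over a longer window (a sketch; the `service_name` and `env` labels follow the top-consumers query above):

```promql
# A sustained positive slope over the last hour suggests a leak rather than a spike
deriv(
  sum by (service_name) (
    stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports{env="production", cloud_provider="gcp"}
  )[1h:5m]
) > 0
```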
## Possible Resolutions

- Short-term (traffic spike): If caused by a specific service, consider rate-limiting outbound connections or restarting the offending service to release leaked connections.
- Long-term (capacity): Add more NAT IPs to the affected region via Terraform:
  1. runway-provisioner changes:
     - Update `nat_ips` for the affected region in `config/networks/gcp.yml`
  2. config-mgmt changes:
     - Update the newly added Runway NAT egress IPs in `infrastructure-ips/locals.tf`. This will trigger plans in other environments, as the list of Runway egress IPs is referenced in other places. Confirm that the changes look good and have the MR applied/merged.
## Escalation

- When to Escalate: If port exhaustion is confirmed (the dropped packets metric is non-zero) and cannot be resolved within 15 minutes.
- Escalation Path:
  - Check the `#f_runway` Slack channel for known issues or ongoing incidents.
  - Escalate to the Runway team lead.
  - If it is a GCP infrastructure issue, open a GCP support ticket.
- Slack Channels:
  - `#f_runway` — Runway team
  - `#s_production_engineering` — Production Engineering