# Runway NAT Gateway Port Allocation
## Overview

- What does this alert mean? A Cloud Run NAT gateway serving egress traffic for Runway services is running low on available TCP/UDP source ports. Each NAT IP address provides 64,512 TCP and 64,512 UDP source ports. When these are exhausted, new outbound connections from Runway services will fail. Note: this alert only covers Cloud Run NAT gateways — GKE NAT gateways use automatic IP allocation (AUTO NAT) and are not monitored by this rule.
- What factors can contribute? High volume of concurrent outbound connections from Runway services, insufficient NAT IP addresses allocated for the region, or connection leaks in services not properly closing connections.
- What parts of the service are affected? All egress traffic from Runway services in the affected region. This includes outbound HTTP calls, external API requests, and connections to downstream dependencies outside the VPC.
- What action is the recipient expected to take? Identify which gateway and region are saturated, determine whether it is a traffic spike or a connection leak, and either add more NAT IPs via Terraform or investigate the offending service.
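To see at a glance which gateway is closest to exhaustion, the saturation metric used by this alert (documented under Metrics below) can be ranked directly in Prometheus. A sketch:

```promql
# Highest-saturation Runway NAT gateways, worst first
topk(5,
  gitlab_component_saturation:ratio{component="runway_nat_gateway_port_allocation"}
)
```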
## Services

- Service: Runway NAT Gateway (Cloud NAT, GCP)
- Team: Runway
## Metrics

- Metric Explanation:
  - `stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports`: Number of ports currently allocated per NAT IP, exported via the Stackdriver exporter.
  - `gcp_cloud_nat_ports_capacity_total`: Total port capacity per gateway — derived from the number of NAT IPs assigned to each gateway multiplied by 64,512. Used as the denominator in the saturation ratio.
  - `gitlab_component_saturation:ratio{component="runway_nat_gateway_port_allocation"}`: Derived saturation ratio — allocated ports divided by total capacity per gateway, joined on `router_id`, `gateway_name`, `region`, and `cloud_provider`.
- Threshold Reasoning:
  - Soft SLO (80%): Capacity planning warning. Tamland will raise a capacity issue when this is breached.
  - Hard SLO (90%): Alert fires and incident.io is notified. At this point services are at risk of connection failures.
- Expected Behavior: Under normal conditions the ratio should be well below 0.5. A sustained climb toward 0.8+ indicates either traffic growth requiring more NAT IPs or a connection leak in a Runway service.
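As a rough sketch of how the derived ratio is put together (the exact recording rules live elsewhere; the label join shown here is an assumption based on the labels listed above):

```promql
# Allocated ports per gateway (numerator)
sum by (router_id, gateway_name, region, cloud_provider) (
  stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports{cloud_provider="gcp"}
)
/
# Total capacity per gateway: NAT IPs x 64,512 per protocol (denominator)
sum by (router_id, gateway_name, region, cloud_provider) (
  gcp_cloud_nat_ports_capacity_total
)
```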
## Alert Behavior

- Silencing: Only silence during planned capacity changes (e.g. adding NAT IPs via Terraform) where a brief spike is expected.
- Expected Frequency: Rare. If firing repeatedly, investigate a connection leak or plan a NAT IP increase.
## Severities

- Severity Assignment: s2 (incident.io) — port exhaustion causes immediate connection failures for all egress traffic in the affected region.
- Impact: All Runway services making outbound connections in the affected region.
- Scope: Cloud Run NAT gateways only (production and staging). GKE NAT gateways are excluded.
- Things to Check:
  - Which `region` and `gateway_name` labels are on the firing alert?
  - Is this a sudden spike or a gradual trend?
  - Which Runway service is generating the most outbound connections?
  - Are connections being properly closed, or is there a leak?
## Verification

- Saturation ratio by region:

  ```promql
  gitlab_component_saturation:ratio{component="runway_nat_gateway_port_allocation"}
  ```

- Raw allocated ports per NAT IP:

  ```promql
  stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports{cloud_provider="gcp"}
  ```

- Total port capacity per gateway:

  ```promql
  gcp_cloud_nat_ports_capacity_total
  ```

- Ports dropped (connection failures already occurring):

  ```promql
  stackdriver_nat_gateway_router_googleapis_com_nat_dropped_received_packets_count{cloud_provider="gcp"}
  ```
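To confirm whether exhaustion is already causing failures, rather than merely approaching the threshold, check that the drop counter is actually increasing. A sketch:

```promql
# A non-zero rate here means connections are being dropped right now
rate(
  stackdriver_nat_gateway_router_googleapis_com_nat_dropped_received_packets_count{cloud_provider="gcp"}[5m]
) > 0
```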
## Troubleshooting

- Identify the saturated region from the alert labels (`region`, `gateway_name`).
- Check the current port allocation trend in the NAT Gateway Port Allocation dashboard.
- Identify the top consumers — which Runway services are making the most outbound connections in the affected region:

  ```promql
  sum by(service_name) (stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports{env="production", region="<affected-region>", cloud_provider="gcp"})
  ```

- Check for connection leaks — look for services with an abnormally high and growing port allocation that isn’t correlated with request rate.
- Check Stackdriver logs for NAT allocation failures:
  - Go to GCP Logs Explorer
  - Filter: `resource.type="nat_gateway"` and `jsonPayload.allocation_status != "OK"`
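For the leak check in the steps above, one way to separate a leak from a spike is to look at the slope of per-service allocation over a longer window (a sketch; the `service_name` and `env` labels follow the top-consumers query above):

```promql
# A sustained positive slope over the last hour suggests a leak rather than a spike
deriv(
  sum by (service_name) (
    stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports{env="production", cloud_provider="gcp"}
  )[1h:5m]
) > 0
```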
## Possible Resolutions

- Short-term (traffic spike): If caused by a specific service, consider rate-limiting outbound connections or restarting the offending service to release leaked connections.
- Long-term (capacity): Add more NAT IPs to the affected region via Terraform:
  1. runway-provisioner changes:
     - Update `nat_ips` for the affected region in `config/networks/gcp.yml`
  2. config-mgmt changes:
     - Update the newly added Runway NAT egress IPs in `infrastructure-ips/locals.tf`. This will trigger plans in other environments, as the list of Runway egress IPs is referenced in other places. Confirm that the changes look good and have the MR applied/merged.
## Escalation

- When to Escalate: If port exhaustion is confirmed (the dropped packets metric is non-zero) and cannot be resolved within 15 minutes.
- Escalation Path:
  - Check the `#f_runway` Slack channel for known issues or ongoing incidents.
  - Escalate to the Runway team lead.
  - If it is a GCP infrastructure issue, open a GCP support ticket.
- Slack Channels:
  - `#f_runway` — Runway team
  - `#s_production_engineering` — Production Engineering