
NAT Gateway Port Allocation

  • What does this alert mean? A Cloud NAT gateway is running low on available TCP/UDP source ports. Each NAT IP address provides 64,512 TCP and 64,512 UDP source ports. When these are exhausted, new outbound connections will fail.
  • What factors can contribute? High volume of concurrent outbound connections, insufficient NAT IP addresses allocated for the region, or connection leaks in services not properly closing connections.
  • What parts of the service are affected? All egress traffic through the affected NAT gateway. This includes outbound HTTP calls, external API requests, SMTP connections, webhook deliveries, and connections to downstream dependencies outside the VPC.
  • What action is the recipient expected to take? Identify which gateway and project is saturated, determine whether it is a traffic spike or a connection leak, and either add more NAT IPs via Terraform or investigate the offending service.

This alert covers NAT gateways across four GCP projects:

  • gprd/gstg (gitlab-production, gitlab-staging-1): Core GitLab infrastructure NAT gateways. Team: Production Engineering.
  • Runway (gitlab-runway-production, gitlab-runway-staging): Cloud Run NAT gateways for Runway services. GKE NAT gateways use automatic IP allocation (AUTO NAT) and are excluded. Team: Runway.
  • Metric Explanation:

    • stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports: Number of ports currently allocated per NAT IP, exported via the Stackdriver exporter.
    • gcp_cloud_nat_ports_capacity_total: Total port capacity per gateway — derived from the number of NAT IPs assigned to each gateway multiplied by 64,512. Used as the denominator in the saturation ratio.
    • gitlab_component_saturation:ratio{component="nat_gateway_port_allocation"}: Derived saturation ratio — allocated ports divided by total capacity per gateway, joined on router_id, gateway_name, region, and project_id. Alert labels include gateway_name and project_id.
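The derivation above can be sketched numerically. This is a minimal illustration of the ratio calculation, not real gateway data — the constant 64,512 and the metric relationship are from the list above; the example numbers are made up.

```python
# Sketch of the saturation-ratio derivation described above.
# Values are illustrative, not real gateway data.

PORTS_PER_NAT_IP = 64_512  # TCP (or UDP) source ports per NAT IP

def saturation_ratio(allocated_ports: int, num_nat_ips: int) -> float:
    """Allocated ports divided by total capacity, mirroring
    gitlab_component_saturation:ratio{component="nat_gateway_port_allocation"}."""
    # Capacity is what gcp_cloud_nat_ports_capacity_total represents:
    # number of NAT IPs on the gateway times 64,512.
    capacity = num_nat_ips * PORTS_PER_NAT_IP
    return allocated_ports / capacity

# Example: a gateway with 4 NAT IPs and 210,000 ports currently allocated.
ratio = saturation_ratio(210_000, 4)
print(f"{ratio:.2%}")  # ~81% -> above the 80% soft SLO, below the 90% hard SLO
```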
  • Mimir Tenants:

    • gitlab-production (gprd) → Mimir - Gitlab Gprd
    • gitlab-staging-1 (gstg) → Mimir - Gitlab Gstg
    • gitlab-runway-production, gitlab-runway-staging → Mimir - Runway
  • Threshold Reasoning:

    • Soft SLO (80%): Capacity planning warning. Tamland will raise a capacity issue when this is breached.
    • Hard SLO (90%): Alert fires and incident.io is notified. At this point services are at risk of connection failures.
  • Silencing: Only silence during planned capacity changes (e.g. adding NAT IPs via Terraform) where a brief spike is expected.
  • Expected Frequency: Rare. If firing repeatedly, investigate connection leak or plan a NAT IP increase.
  • Severity Assignment: s2 (incident.io) — port exhaustion causes immediate connection failures for all egress traffic through the affected gateway.

    • Impact: All services making outbound connections through the affected NAT gateway.
    • Scope: All four GCP projects. For Runway, Cloud Run NAT gateways only — GKE NAT gateways are excluded.
  • Things to Check:

    • Which gateway_name and project_id labels are on the firing alert?
      • gitlab-production or gitlab-staging-1 → gprd/gstg gateway
      • gitlab-runway-production or gitlab-runway-staging → Runway gateway
    • Is this a sudden spike or a gradual trend?
    • Are connections being properly closed, or is there a leak?
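The spike-versus-leak question above can be checked against a short series of saturation samples pulled from Mimir. This is a hypothetical helper with illustrative thresholds, not part of the alerting pipeline:

```python
# Hypothetical helper: classify a series of saturation-ratio samples
# (e.g. one per 5 minutes from Mimir) as a sudden spike or a gradual trend.
# The 0.15 threshold is illustrative, not an SLO value.

def classify_trend(samples: list[float], jump_threshold: float = 0.15) -> str:
    """'spike' if any single step jumps more than jump_threshold,
    'gradual' if saturation rose steadily overall, else 'flat'."""
    steps = [b - a for a, b in zip(samples, samples[1:])]
    if any(s > jump_threshold for s in steps):
        return "spike"    # likely a traffic burst from one service
    if samples[-1] - samples[0] > jump_threshold:
        return "gradual"  # likely a connection leak or organic growth
    return "flat"

print(classify_trend([0.55, 0.57, 0.85]))        # spike
print(classify_trend([0.55, 0.62, 0.71, 0.80]))  # gradual
```

A spike points at rate-limiting or restarting the offending service; a gradual climb points at a connection leak or a planned NAT IP increase.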
  • Saturation ratio by gateway:

    gitlab_component_saturation:ratio{component="nat_gateway_port_allocation"}
  • Filter by project:

    gitlab_component_saturation:ratio{
    component="nat_gateway_port_allocation",
    project_id="gitlab-runway-production"
    }
  • Raw allocated ports per NAT IP:

    stackdriver_nat_gateway_router_googleapis_com_nat_allocated_ports
  • Total port capacity per gateway:

    gcp_cloud_nat_ports_capacity_total
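Since capacity is NAT IPs × 64,512, the queries above also support back-of-envelope sizing. A sketch, with made-up VM counts and a per-VM port figure that is an assumption (check the gateway's actual minimum-ports-per-VM setting):

```python
# Back-of-envelope sizing against gcp_cloud_nat_ports_capacity_total:
# capacity = NAT IPs x 64,512. Estimate how many NAT IPs keep the gateway
# under the 80% soft SLO. All inputs here are illustrative.
import math

PORTS_PER_NAT_IP = 64_512

def nat_ips_needed(num_vms: int, ports_per_vm: int, target: float = 0.80) -> int:
    demand = num_vms * ports_per_vm
    # Smallest IP count whose capacity keeps demand below target saturation.
    return math.ceil(demand / (PORTS_PER_NAT_IP * target))

# 100 VMs at 1,024 ports each -> 102,400 ports of demand.
print(nat_ips_needed(100, 1024))  # 2 NAT IPs keep saturation below 80%
```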
  1. Identify the saturated gateway from the alert labels (gateway_name, project_id).

  2. Check current port allocation trend (gprd/gstg only): NAT Gateway Port Allocation dashboard shows per-VM NAT port allocation to identify which hosts are spiking in NAT port usage.

  3. Check Stackdriver logs for NAT allocation failures:

    • Go to GCP Logs Explorer
    • Filter: resource.type="nat_gateway" AND jsonPayload.allocation_status != "OK"
  • Short-term (traffic spike): If caused by a specific service, consider rate-limiting outbound connections or restarting the offending service to release leaked connections.

  • Long-term (capacity): Add more NAT IPs to the affected region via Terraform.
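Before raising the Terraform change, it helps to size it. A hypothetical calculation (example numbers are made up) of how many extra NAT IPs bring a saturated gateway back under the 80% soft SLO:

```python
# Hypothetical sizing for the Terraform change: how many NAT IPs to add
# so allocated / capacity drops back below the 80% soft SLO.
# Example inputs are illustrative, not real gateway data.
import math

PORTS_PER_NAT_IP = 64_512

def extra_nat_ips(allocated_ports: int, current_ips: int,
                  target: float = 0.80) -> int:
    """NAT IPs to add so allocated_ports / capacity < target."""
    ips_required = math.ceil(allocated_ports / (PORTS_PER_NAT_IP * target))
    return max(0, ips_required - current_ips)

# Gateway at 240,000 allocated ports on 4 IPs (~93%, hard SLO breached):
print(extra_nat_ips(240_000, 4))  # 1 more IP -> 240,000 / 322,560 ~ 74%
```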

    gprd/gstg (gitlab-production, gitlab-staging-1)


    For gprd (gitlab-production):

    For gstg (gitlab-staging-1):

    Runway (gitlab-runway-production, gitlab-runway-staging)


    1. runway-provisioner changes:

    2. config-mgmt changes:

    • Update the newly added Runway NAT egress IPs in infrastructure-ips/locals.tf. This will trigger plans in other environments, because the list of Runway egress IPs is referenced in other places. Confirm the plans look correct, then have the MR reviewed, merged, and applied.
  • When to Escalate: If port exhaustion is confirmed and cannot be resolved within 15 minutes.

  • Escalation Path:

    • gprd/gstg gateways:
      1. Check #s_production_engineering Slack channel.
      2. Escalate to the Production Engineering on-call.
    • Runway gateways:
      1. Check #f_runway Slack channel for known issues or ongoing incidents.
      2. Escalate to the Runway team lead.
    • For any GCP infrastructure issue, open a GCP support ticket.
  • Slack Channels:

    • #s_production_engineering — Production Engineering
    • #f_runway — Runway team