Cloud NAT Troubleshooting
Background
Unless a static IP is needed for ingress, most of our GCP VMs should not have static IPs and should access the internet via a managed Cloud NAT instance. This service divides a pool of IPs, each with a number of TCP and UDP ports available for NAT mapping, between VMs in its covered region/subnetworks, by dedicating a configurable number of these ports to each VM.
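As a rough illustration of that division, the sketch below computes how many VMs a single NAT IP can serve for a given per-VM port allocation. It assumes 64,512 usable ports per NAT IP per protocol (65,536 minus the 1,024 well-known ports). The function name and the sample allocations are hypothetical and are not taken from our configuration.

```python
# Assumption: each NAT IP offers 64,512 usable source ports per protocol
# (65,536 total minus the 1,024 well-known ports).
PORTS_PER_NAT_IP = 64_512


def vms_per_nat_ip(ports_per_vm: int) -> int:
    """Return how many VMs one NAT IP can serve at a fixed per-VM allocation."""
    if ports_per_vm <= 0:
        raise ValueError("ports_per_vm must be positive")
    # Integer division: a VM cannot be given a partial allocation.
    return PORTS_PER_NAT_IP // ports_per_vm


if __name__ == "__main__":
    # Hypothetical allocations, chosen only to show the trend.
    for ports in (64, 1024, 4096):
        print(f"{ports} ports/VM -> {vms_per_nat_ip(ports)} VMs per NAT IP")
```

The larger the per-VM allocation, the fewer VMs each IP in the pool can cover, which is why raising the allocation can require a larger pool.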
High Cloud NAT error rate
Most likely an alert brought you here, or you noticed an elevated error rate in the dashboard.
Option 1: Do nothing. Clients see periodic bursts of NAT errors as dropped packets, and higher-layer protocols should retry. However, a large sustained error rate can outlast those retries and surface as failures for clients. Note that if the environment in question is CI, raising the NAT ports per VM will allow user jobs to create more concurrent connections to the same outbound address, which may not be desirable.
Option 2: Increase NAT port space:
- Locate the Terraform declaration for the NAT instance in question in gitlab-com-infrastructure. This will be an instance of the cloud-nat module.
- Bump nat_ports_per_vm.
- Verify that we will still have enough IPs in the instance’s pool for all VMs. Use the formula, rounding the result up to a whole number of IPs:
  nat_ip_count = M * P / 64,512
  Where:
  M = number of machines in the region/subnets (multiply by some generous number to account for future growth)
  P = NAT ports per VM (see the nat_ports_per_vm variable)
  64,512 = usable ports per NAT IP
  See https://cloud.google.com/nat/docs/overview#number_of_nat_ports_and_connections
- If we would run out of ports according to the above formula, raise either nat_ip_count or imported_ip_count (whichever is set). Note that imported_ip_count will only be a variable if https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/cloud-nat/merge_requests/11 is merged.
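The IP-count check can be sketched as a small calculation. This is a minimal example under stated assumptions rather than a tool we ship: the function name, the growth_factor parameter (standing in for the "generous number" applied to M), and the sample figures are all hypothetical.

```python
import math

# From the formula above: usable ports per NAT IP.
PORTS_PER_NAT_IP = 64_512


def required_nat_ips(machines: int, ports_per_vm: int, growth_factor: float = 2.0) -> int:
    """Return the smallest NAT IP count that satisfies M * P / 64,512.

    growth_factor is a hypothetical headroom multiplier applied to the
    current machine count (M) to allow for future growth.
    """
    if machines < 0 or ports_per_vm <= 0 or growth_factor < 1:
        raise ValueError("expected machines >= 0, ports_per_vm > 0, growth_factor >= 1")
    projected_machines = math.ceil(machines * growth_factor)
    # Round up: a fractional IP cannot be allocated. Always keep at least one IP.
    return max(1, math.ceil(projected_machines * ports_per_vm / PORTS_PER_NAT_IP))


if __name__ == "__main__":
    # Hypothetical fleet: 500 VMs today, doubled for headroom.
    for ports in (512, 1024):
        print(f"{ports} ports/VM -> {required_nat_ips(500, ports)} NAT IPs")
```

If the result exceeds the current nat_ip_count or imported_ip_count, raise that value in the same change that bumps nat_ports_per_vm.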