HAProxy Management at GitLab
Overview
GitLab uses HAProxy to direct HTTP(S) and git traffic to the various fleets in our infrastructure. Clusters of HAProxy nodes are attached to GCP load balancers and are split into the following groups:
- `<env>-base-haproxy-pages`: HTTP/HTTPS traffic for `*.gitlab.io`. It is also where customers point custom DNS for pages.
  - frontends: `pages_http`, `pages_https`
  - backends: `pages_http`, `pages_https`
- `<env>-base-haproxy-main`: HTTP/HTTPS, SSH for gitlab.com
  - frontends: `http`, `https`, `ssh`, `api_rate_limit`, `https_git_ci_gateway`
  - backends: `429_slow_down`, `asset_proxy`, `api`, `api_rate_limit`, `canary_api`, `canary_https_git`, `canary_web`, `ci_gateway_catch_all`, `https_git`, `main_api`, `main_web`, `ssh`, `web`, `websocket`
- `<env>-base-haproxy-ci`: HTTP/HTTPS, SSH for gitlab.com, dedicated gateway for internal CI runners
  - frontends: see above
  - backends: see above
- `<env>-base-haproxy-registry`: HTTP/HTTPS for registry.gitlab.com
  - frontends: `http`, `https`
  - backends: `registry`, `canary_registry`
Explanation:
- Each `<env>-base-haproxy-*` above represents a Chef role, since we use Chef to configure our nodes. Browse to the `chef-repo/roles` directory and you will see them (a quick way to list them is shown below).
- The references after the frontends and backends refer to concepts in the HAProxy configuration language.
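As a convenience, a minimal sketch for listing these role definitions from a local checkout of chef-repo (the role file names are assumed to follow the `<env>-base-haproxy-*` pattern above):

```sh
# List the HAProxy Chef roles defined for an environment (gstg as an example).
# Assumes a local checkout of chef-repo; adjust the path to your checkout.
cd chef-repo
ls roles/ | grep 'base-haproxy'
```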
```mermaid
graph TB
  a[client request] --> b[Route 53]
  a --> c[CloudFlare]
  c --> d[GCP Load Balancer]
  d --> e[HAProxy Frontend]
  e --> f[backend choice]
  f --> g[HAProxy Backend]
  h[CI Runner] --> d
```
Frontend and Backend Configuration
- HAProxy frontends define how requests are forwarded to backends.
- Backends configure a list of servers for load balancing.
- The HAProxy configuration is defined in the `gitlab-haproxy` cookbook, and you can also find it in `/etc/haproxy/haproxy.cfg` on any of the HAProxy nodes (see the example below for a quick way to inspect it).
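A minimal sketch for getting an overview of what is configured on a node, assuming the standard config path mentioned above:

```sh
# On an HAProxy node: list the frontends and backends defined in the running config.
sudo grep -E '^(frontend|backend) ' /etc/haproxy/haproxy.cfg
```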
Frontends
- `http`: port 80
  - delivers a 301 to https (a quick check is shown after this list)
- `https`: port 443
  - sends to the `429_slow_down` backend if the client is rate-limited
  - sends to the `asset_proxy` backend if the request matches `/assets/`
  - sends to the `api` backend if the request matches `/api`
  - sends to the `https_git` backend if the request matches a regex that tries to determine if it looks like a git path
  - sends to the `registry` backend if the request is for registry.gitlab.com
  - sends to the `websocket` backend if it looks like a websocket request
  - sends to the `canary_api`, `canary_https_git` or `canary_web` backend if it looks like a canary request (cookie and `canary.*`)
  - if nothing else matches, requests are sent to the `web` backend
- `ssh`: port 22
  - sends to the `ssh` backend
- `api_rate_limit`: port 4444
  - used by the `https` frontend (see above)
- `https_git_ci_gateway`: port 8989
  - receives requests from CI runners through a private network
  - sends to the `api_rate_limit` backend if the request matches `/api/v[0-9]/jobs/(jobs|request|trace)`
  - all other requests are sent to the `ci_gateway_catch_all` backend
- `pages_http`: port 80
  - sends to the `pages_http` backend
- `pages_https`: port 443
  - sends to the `pages_https` backend
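As referenced in the `http` entry above, a minimal, read-only way to verify the redirect behaviour from any client:

```sh
# Expect a 301 redirect to HTTPS from the plain-HTTP frontend (headers only).
curl -sI http://gitlab.com/ | head -n 3
```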
Backends
- `asset_proxy`: proxy for `gitlab-<env>-assets.storage.googleapis.com`
- `api`, `main_api`: NGINX ingress for the `gitlab-webservice-api` Kubernetes Deployment
- `api_rate_limit`: proxy for the `api_rate_limit` frontend
- `https_git`: Service IP of the `gitlab-gitlab-git` Kubernetes Deployment
- `registry`: Service IP of the `gitlab-registry` Kubernetes Deployment
- `ssh`: Service IP of the `gitlab-gitlab-shell` Kubernetes Deployment
- `web`, `main_web`: Service IP of the `gitlab-webservice-web` Kubernetes Deployment
- `websockets`: Service IP of the `gitlab-webservice-websockets` Kubernetes Deployment
- `pages_http`: all of the `pages` Kubernetes deployments
- `pages_https`: all of the `pages` Kubernetes deployments
- `canary_api`: NGINX ingress for the `gitlab-cny-webservice-api` Kubernetes Deployment
- `canary_https_git`: Service IP of the `gitlab-cny-webservice-git` Kubernetes Deployment
- `canary_registry`: Service IP of the `gitlab-cny-registry` Kubernetes Deployment
- `canary_web`: Service IP of the `gitlab-cny-webservice-web` Kubernetes Deployment
- `ci_gateway_catch_all`: redirect all requests to https://gitlab.com
- `429_slow_down`: deny rate-limited requests with HTTP 429
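For a live view of these backends and their servers on a node, a sketch using the HAProxy admin socket (assuming `socat` is installed; the socket path is the same one `hatop` uses later in this page, and the status column position should be confirmed against your HAProxy version):

```sh
# On an HAProxy node: print every backend server and its current status.
# Field 18 of HAProxy's "show stat" CSV output is the status column.
echo "show stat" | sudo socat stdio /run/haproxy/admin.sock \
  | awk -F, '$1 !~ /^#/ && $2 != "FRONTEND" && $2 != "BACKEND" {print $1 "/" $2 ": " $18}'
```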
Kubernetes Backends
For services that reside in Kubernetes, each backend has one server entry per cluster that traffic can be sent to.
To save on cross-zone network costs, server entries in the same zone as the HAProxy node are active,
while entries in other zones are marked as `backup` on the server configuration line.
Traffic is therefore mostly kept inside a single zone,
but can still fail over to another zone should we suffer a zonal failure.
Example:
```mermaid
graph TB
  a[HAProxy node in zone b] --> b[cluster b]
  a -->|backup| c[cluster c]
  a -->|backup| d[cluster d]
```
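A minimal sketch for seeing this on a node, assuming the `backup` keyword appears on the server lines as described above:

```sh
# On an HAProxy node: show backend server lines marked as backup
# (these point at clusters in zones other than this node's zone).
sudo grep -E '^[[:space:]]*server .* backup' /etc/haproxy/haproxy.cfg
```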
Some services send traffic to an NGINX Ingress endpoint, while others use the Service IP provided by the Kubernetes Deployment for that service. Go to the associated documentation for a given service to determine how this is configured.
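For the Service-IP style backends, a sketch of how to look up what HAProxy points at from a console with cluster access. The Service name here is assumed to match the Deployment names listed above, which may not hold for every service, so verify against the service's own documentation:

```sh
# Show the Service and its endpoints for the web backend
# (Service name assumed; adjust per service).
kubectl --namespace gitlab get service gitlab-webservice-web
kubectl --namespace gitlab get endpoints gitlab-webservice-web
```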
Load balancing
Currently the HAProxy backend configuration uses round-robin for
every pool of servers, with the exception of `asset_proxy`, `ssh` and `pages`.
There is an open issue to discuss using sticky sessions for the web backend,
see this.
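A minimal sketch for confirming which balance algorithm each backend uses on a node:

```sh
# On an HAProxy node: print each backend and its configured balance algorithm.
sudo awk '/^backend /{b=$2} $1=="balance"{print b ": " $2}' /etc/haproxy/haproxy.cfg
```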
Server Weights
By default all servers attached to the backends have the same weight of 100, with the exception of
the canary servers, which are also present in the non-canary backends with a weight of zero.
It is possible to direct all traffic to canary, but
the normal way we send traffic to it is through a static list of request paths for internal projects.
For more information see the canary release documentation
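A sketch for inspecting (or, in an emergency, changing) a server's weight at runtime via the admin socket, assuming `socat` is installed; `web/gke-cny-web` is the canary server name visible in the `get-server-state` output further down this page:

```sh
# On an HAProxy node: show the current and initial weight of the canary server
# inside the non-canary web backend.
echo "get weight web/gke-cny-web" | sudo socat stdio /run/haproxy/admin.sock

# Changing weights is normally done via ChatOps/Chef rather than by hand, e.g.:
# echo "set weight web/gke-cny-web 0" | sudo socat stdio /run/haproxy/admin.sock
```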
Draining
Section titled “Draining”The HAProxy nodes need to be drained of traffic before restarts of the node or HAProxy binary can be applied.
In order to simplify this, the HAProxy deployment includes a drain and wait script.
By default, it waits 10 minutes. It works by blocking the health checks from the upstream GCP load balancer. Note that not all HTTP connections will be drained in 10 minutes.
The HTTP clients can, and do, keep long-lived sessions open.
So a small number of users will get disconnected when doing a drain.
But it will cleanly clear out the majority of traffic.
This script is automatically called as part of the systemd unit start and stop process. This allows for easy draining and restarting of HAProxy.
For example:
```
sudo systemctl stop haproxy
```

NOTE: This stop command will wait 10 minutes before it completes.
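While a drain is in progress, a minimal way to watch the open connection count fall (assuming `socat` is installed):

```sh
# On the HAProxy node being drained: report the open connection count every 5 seconds.
watch -n 5 'echo "show info" | sudo socat stdio /run/haproxy/admin.sock | grep CurrConns'
```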
To drain the node with a custom time, or drain without stopping haproxy:
```
sudo /usr/local/sbin/drain_haproxy.sh -w 60
```

Un-draining is executed as part of the HAProxy systemd unit start process. It can also be done manually by calling the drain script again in un-drain mode.
```
sudo /usr/local/sbin/drain_haproxy.sh -u
```

Canary
Canary can be drained directly using chatops in the #production Slack channel, examples:
- Drain cny in staging: `/chatops run canary --disable --staging`
- Drain cny in production: `/chatops run canary --disable --production`
- Enable cny in production: `/chatops run canary --enable --production`
Tooling
Section titled “Tooling”- There are helper scripts in chef-repo to assist setting server statuses. In general, it is advised to always drain active connections from a server before rebooting.
- For controlling traffic to canary there are ChatOps commands; for more information see the Canary ChatOps documentation
The following helper script can be used for setting the state of any server in the backend:
get-server-state
```
chef-repo$ ./bin/get-server-state gstg
Fetching server state...
  3 fe #:
  3 fe 429_slow_down/BACKEND: UP
  3 fe api/api-gke-us-east1-b-8181: UP
  3 fe api/api-gke-us-east1-c-8181: UP
  3 fe api/api-gke-us-east1-d-8181: UP
  3 fe api/BACKEND: UP
  3 fe api/gke-cny-api-8181: UP
  3 fe api_rate_limit/BACKEND: UP
  3 fe api_rate_limit/FRONTEND:
  3 fe api_rate_limit/localhost:
  3 fe asset_proxy/asset-bucket: UP
  3 fe asset_proxy/BACKEND: UP
  3 fe canary_api/BACKEND: UP
  3 fe canary_api/gke-cny-api-8181: UP
  3 fe canary_https_git/BACKEND: UP
  3 fe canary_https_git/gke-cny-git-https: UP
  3 fe canary_web/BACKEND: UP
  3 fe canary_web/gke-cny-web: UP
  3 fe check_http/FRONTEND:
  3 fe check_https/FRONTEND:
  3 fe check_ssh/FRONTEND:
  3 fe http/FRONTEND:
  3 fe https/FRONTEND:
  3 fe https_git/BACKEND: UP
  3 fe https_git/git-https-gke-us-east1-b: UP
  3 fe https_git/git-https-gke-us-east1-c: UP
  3 fe https_git/git-https-gke-us-east1-d: UP
  3 fe https_git/gke-cny-git-https: UP
  2 fe-pages #:
  2 fe-pages check_http/FRONTEND:
  2 fe-pages check_https/FRONTEND:
  2 fe-pages pages_http/BACKEND: UP
  2 fe-pages pages_http/FRONTEND:
  2 fe-pages pages_http/gke-cny-pages: UP
  2 fe-pages pages_http/pages-us-east1-b: UP
  2 fe-pages pages_http/pages-us-east1-c: UP
  2 fe-pages pages_http/pages-us-east1-d: UP
  2 fe-pages pages_https/BACKEND: UP
  2 fe-pages pages_https/FRONTEND:
  2 fe-pages pages_https/gke-cny-pages-proxyv2: UP
  2 fe-pages pages_https/pages-us-east1-b-proxyv2: UP
  2 fe-pages pages_https/pages-us-east1-c-proxyv2: UP
  2 fe-pages pages_https/pages-us-east1-d-proxyv2: UP
  2 fe-pages pxname/svname:
  2 fe-pages stats/BACKEND: UP
  2 fe-pages stats/FRONTEND:
  3 fe pxname/svname:
  2 fe-registry #:
  2 fe-registry canary_registry/BACKEND: UP
  2 fe-registry check_http/FRONTEND:
  2 fe-registry check_https/FRONTEND:
  2 fe-registry http/FRONTEND:
  2 fe-registry https/FRONTEND:
  2 fe-registry pxname/svname:
  2 fe-registry registry/BACKEND: UP
  2 fe-registry registry/registry-us-east1-b: UP
  2 fe-registry registry/registry-us-east1-c: UP
  2 fe-registry registry/registry-us-east1-d: UP
  2 fe-registry stats/BACKEND: UP
  2 fe-registry stats/FRONTEND:
  3 fe ssh/BACKEND: UP
  3 fe ssh/FRONTEND:
  3 fe ssh/gke-cny-ssh: UP
  3 fe ssh/shell-gke-us-east1-b: UP
  3 fe ssh/shell-gke-us-east1-c: UP
  3 fe ssh/shell-gke-us-east1-d: UP
  3 fe ssh/sock-1:
  3 fe stats/BACKEND: UP
  3 fe stats/FRONTEND:
  3 fe web/BACKEND: UP
  3 fe web/gke-cny-web: UP
  3 fe websockets/BACKEND: UP
  3 fe websockets/gke-cny-ws: UP
  3 fe websockets/ws-gke-us-east1-b: UP
  3 fe websockets/ws-gke-us-east1-c: UP
  3 fe websockets/ws-gke-us-east1-d: UP
  3 fe web/web-gke-us-east1-b-8181: UP
  3 fe web/web-gke-us-east1-c-8181: UP
  3 fe web/web-gke-us-east1-d-8181: UP
```

- The first number refers to the number of load balancers reporting the server status
- The second field is the backend/server-name
- The last field is the current status which may be {UP,MAINT,DRAIN}
- This field may be blank if the backend is down or has no status, as seen in the above example.
set-server-state
The set-server-state script allows you to change the server state so
that it can start draining connections, or stop taking traffic entirely in situations
where you do not want any traffic going to a server.
Note that this will trigger the HAProxyServerDown alert.
Before you drain a node, add a new silence.
Important:
Be sure to alert release managers on Slack when changing the state of VMs or Clusters in HAProxy by mentioning @release-managers.
Example Matchers:
- Name: `alertname`, Value: `HAProxyServerDown`
- Name: `server`, Value: `git-01-sv-gstg`
Be sure to put a link to an issue in the alert silence comment.
```
# Sets server state on HAProxy nodes
chef-repo$ ./bin/set-server-state {gprd,gstg} <ready|drain|maint> [filter]
```
Examples:

```
chef-repo$ ./bin/set-server-state gstg drain git-10   # set git-10 to drain
chef-repo$ ./bin/set-server-state gstg ready git-10   # set git-10 to ready
```

Here is a full example of setting server git-01 in gstg to the DRAIN state:
```
chef-repo$ ./bin/set-server-state gstg drain git-01
Fetching server state...
  2 altssh/git-01-sv-gstg : UP
  3 https_git/git-01-sv-gstg : UP
  3 ssh/git-01-sv-gstg : UP
  3 websockets/git-01-sv-gstg : UP
^^^^^ The states of the above servers will be changed to drain.
Press enter to continue.
Setting server state...
  2 updated altssh/git-01-sv-gstg
  3 updated https_git/git-01-sv-gstg
  3 updated ssh/git-01-sv-gstg
  3 updated websockets/git-01-sv-gstg
Fetching server state...
  2 altssh/git-01-sv-gstg : DRAIN
  3 https_git/git-01-sv-gstg : DRAIN
  3 ssh/git-01-sv-gstg : DRAIN
  3 websockets/git-01-sv-gstg : DRAIN
```

Admin console for HAProxy (single node)
HAProxy has a built-in web admin console; this is not terribly useful for managing a fleet of HAProxy nodes, but if just one is misbehaving then it might be handy. To access it, ssh port forward to port 7331, e.g.:
```
ssh -L 7331:localhost:7331 fe-01-lb-gstg.c.gitlab-staging-1.internal
```

Then access http://localhost:7331 in your browser.
The username is `admin`. The password is most easily obtained from haproxy.cfg on the server itself (look for the `stats auth` section),
but can also be obtained by looking for the `admin_password` value in the GKMS vault, e.g.:
```
gkms-vault-show frontend-loadbalancer gstg
```

Removing HAProxy VMs from GCP Load Balancers
It is possible to temporarily remove HAProxy machines from the GCP Load Balancers individually (they are not added as an instance group). Be careful! This operation has user-facing impact, such as terminating sessions, and can cause higher load on the remaining HAProxy nodes!
The load balancer backend service is defined in the corresponding Terraform module (at the moment of writing that’s `generic-sv-with-group`),
so doing this with Terraform is non-trivial.
It can be done using the GCP console: Load Balancing -> click on the relevant lb -> Edit -> Backend configuration -> remove the HAProxy machine from the “Select existing instances” list -> Click update.
hatop is an ncurses-based command-line GUI for inspecting the status of HAProxy.
It can be run on any HAProxy (<env>-base-haproxy-*) host via:
```
sudo hatop -s /run/haproxy/admin.sock
```

HAProxy Alert Troubleshooting
Reason
- Errors are being reported by HAProxy; this could be a spike in 5xx errors, server connection errors, or backends reporting unhealthy.
Prechecks
- Examine the health of all backends and the HAProxy dashboard
- HAProxy - https://dashboards.gitlab.net/d/haproxy/haproxy
- HAProxy Backend Status - https://dashboards.gitlab.net/d/frontend-main/frontend-overview
- Is the alert specific to canary servers or the canary backend?
Check canaries to ensure they are reporting OK.
If this is the cause, you should immediately change the weight of canary traffic.
- Canary dashboard - https://dashboards.gitlab.net/d/llfd4b2ik/canary
- To disable canary traffic see the Canary ChatOps documentation
Resolution
- If there is a single backend server alerting, check to see if the node is healthy on the host status dashboard. In some cases, most notably the git servers, it is possible for a server to reject connections even though it is reporting healthy. Check the health of the deployment in Kubernetes:
```
kubectl --namespace gitlab get deployment gitlab-gitlab-shell
kubectl --namespace gitlab get pods --selector app=gitlab-shell
```

- HAProxy logs are not currently being sent to ELK because of capacity issues. More information can be read here.
- If the errors are from pages backends, consider possible intentional abuse or accidental DoS from specific IPs or for specific domains in Pages:
  - Client IPs can be identified by volume from the current HAProxy logs on the HAProxy nodes with:

    ```
    sudo grep -v check_http /var/log/haproxy.log | awk '{print $6}' | cut -d: -f1 | sort | uniq -c | sort -n | tail
    ```

    Identifying problematic levels is not set in stone; hopefully if there are one or two offenders (or a subnet), they will stand out. Consider removing the `tail` or making it `tail -100`, etc., to get more context.
    - To block: in https://gitlab.com/gitlab-com/security-tools/front-end-security edit `deny-403-ips.lst`. Commit, push, open an MR, and ensure it has pull mirrored to ops.gitlab.net, then run Chef on the pages HAProxy nodes to deploy. This will block that IP across all frontends (pages, web, api, etc.), so be sure you want to do this.
  - Problem sites/projects/domains can be identified with the `Gitlab-Pages activity` dashboard on Kibana - https://log.gprd.gitlab.net/app/kibana#/dashboard/AW6GlNKPqthdGjPJ2HqH
    - To block: in https://gitlab.com/gitlab-com/security-tools/front-end-security edit `deny-403-ips.lst`. Commit, push, open an MR, and ensure it has pull mirrored to ops.gitlab.net, then run Chef on the pages HAProxy nodes to deploy. This will block only the named domain (exact match) in pages, preventing the request from ever making it to the pages deployments. This is very low-risk.
Extraneous Processes
HAProxy forks on reload, and old processes will continue to service requests.
For long-lived SSH connections we use the `hard-stop-after` configuration parameter to prevent processes from lingering for more than 5 minutes.
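A quick way to confirm the configured value on a node (a sketch, assuming the directive is set in the main config file):

```sh
# On an HAProxy node: show the hard-stop-after setting in the running config.
sudo grep hard-stop-after /etc/haproxy/haproxy.cfg
```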
In https://gitlab.com/gitlab-com/gl-infra/delivery/issues/588 we have observed that processes remain for longer than this interval, which may require manual intervention:
- Display the process tree for HAProxy, for example here it shows two processes where we expect one:
```
pstree -pals $(pgrep -u root -f /usr/sbin/haproxy)
systemd,1 --system --deserialize 36
  └─haproxy,28214 -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1827
      ├─haproxy,1827 -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1639
      └─haproxy,2002 -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf
```

- Show the elapsed time of the haproxy processes:
```
# for p in $(pgrep -u haproxy -f haproxy); do ps -o user,pid,etimes,command $p; done
USER       PID ELAPSED COMMAND
haproxy   1827   99999 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1639
USER       PID ELAPSED COMMAND
haproxy   2002      20 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1827
```

- Kill the process with the longer elapsed time:
```
kill -TERM 1827
```

Server Down
HAProxy will mark a server in a backend as down if it is unreachable or has a failing health check.
It will also be marked down if the node has been drained with set-server-state.
- From a HAProxy node, you can ping and/or curl the backend server and health check.
- From the server, check the logs of the process. For Rails/Puma, database connection problems can cause HUP to fail, and Puma will need a full restart with `gitlab-ctl restart puma`.
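When investigating, the HAProxy log on the node records server state transitions; a minimal sketch for finding recent ones (the log path is the same one used in the Pages triage commands above):

```sh
# On an HAProxy node: show recent server state-change messages from the HAProxy log.
sudo grep -E 'Server .* is (DOWN|UP)' /var/log/haproxy.log | tail -n 20
```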