HAProxy Backend Has No Active Servers
Symptoms
- PagerDuty alert: HAProxyBackendNoActiveServers
- Alert labels identify the affected fleet and backend:
  - type: one of main, ci, registry, pages
  - backend: the HAProxy backend (e.g. web, api, https_git, registry, pages_https)
- Customer-visible impact: traffic to the named backend in that fleet is failing. Depending on the backend, this may surface as 5xx errors, git/SSH connection failures, registry pulls failing, or Pages outages.
What the alert means
Every HAProxy node in the fleet (type=<fleet>) reports haproxy_backend_active_servers == 0
for the named backend. HAProxy has marked every upstream server in that backend as DOWN,
so it has nowhere to send traffic. The alert is aggregated with
max by (backend, type, environment), so it fires only when all nodes in the fleet agree:
the maximum across nodes is 0 only if every node reports 0.
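The aggregation can be illustrated with a minimal Python sketch. It is not the alert rule itself, only a model of why a max of 0 implies every node reports 0. The sample shape (a list of label/value pairs) and the fqdn values are hypothetical; the grouping labels come from the aggregation described above.

```python
from collections import defaultdict


def backends_with_no_active_servers(samples):
    """Return (backend, type, environment) groups whose max active-server count is 0.

    `samples` is an iterable of (labels, value) pairs, one per HAProxy node,
    mimicking haproxy_backend_active_servers series. This input shape is an
    assumption made for illustration.
    """
    max_per_group = defaultdict(int)  # counts are never negative, so 0 is a safe floor
    for labels, value in samples:
        key = (labels["backend"], labels["type"], labels["environment"])
        max_per_group[key] = max(max_per_group[key], value)
    # A group qualifies only if no node in it reported a value above 0.
    return sorted(key for key, highest in max_per_group.items() if highest == 0)


# Hypothetical samples: two nodes, two backends.
samples = [
    ({"backend": "web", "type": "main", "environment": "gprd", "fqdn": "lb-01"}, 0),
    ({"backend": "web", "type": "main", "environment": "gprd", "fqdn": "lb-02"}, 0),
    ({"backend": "api", "type": "main", "environment": "gprd", "fqdn": "lb-01"}, 0),
    ({"backend": "api", "type": "main", "environment": "gprd", "fqdn": "lb-02"}, 3),
]
print(backends_with_no_active_servers(samples))
```

In this sample only the web backend qualifies: one node still sees three active api servers, so the maximum for api is above 0 and that group is not reported.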
Triage
- Identify the impacted fleet and backend from the alert payload.
- Open the HAProxy dashboards and filter by type and backend:
  - HAProxy overview: https://dashboards.gitlab.net/d/haproxy/haproxy
  - Frontend overview: https://dashboards.gitlab.net/d/frontend-main/frontend-overview
- Confirm the situation in Mimir:

      haproxy_backend_active_servers{tier="lb", env="gprd", type="<fleet>", backend="<backend>"}

  Expect this to be 0 for the affected (type, backend) across every fqdn/instance.
- Cross-check upstream health for that backend. The backend maps to a Kubernetes Deployment (Service IP or NGINX ingress); see HAProxy Management at GitLab → Backends for the mapping. For example, web/main_web → gitlab-webservice-web, registry → gitlab-registry, pages_https → pages deployments.
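The Mimir confirmation step can also be scripted. The sketch below assumes the query was run through a Prometheus-compatible HTTP API (an instant query against /api/v1/query) and that the JSON response body is already in hand; how to reach Mimir (URL, authentication) is not covered here. The function name is hypothetical.

```python
import json


def nodes_still_serving(response_text):
    """Return {node: value} for every series whose value is not 0.

    `response_text` is the JSON body of a Prometheus-style instant query for
    haproxy_backend_active_servers. An empty result dict means every returned
    series reports 0, which matches the alert condition.
    """
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    serving = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        # Instant-vector samples are [unix_timestamp, "value-as-string"].
        value = float(series["value"][1])
        if value != 0:
            node = labels.get("fqdn") or labels.get("instance", "<unknown>")
            serving[node] = value
    return serving
```

If the returned dict is non-empty, at least one node still sees active servers, and the alert condition no longer holds for the whole fleet.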
Common causes
- Upstream deployment is unhealthy: pods crash-looping, readiness probes failing, rollout in a bad state.
- All backend servers were drained (e.g. via set-server-state) and never re-enabled.
- HAProxy health-check misconfiguration after a chef-repo change.
- Network or DNS issue between HAProxy and the upstream Service IP / NGINX ingress.
- Certificate or TLS issue on the upstream causing health checks to fail.
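The first cause in the list above can be checked by comparing a Deployment's ready replicas against its desired count. The following is a minimal sketch that assumes you have the output of `kubectl get deployment <name> -o json` for the mapped Deployment (for example gitlab-registry); the cluster, namespace, and access method are not specified here, and the function name is hypothetical.

```python
import json


def deployment_health(deployment_json):
    """Summarise replica readiness from a Deployment object serialised as JSON."""
    doc = json.loads(deployment_json)
    desired = doc.get("spec", {}).get("replicas", 1)
    status = doc.get("status", {})
    # Kubernetes omits these status fields when their value is zero.
    ready = status.get("readyReplicas", 0)
    unavailable = status.get("unavailableReplicas", 0)
    return {
        "name": doc["metadata"]["name"],
        "desired": desired,
        "ready": ready,
        "unavailable": unavailable,
        "healthy": desired > 0 and ready >= desired,
    }
```

A result with ready at 0 points at the upstream rather than at HAProxy; a fully ready Deployment suggests looking at the remaining causes (drained servers, health-check configuration, network, TLS).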
Resolution
- If the upstream Kubernetes Deployment is unhealthy, follow the runbook for that service (e.g. webservice, registry, gitaly, pages) and restore it. The alert clears automatically once HAProxy health checks pass.
- If servers were drained intentionally, restore them with:

      chef-repo$ ./bin/set-server-state {gprd,gstg} ready <filter>

  See set-server-state.
- If a recent chef-repo change to roles/<env>-base-haproxy-<fleet>.json looks related, revert it and run chef on the affected HAProxy nodes.
- To confirm from a HAProxy node, run sudo hatop -s /run/haproxy/admin.sock and check the backend's server status, or use get-server-state from chef-repo.
- Logs on a HAProxy node: /var/log/haproxy.log.
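As a non-interactive alternative to hatop, the same admin socket answers the HAProxy `show stat` command with CSV. The sketch below is one possible way to list per-server status for a backend; it assumes the socket path given above, enough privileges to open it (typically root), and the standard pxname, svname, and status columns. Both function names are hypothetical.

```python
import csv
import io
import socket


def read_show_stat(socket_path="/run/haproxy/admin.sock"):
    """Send `show stat` to the HAProxy admin socket and return the raw CSV text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(b"show stat\n")
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # HAProxy closes the connection after replying
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


def servers_by_status(show_stat_csv, backend):
    """Map each server in `backend` to its status string (e.g. UP, DOWN, MAINT)."""
    text = show_stat_csv.lstrip()
    if text.startswith("# "):
        text = text[2:]  # the CSV header line is prefixed with "# "
    statuses = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["pxname"] != backend:
            continue
        # FRONTEND and BACKEND rows are aggregates, not individual servers.
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue
        statuses[row["svname"]] = row["status"]
    return statuses
```

On a node this could be called as `servers_by_status(read_show_stat(), "web")`. Servers reported as DOWN usually point at failing health checks, while MAINT or DRAIN states suggest an operator action such as an earlier set-server-state run.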
Escalation
- If you cannot quickly identify the upstream owner, declare an incident (this is an s1 page) and tag the service owner from the services catalog.
- If multiple backends or multiple fleets are alerting simultaneously, treat it as a wider frontend / network incident and follow gitlab.com is down.