
HTTP Router: On-Call Survival Guide

This guide helps the engineer on call (EOC) respond to incidents related to HTTP Router.

HTTP Router is the entry point for all HTTP requests under gitlab.com/*. It sits between Cloudflare’s edge network and GitLab’s backend infrastructure, determining which Cell should handle each incoming request. Currently, by default it proxies requests to our legacy cell.

HTTP Router is built on Cloudflare Workers and deployed via http-router-deployer. It enables the Cells architecture by presenting all Cells under a single gitlab.com domain.

Design Documentation: HTTP Routing Service Architecture

  • Routes requests to the correct Cell based on path, headers, or cached classification.
    • Currently, it proxies most requests to our legacy cell. This will change once path-based routing is enabled.
  • Queries Topology Service to identify which cell to proxy requests to (classify).
  • Does not buffer request bodies (memory-constrained).
  • Does not handle Git SSH traffic (separate SSH routing).
  • Does not perform authentication or authorization (it may extract routing keys from tokens, but validation happens in the backend GitLab application).
  • Does not make routing classification decisions itself (core classification logic lives in Topology Service).

How HTTP Router Communicates with Topology Service

HTTP Router authenticates to Topology Service using Cloudflare Zero Trust Service Tokens. These tokens are:

  • Injected via Worker environment variables
  • Automatically rotated using a dual-token strategy (Token A rotates 90 days before expiry, Token B rotates 180 days before expiry) to ensure zero downtime
  • Managed through config-mgmt
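As a rough illustration of how a Worker might attach a service token to its Topology Service requests, the sketch below builds the two headers Cloudflare Access expects from a service token (`CF-Access-Client-Id` and `CF-Access-Client-Secret`). The environment variable names and the `buildAccessHeaders` helper are assumptions made for this example; they are not taken from the http-router codebase.

```typescript
// Hypothetical shape of the Worker environment. The variable names are
// assumptions for illustration, not the router's real bindings.
interface AccessTokenEnv {
  TS_ACCESS_CLIENT_ID?: string;
  TS_ACCESS_CLIENT_SECRET?: string;
}

// Build the two headers that Cloudflare Access reads for service token auth.
// Throws early when either value is missing, so a misconfigured deployment
// fails with a clear message rather than a rejected classify request.
function buildAccessHeaders(env: AccessTokenEnv): Record<string, string> {
  const id = env.TS_ACCESS_CLIENT_ID;
  const secret = env.TS_ACCESS_CLIENT_SECRET;
  if (!id || !secret) {
    throw new Error("service token is missing from the Worker environment");
  }
  return {
    "CF-Access-Client-Id": id,
    "CF-Access-Client-Secret": secret,
  };
}
```

If the injected values are missing or stale, requests built this way would be rejected before reaching Topology Service, which matches the classify error symptom described in this section.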

If there is an issue with service token authentication, you will see high error rates from the Classify service as a symptom. In this case:

  1. Check Sentry for authentication-related exceptions
  2. Check Topology Service metrics for incoming request failures
  3. Review Cloudflare Access Audit Logs for blocked requests

For more details, see security.md.

| Metric | Value |
|---|---|
| Traffic Volume | ~40k+ requests/second |
| Deployment | Cloudflare Workers (edge) |
| Dependencies | Topology Service, Worker Environment Variables |

| Resource | Link |
|---|---|
| Repository | gitlab-org/cells/http-router |
| Deployer | http-router-deployer |
| Pipelines | Deployment Pipelines |
| Grafana | HTTP Router Overview |
| Sentry | http-router |
| Alerts | HTTP Router Alerts |

The Sentry project captures all exceptions across environments (gprd, gstg).

| Environment | Live Logs | Historical Logs |
|---|---|---|
| Production | Live | Historical |
| Staging | Live | Historical |

For detailed logging, see logging.md.

If Grafana shows missing metrics, check the Cloudflare dashboard directly.

See missing-metrics.md for troubleshooting.

Review successful deployment pipelines to identify if a recent change caused the issue.

Identifying the Currently Deployed Version

flowchart TD
    A[Find latest successful pipeline on main] --> B{Was rollback job triggered?}
    B -->|No| C[Current version = latest pipeline commit]
    B -->|Yes| D[Current version = previous pipeline commit]
    D --> E[Revert the MR that caused the rollback]

Steps:

  1. Go to Deployment Pipelines
  2. Find the latest successful pipeline
  3. Check if the rollback job was triggered on that pipeline
    • If no rollback: the commit from this pipeline is what’s currently deployed
    • If rollback was triggered: the commit from the previous pipeline is what’s deployed, and you should revert the MR that caused the rollback
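The decision in the steps above can be sketched as a small function. This is a minimal illustration, not tooling that exists in the deployer: the `PipelineSummary` fields and the `currentlyDeployed` helper are hypothetical, and it assumes "previous pipeline" means the previous successful pipeline on main.

```typescript
// Hypothetical, simplified view of a pipeline on main. Field names are
// assumptions and do not mirror the GitLab API response.
interface PipelineSummary {
  commitSha: string;
  succeeded: boolean;
  rollbackTriggered: boolean;
}

interface DeployedVersion {
  commitSha: string;
  revertNeeded: boolean; // true when the latest successful pipeline was rolled back
}

// `pipelines` must be ordered newest first.
// Returns undefined when there is not enough history to decide.
function currentlyDeployed(pipelines: PipelineSummary[]): DeployedVersion | undefined {
  const successful = pipelines.filter((p) => p.succeeded);
  const latest = successful[0];
  if (latest === undefined) return undefined;

  if (!latest.rollbackTriggered) {
    // No rollback: the latest pipeline's commit is what is deployed.
    return { commitSha: latest.commitSha, revertNeeded: false };
  }

  // Rollback triggered: the previous pipeline's commit is deployed, and the
  // MR behind the latest pipeline should be reverted.
  // Assumption: the previous pipeline was not itself rolled back.
  const previous = successful[1];
  if (previous === undefined) return undefined;
  return { commitSha: previous.commitSha, revertNeeded: true };
}
```

The sketch only looks one pipeline back. If several consecutive pipelines were rolled back, check each one manually in the pipeline list.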

HTTP Router depends on Topology Service (TS) for request classification. If TS is unhealthy, you’ll see classify failures in HTTP Router.

For detailed TS troubleshooting, see the Topology Service Runbook.

| Symptom | Likely Cause | Action |
|---|---|---|
| 502 on large uploads | ReadableStream.tee() buffer limit exceeded | Check Sentry; review recent MRs for body handling changes |
| Intermittent 502s | Worker memory limits | Check Cloudflare logs for buffer/memory errors |
| High latency | CPU time exceeded | Check CPU metrics in Cloudflare dashboard |
| Missing Grafana metrics | Cloudflare processing delay | Verify via GraphQL API (missing-metrics.md) |
| High error rate on classify requests | Topology Service connectivity or auth issues | Check Sentry for auth exceptions; check Topology Service health |

To roll back to a known good deployment:

  1. Go to Deployment Pipelines
  2. Find the last known good deployment
  3. Run the rollback job

Details: Rollback Documentation

Disable HTTP Router only if it is causing critical issues and must be bypassed entirely:

  1. Remove or disable routes in cloudflare-workers.tf
  2. Create an MR, get approval, and apply with atlantis apply

Details: disable-http-router.md

HTTP Router deploys separately from the GitLab application:

  1. Changes are merged to http-router.
  2. The project on gitlab.com is mirrored to http-router-deployer on ops.gitlab.net, which runs the deployment pipeline.
  3. Progression: staging → production

For more detail on deployment, see: https://gitlab.com/gitlab-org/cells/http-router/-/blob/main/docs/deployment.md?ref_type=heads