
Topology Service: On-Call Survival Guide

This guide helps on-call engineers respond to Topology Service incidents. It assumes you’re familiar with SLIs, error budgets, and cloud platforms (GCP), but have no prior knowledge of Topology Service.

Topology Service is the central coordination system for GitLab Cells, providing three critical services that enable routing, global uniqueness, and cell provisioning. It runs in two deployment modes: a REST API (topology-rest) used by the HTTP Router and a gRPC API (topology-grpc) used for internal cell operations. It is deployed from topology-service-deployer via Runway, which orchestrates the deployments through two deployment projects (topology-rest and topology-grpc).

Service | Purpose | Code Path | Deployment | Impact When Down
ClassifyService | Routes requests to the correct cell | internal/services/classify/ | REST | CRITICAL: all routing broken, no requests reach cells [1]
ClaimService | Ensures global uniqueness (usernames, emails, namespaces) | internal/services/claim/ | gRPC only | Cannot create users/groups/projects, transactions fail
SequenceService | Allocates non-overlapping ID ranges during cell provisioning | internal/services/cell/ | gRPC only | Cannot provision new cells

Footnotes: [1] ClassifyService failure is gradual: routing degrades as the cache expires. Indicators: decreased (not zero) cell-local traffic and a 404 spike from the legacy cell fallback.

Key takeaway for on-call: ClassifyService affects routing (immediate user impact), ClaimService affects writes (no routing impact, but database transaction failures increase), SequenceService affects only new cell provisioning.
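If you need to confirm from the client side which API is reachable, one option is a reflection-based grpcurl probe against the gRPC endpoint. This is a sketch only: the hostname below is a placeholder, and server reflection may not be enabled on topology-grpc.

# Placeholder hostname: substitute the real topology-grpc endpoint
TOPOLOGY_GRPC_HOST="topology-grpc.example.internal:443"

# List the exposed gRPC services (requires server reflection to be enabled)
grpcurl "$TOPOLOGY_GRPC_HOST" list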

Critical dependency: All services rely on Cloud Spanner. Spanner CPU/connection issues cascade to all three services.

Architecture in Brief:

Cells Architecture

Topology Service design: https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/cells/topology_service/

For infrastructure-level troubleshooting, Cloud Spanner logs, or emergency rollbacks via Cloud Run UI, you’ll need Breakglass access to the GCP Console:

GCP Console (UI):

  • Production PAM
  • Click “Request Access” → Select breakglass-entitlement-gitlab-runway-topo-svc-prod → Enter incident link → Submit

gcloud CLI (Breakglass):

# Production
gcloud beta pam grants create \
--entitlement="breakglass-entitlement-gitlab-runway-topo-svc-prod" \
--requested-duration="1800s" \
--justification="$INCIDENT_LINK" \
--location=global \
--project="gitlab-runway-topo-svc-prod"
Component | What It Does | Dashboard | Logs
REST API (topology-rest) | HTTP endpoint for HTTP Router to classify requests | REST Dashboard | Grafana Logs & Cloud Run Logs
gRPC API (topology-grpc) | Internal API for cells: claim resources, classify, manage ID sequences | gRPC Dashboard | Grafana Logs & Cloud Run Logs
Cloud Spanner | Stores classifications, claims, ID sequences | Spanner Dashboard | Cloud Logging
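If Grafana is unavailable, Cloud Run logs can also be pulled with gcloud, assuming the breakglass grant above is active and the Cloud Run service name matches the deployment name (topology-rest here; swap in topology-grpc as needed):

# Pull recent topology-rest logs straight from Cloud Logging
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="topology-rest"' \
  --project="gitlab-runway-topo-svc-prod" \
  --limit=50 \
  --freshness=1h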
Alert | Think | Check
ApdexSLOViolation (gRPC/REST) | Requests too slow or failing | Spanner → Service Panel → Spanner Service Logs
ErrorSLOViolation (gRPC/REST) | Service returning errors | Service logs → Spanner status → Recent deployments
TrafficCessation (gRPC/REST) | No traffic (was flowing 1hr ago) | Cloud Run instances → Deployment pipeline
Regional (suffix) | Single region problem | Same as above, region-specific

Mental models:

  • High latency? Think: Spanner CPU → Spanner resource limits → Network
  • Errors? Think: Spanner connection → Service crash → Bad deployment
  • No traffic? Think: Instances down → Load balancer → Deployment
  • Post-deployment weirdness? Check: Recent deployments → Service logs → Spanner status (see the command sketch after this list)
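To check Cloud Run instances and recent deployments directly (the "No traffic" and "Post-deployment" paths above), a gcloud sketch like the following can help; the --region value is an assumption and should be replaced with the region named in the alert.

# Recent revisions of topology-rest: look for a revision created near the incident start
gcloud run revisions list \
  --service="topology-rest" \
  --project="gitlab-runway-topo-svc-prod" \
  --region="us-east1"

# Current serving state of the service
gcloud run services describe "topology-rest" \
  --project="gitlab-runway-topo-svc-prod" \
  --region="us-east1"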

Never roll back. Always roll forward. Spanner schema migrations are one-way only.

Why We Roll Forward Only

Topology Service uses a dual-codebase deployment pattern and Cloud Spanner schema migrations. Rolling back code risks:

(1) deployment/application version mismatches causing failed deploys.

(2) schema incompatibility causing startup failures or database corruption during the zero-instance deployment window.

To recover safely when an incident occurs and a code change is suspected as the root cause:

  • Create a revert: branch from known-good code, rebase onto main, add forward-compatible schema changes if needed, then deploy normally, requesting approvals as usual (see the sketch below).
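A minimal git sketch of that roll-forward flow, assuming the offending commit is known (the branch name and SHA are placeholders):

# Revert the suspect commit as a new, forward-moving commit on top of main
git fetch origin
git checkout -b revert-topology-regression origin/main
git revert <suspect-sha>
# Add forward-compatible schema changes here if the revert needs them, then:
git push -u origin revert-topology-regression
# Open an MR and request approvals as for any normal change; deploy via the usual pipeline.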

Two independent metric paths:

Query via Grafana Explore (mimir-runway datasource).
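For example, a starting-point query in Grafana Explore could look like the following; the metric and label names are assumptions (standard gRPC server metrics), not confirmed names from the Topology Service dashboards, so adjust them to match whatever the existing dashboards use.

# Hypothetical error-rate query for topology-grpc; metric and label names are assumptions
sum by (grpc_method) (
  rate(grpc_server_handled_total{grpc_code!="OK"}[5m])
)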

  • Access expires: PAM grants have a limited duration (1800 seconds in the example above), so re-request if you still need access
  • Document everything: Add findings to incident timeline
  • Escalate early: Team prefers early escalation over solo struggle
  • Roll forward, never roll back: Always deploy fixes via new commits
  • When in doubt: Ask in #g_cells_infrastructure or #f_protocells