
Topology Service: On-Call Survival Guide

This guide helps on-call engineers respond to Topology Service incidents. It assumes you’re familiar with SLIs, error budgets, and cloud platforms (GCP), but have no prior knowledge of Topology Service.

Topology Service is the central coordination system for GitLab Cells, providing three critical services that enable routing, global uniqueness, and cell provisioning. It runs in two deployment modes: a REST API (topology-rest) used by the HTTP Router and a gRPC API (topology-grpc) used for internal cell operations. It is deployed from topology-service-deployer via Runway, which orchestrates the deployments through two deployment projects (topology-rest and topology-grpc).

Service | Purpose | Code Path | Deployment | Impact When Down
ClassifyService | Routes requests to the correct cell | internal/services/classify/ | REST | CRITICAL: all routing broken, no requests reach cells [1]
ClaimService | Ensures global uniqueness (usernames, emails, namespaces) | internal/services/claim/ | gRPC only | Cannot create users/groups/projects, transactions fail
SequenceService | Allocates non-overlapping ID ranges during cell provisioning | internal/services/cell/ | gRPC only | Cannot provision new cells

Footnotes: [1] ClassifyService failure is gradual: routing degrades as the cache expires. Indicators: decreased (not zero) cell-local traffic and a 404 spike from the legacy cell fallback.

Key takeaway for on-call: ClassifyService affects routing (immediate user impact), ClaimService affects writes (no routing impact, but database transaction failures increase), SequenceService affects only new cell provisioning.
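If you need to confirm from the client side which API is reachable, one option is a reflection-based grpcurl probe against the gRPC endpoint. This is a sketch only: the hostname below is a placeholder, and server reflection may not be enabled on topology-grpc.

# Placeholder hostname: substitute the real topology-grpc endpoint
TOPOLOGY_GRPC_HOST="topology-grpc.example.internal:443"

# List the exposed gRPC services (requires server reflection to be enabled)
grpcurl "$TOPOLOGY_GRPC_HOST" list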

Critical dependency: All services rely on Cloud Spanner. Spanner CPU/connection issues cascade to all three services.

Architecture in Brief:

Cells Architecture

Topology Service design: https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/cells/topology_service/

For infrastructure-level troubleshooting, Cloud Spanner logs, or emergency rollbacks via Cloud Run UI, you’ll need Breakglass access to the GCP Console:

GCP Console (UI):

  • Production PAM
  • Click “Request Access” → Select breakglass-entitlement-gitlab-runway-topo-svc-prod → Enter incident link → Submit

gcloud CLI (Breakglass):

# Production
gcloud beta pam grants create \
--entitlement="breakglass-entitlement-gitlab-runway-topo-svc-prod" \
--requested-duration="1800s" \
--justification="$INCIDENT_LINK" \
--location=global \
--project="gitlab-runway-topo-svc-prod"
Component | What It Does | Dashboard | Logs
REST API (topology-rest) | HTTP endpoint for HTTP Router to classify requests | REST Dashboard | Grafana Logs & Cloud Run Logs
gRPC API (topology-grpc) | Internal API for cells: claim resources, classify, manage ID sequences | gRPC Dashboard | Grafana Logs & Cloud Run Logs
Cloud Spanner | Stores classifications, claims, ID sequences | Spanner Dashboard | Cloud Logging
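If Grafana is unavailable, Cloud Run logs can also be pulled with gcloud, assuming the breakglass grant above is active and the Cloud Run service name matches the deployment name (topology-rest here; swap in topology-grpc as needed):

# Pull recent topology-rest logs straight from Cloud Logging
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="topology-rest"' \
  --project="gitlab-runway-topo-svc-prod" \
  --limit=50 \
  --freshness=1h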
Alert | Think | Check
ApdexSLOViolation (gRPC/REST) | Requests too slow or failing | Spanner → Service Panel → Spanner Service Logs
ErrorSLOViolation (gRPC/REST) | Service returning errors | Service logs → Spanner status → Recent deployments
TrafficCessation (gRPC/REST) | No traffic (was flowing 1hr ago) | Cloud Run instances → Deployment pipeline
Regional (suffix) | Single region problem | Same as above, region-specific

Mental models:

  • High latency? Think: Spanner CPU → Spanner resource limits → Network
  • Errors? Think: Spanner connection → Service crash → Bad deployment
  • No traffic? Think: Instances down → Load balancer → Deployment
  • Post-deployment weirdness? Check: Recent deployments → Service logs → Spanner status (see the command sketch after this list)
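To check Cloud Run instances and recent deployments directly (the "No traffic" and "Post-deployment" paths above), a gcloud sketch like the following can help; the --region value is an assumption and should be replaced with the region named in the alert.

# Recent revisions of topology-rest: look for a revision created near the incident start
gcloud run revisions list \
  --service="topology-rest" \
  --project="gitlab-runway-topo-svc-prod" \
  --region="us-east1"

# Current serving state of the service
gcloud run services describe "topology-rest" \
  --project="gitlab-runway-topo-svc-prod" \
  --region="us-east1"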

Never roll back. Always roll forward. Spanner schema migrations are one-way only.

Why We Roll Forward Only

Topology Service uses a dual-codebase deployment pattern and Cloud Spanner schema migrations. Rolling back code risks:

(1) deployment/application version mismatches causing failed deploys.

(2) schema incompatibility causing startup failures or database corruption during the zero-instance deployment window.

To recover safely when an incident occurs and a code change is suspected as the root cause:

  • Create a revert: branch from known-good code, rebase onto main, add forward-compatible schema changes if needed, then deploy normally, requesting approvals as usual (see the sketch below).
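A minimal git sketch of that roll-forward flow, assuming the offending commit is known (the branch name and SHA are placeholders):

# Revert the suspect commit as a new, forward-moving commit on top of main
git fetch origin
git checkout -b revert-topology-regression origin/main
git revert <suspect-sha>
# Add forward-compatible schema changes here if the revert needs them, then:
git push -u origin revert-topology-regression
# Open an MR and request approvals as for any normal change; deploy via the usual pipeline.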

Two independent metric paths:

Query via Grafana Explore (mimir-runway datasource).
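For example, a starting-point query in Grafana Explore could look like the following; the metric and label names are assumptions (standard gRPC server metrics), not confirmed names from the Topology Service dashboards, so adjust them to match whatever the existing dashboards use.

# Hypothetical error-rate query for topology-grpc; metric and label names are assumptions
sum by (grpc_method) (
  rate(grpc_server_handled_total{grpc_code!="OK"}[5m])
)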

  • Access expires: PAM grants have a limited duration (1800 seconds in the example above), so re-request if you still need access
  • Document everything: Add findings to incident timeline
  • Escalate early: Team prefers early escalation over solo struggle
  • Roll forward, never roll back: Always deploy fixes via new commits
  • When in doubt: Ask in #g_cells_infrastructure or #f_protocells