Kubernetes Service
Service Overview
Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22kube%22%2C%20tier%3D%22inf%22%7D
Label: gitlab-com/gl-infra/production~“Service::Kube”
Logging
Stackdriver