Cloudflare Web Application Firewall Service
Service Overview
Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22cloudflare%22%2C%20tier%3D%22lb%22%7D
Label: gitlab-com/gl-infra/production~"Service::Cloudflare"
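The filter in the Alerts link is a percent-encoded label matcher, `{type="cloudflare", tier="lb"}`. As a minimal sketch, the snippet below rebuilds that link from the two labels and decodes it again. The helper name `alerts_url` is hypothetical and only for illustration; the base URL and label values are taken from the link above.

```python
from urllib.parse import quote, unquote

# Base URL taken from the Alerts link on this page.
ALERTS_BASE = "https://alerts.gitlab.net/#/alerts?filter="


def alerts_url(labels: dict[str, str]) -> str:
    """Build an alerts link for the given label matchers (hypothetical helper)."""
    matchers = ", ".join(f'{name}="{value}"' for name, value in labels.items())
    # safe="" forces every reserved character (braces, quotes, commas) to be encoded.
    return ALERTS_BASE + quote("{" + matchers + "}", safe="")


url = alerts_url({"type": "cloudflare", "tier": "lb"})
decoded_filter = unquote(url.split("filter=", 1)[1])

print(url)
print(decoded_filter)
```

Changing the `type` or `tier` values should give the equivalent link for another service, assuming that service's alerts carry the same label names.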
Logging