component_saturation_slo_out_of_bounds:redis_memory
Overview
This alert is triggered when the redis_memory saturation resource for the Redis Sidekiq service goes out of its SLO bounds. This resource measures how much of a node’s total memory Redis is using. Note that Redis Sidekiq is a Sentinel deployment, not a Cluster like Redis SharedState; the shard label in this case identifies the Redis deployment name, not a cluster shard.
As Redis memory saturates node memory, the likelihood of OOM kills (possibly to the Redis process itself) increases. Known causes include:
- A worker flooding a queue. Most often this is a worker whose jobs are deferred or throttled (to relieve pressure elsewhere) but then accumulate in Redis until memory is exhausted. This is the most common cause and the pattern behind past redis-sidekiq memory incidents.
- Large Sidekiq queue backlogs (e.g. an unbounded default queue).
- Very large job payloads delivered via Sidekiq.
- Memory fragmentation increasing RSS beyond the logical dataset size.
The threshold is kept deliberately low to leave headroom for short-term spikes from Redis snapshotting (copy-on-write), which grows with the rate of change in Redis.
Services
Refer to the service catalogue for the service owners and escalation.
Metrics
This alert is based on the redis_memory saturation point, which compares Redis memory usage (redis_memory_used_rss_bytes / redis_memory_used_bytes) against total node memory (node_memory_MemTotal_bytes), per node.
This resource has a soft and a hard SLO threshold. The thresholds are kept intentionally low to leave headroom: Redis RDB snapshots can put short-term memory pressure on a node via copy-on-write (which grows with the rate of change in Redis), so alerting well before the node is full gives due warning before that pressure can tip the node over.
The exact query and threshold values are defined in code and may change over time; see the Definitions section for the authoritative values.
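As a rough worked example of the ratio described above, the sketch below computes Redis RSS as a percentage of node memory from two byte counts. The function name `redis_mem_saturation_pct` is hypothetical and the numbers are made up; the authoritative expression remains the one referenced in the Definitions section, and this only illustrates the arithmetic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: Redis RSS as a percentage of total node memory.
# Arguments: <redis_rss_bytes> <node_memtotal_bytes>
redis_mem_saturation_pct() {
  awk -v rss="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "invalid total" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", (rss / total) * 100
  }'
}

# Example with made-up numbers: 48 GiB RSS on a 64 GiB node.
redis_mem_saturation_pct $((48 * 1024**3)) $((64 * 1024**3))   # prints 75.0
```

Comparing the result against the soft and hard thresholds from the Definitions section shows how much headroom remains for copy-on-write spikes.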
Alert Behavior
This alert fires when saturation stays above the hard SLO threshold for the rule’s for duration (gitlab_component_saturation:ratio{component="redis_memory"} > slo:max:hard:...). At that point the resource is close to its capacity limit and at risk of OOM kills.
The soft SLO is lower and is used for capacity-planning warnings and dashboard banding rather than this paging alert.
There are no automated silencing rules. Silence only if the reported value is confirmed inaccurate and a corrective change is due for deployment soon.
The alert expression and for duration are defined in code and may change over time; see the Definitions section for the authoritative values.
Severities
s2, pages (alert_type: cause, pager: pagerduty). The pager: pagerduty label marks the alert as paging; since the GitLab.com EOC/IMOC cutover, gprd paging alerts route to Incident.io.
Saturation past the hard threshold risks OOM kills of the Redis process, which is a significant availability event for Sidekiq.
Verification
- Identify the saturated node/deployment (start here):
  - redis_memory saturation dashboard (uid: alerts-sat_redis_memory): find which node or Redis deployment is over the SLO bound.
  - redis-sidekiq: Overview
- Sidekiq workload (attribute growth to a specific deployment/queue/worker):
  - Queue length: per-worker queue length is useful for identifying a worker with a large set of enqueued jobs.
  - sidekiq: Worker Detail: per-worker enqueue rate, queue size, and execution time; useful for spotting a worker flooding a queue and job processing rate.
  - sidekiq: Shard Detail: drill down from a saturated deployment to its queues and workers. The Job Argument Size by Worker panel links out to a Kibana table of per-worker argument byte size for the selected shard (measured at job execution time, args only); useful as a reference for per-worker volume and for mapping destination_shard_redis to worker.
  - sidekiq: Queue Detail: per-queue enqueue/execution rates and backlog.
  - Skipped jobs: the sidekiq_jobs_skipped_total counter tracks jobs that are deferred or dropped rather than executed (labels: worker, action=deferred/dropped, reason=feature_flag/database_health_check). A worker with a high deferred rate is accumulating its backlog in redis-sidekiq and is the prime suspect for this saturation alert.
  - Enqueue rate: the sidekiq_enqueued_jobs_total counter is incremented once per job enqueued by a Sidekiq client (labels include worker, queue, feature_category, scheduling=immediate/delayed, and destination_shard_redis). Pair it with the skipped-jobs counter to spot a worker flooding a queue: a high enqueue rate combined with deferred execution is the classic pattern behind this alert. The destination_shard_redis label maps a worker directly to the saturated deployment via the Troubleshooting deployment-to-shard table.
- Rule out other causes:
  - Compare redis_memory_used_bytes vs redis_memory_used_rss_bytes to identify fragmentation.
  - host stats (per-host): drill into CPU/memory/IO for the affected node once identified.
- Logs: system logs
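For the fragmentation comparison listed under "Rule out other causes" above, a minimal sketch that derives the RSS-to-dataset ratio from `INFO memory` output on the node itself. The function name `redis_frag_ratio` is hypothetical, and the usage line assumes the `gitlab-redis-cli` wrapper mentioned elsewhere in this runbook forwards its arguments to `redis-cli`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `INFO memory` output on stdin and prints
# used_memory_rss / used_memory to two decimal places.
redis_frag_ratio() {
  # INFO replies use CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory"     { used = $2 }
    $1 == "used_memory_rss" { rss = $2 }
    END {
      if (used <= 0) { exit 1 }   # field missing or zero: no ratio to report
      printf "%.2f\n", rss / used
    }'
}

# Usage on the affected node (assumes the wrapper forwards arguments):
#   sudo gitlab-redis-cli INFO memory | redis_frag_ratio
```

A single reading says little on its own; it is more useful when compared with the same node's history on the dashboards.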
Recent changes
Redis Sidekiq runs on VMs managed via Chef. Recent configuration changes are found in the chef-repo and in the service’s Chef roles, e.g. gprd-base-db-redis-server-sidekiq.json.
Troubleshooting
Under normal operation Redis Sidekiq sits well within its memory budget. Saturation almost always traces back to a single worker flooding a queue, most often because its jobs are being deferred rather than executed (the pattern behind past redis-sidekiq memory incidents). Prioritize identifying that worker.
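One way to rank suspects is to query the deferred-job rate per worker. The sketch below builds a PromQL expression from the sidekiq_jobs_skipped_total counter and its worker and action labels as described in this runbook, and sends it to the standard Prometheus HTTP query endpoint. The `PROM_URL` variable, both function names, and the 5-minute default window are assumptions for illustration, not part of the documented tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a PromQL query for the per-worker deferred rate.
# Argument: rate window (defaults to 5m).
deferred_rate_query() {
  local window="${1:-5m}"
  printf 'sum by (worker) (rate(sidekiq_jobs_skipped_total{action="deferred"}[%s]))\n' "$window"
}

# Hypothetical helper: run the query against a Prometheus-compatible API.
# PROM_URL is an assumption; point it at whichever endpoint you normally use.
query_deferred_rate() {
  curl --silent --get "${PROM_URL:?set PROM_URL first}/api/v1/query" \
    --data-urlencode "query=$(deferred_rate_query "$1")"
}
```

Any worker returned with a non-zero rate is a candidate for the backlog filling the queue; the same expression can be pasted into a dashboard explore view instead of using curl.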
- Identify which node / Redis deployment is saturated (see Verification).
- Map the saturated Redis deployment to its destination_shard_redis shard:

  | Redis deployment | Redis Destination Shard |
  | --- | --- |
  | redis-sidekiq-NN-db-gprd (plain) | main |
  | redis-sidekiq-catchall-a-NN-db-gprd | queues_shard_catchall_a |
  | redis-sidekiq-catchall-b-NN-db-gprd | queues_shard_catchall_b |

  The Job Argument Size by Worker panel on the sidekiq: Shard Detail dashboard links to a Kibana table that shows both destination_shard_redis and the actual sidekiq_shard per worker, a handy reference for this mapping.
- Find if a worker is driving the growth. Precise attribution is inherently limited: there is no full point-in-time snapshot of queue contents, and job count is not a reliable proxy for memory because per-job payload size varies. Treat the signals below as ways to identify a suspect, not conclusive proof:
  - Check for deferred jobs first. List currently deferred workers. A worker with run_sidekiq_jobs_{WorkerName} disabled will accumulate backlog in the redis-sidekiq queue.

        /chatops gitlab run feature list --match run_sidekiq_jobs

    The sidekiq_jobs_skipped_total counter (labels worker, action=deferred/dropped, reason=feature_flag/database_health_check) confirms which workers are actually being deferred or dropped and at what rate: a non-zero deferred rate identifies the worker filling the queue. The SidekiqJobsSkippedTooLong alert fires when a worker is deferred via feature flag for over 3 hours, a common sign of a forgotten run_sidekiq_jobs_{WorkerName} flag.
  - Use the sidekiq: Worker Detail dashboard to check a worker’s enqueue rate, queue size, and execution time. Drill down from sidekiq: Shard Detail (left-click a job line to open Worker Detail) to find which worker is flooding the saturated shard’s queues.
  - Inspect what is currently running with the Ruby snippets in Sidekiq inspection (sudo gitlab-rails console + Sidekiq::Workers.new).
- If no worker is implicated, check other causes:
  - A sharp, short-lived spike that is not matched by dataset growth points to a Redis RDB snapshot, where a high rate-of-change drives copy-on-write memory overhead. Correlate the spike timing against snapshot activity rather than assuming dataset growth.
  - Memory fragmentation: compare redis_memory_used_rss_bytes to redis_memory_used_bytes. Their ratio (redis_mem_fragmentation_ratio) baselines differently per instance, so judge it against that node’s own history rather than a fixed value. A high but flat ratio is usually benign. Fragmentation only matters when absolute RSS trends toward node memory. Dataset growth is the more common cause here.
  - Stuck or oversized keys driving dataset growth.
  - For sustained or recurring saturation, run offline key-space analysis of a Redis RDB dump (see Memory space analysis with cupcake-rdb).
- Confirm whether a kernel OOM kill has already occurred on the affected node: search the system logs in Kibana for oom-killer or Out of memory: Killed process messages, scoped to the affected node. For events older than the index retention, check dmesg or persisted kernel logs on the node directly. This is distinct from Redis hitting its own maxmemory limit.
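For the on-node OOM check, a minimal sketch that filters kernel log text for the two message patterns named above. The function name `filter_oom_events` is hypothetical; `dmesg -T` and `journalctl -k` are the usual sources, though which of them still holds the relevant window depends on the node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only kernel log lines that indicate an OOM kill.
# Reads log text on stdin; exits non-zero when nothing matches.
filter_oom_events() {
  grep -E 'oom-killer|Out of memory: Killed process'
}

# Usage on the affected node:
#   sudo dmesg -T | filter_oom_events
#   sudo journalctl -k --since "2 hours ago" | filter_oom_events
```

A matching line that names the Redis server process confirms a kernel kill rather than Redis reaching its own maxmemory limit.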
Possible Resolutions
- If a single worker is flooding a queue, the lever that actually relieves memory pressure is to drop its jobs via ChatOps. Deferring (run_sidekiq_jobs_{WorkerName}) keeps the backlog in Redis, so it does not help here and can cause a thundering herd on re-enable; dropping (drop_sidekiq_jobs_{WorkerName}) discards the jobs and frees the memory. Only drop jobs that can be safely discarded (see Disabling a worker for the full drop-vs-defer flow):

      # drop: discard jobs entirely (use only when the backlog can be safely thrown away)
      /chatops gitlab run feature set drop_sidekiq_jobs_WorkerName true --ignore-feature-flag-consistency-check --ignore-production-check
- Scale up (resize) node memory when the working set has legitimately grown, or as a short-term mitigation to restore memory headroom while the underlying cause is investigated. Redis Sidekiq runs on VMs, so this is a node/VM resize, not a Redis Cluster scaling operation. The resize is a manual operation done under a change request, because Terraform will not apply a machine_type change on its own: the redis-sidekiq instances are created from the external generic-stor module (Terraform source ops.gitlab.net/gitlab-com/generic-stor/google), whose lifecycle block sets ignore_changes = [..., machine_type], and each module instantiation in config-mgmt (environments/gprd/main.tf) passes allow_stopping_for_update = false.
  - Pick the right deployment: each redis-sidekiq deployment has its own key (redis-sidekiq, redis-sidekiq-catchall-a, redis-sidekiq-catchall-b) and they can differ in size. Resize the deployment you identified in the Troubleshooting section (the deployment-to-shard mapping table), not necessarily the plain redis-sidekiq key.
  - Before operating on any node, check its role with ssh <vm-fqdn> -- sudo gitlab-redis-cli ROLE. Operate on replicas first; for the primary, perform a Sentinel failover to a healthy replica first, then resize the demoted (now-replica) old primary. This is the established pattern for redis-sentinel VM operations (see the Redis survival guide for SREs). Track each production node’s resize under its own change issue in gitlab-com/gl-infra/production.
  - Resize the VM directly with the gcloud CLI: stop the instance, set the machine type, then start it.

        gcloud compute instances stop <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID>
        gcloud compute instances set-machine-type <VM_NAME> --machine-type=<NEW_MACHINE_TYPE> --zone=<ZONE> --project=<PROJECT_ID>
        gcloud compute instances start <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID>

  - After the VM is back up, verify it: tail the startup script for Chef errors with gcloud compute instances tail-serial-port-output <VM_NAME> --project=<PROJECT_ID> --zone=<ZONE> | grep startup-script, confirm the node’s role with gitlab-redis-cli ROLE, and check service Apdex and error rates on the redis-sidekiq: Overview dashboard.
  - Backport the change to config-mgmt to keep the declared state in sync: update the machine_types entry for the affected deployment in environments/gprd/variables.tf. Because Terraform ignores machine_type, an atlantis plan after the backport should be a no-op. See config-mgmt!14110 (bump size of redis-sidekiq-catchall-b) as an example.
  - The node’s Redis software/config is managed separately via the chef-repo gprd-base-db-redis-server-sidekiq.json role, which does not control VM size.
- Address the upstream cause of growth (queue backlog, large job payloads, fragmentation).
- For sustained growth, revisit capacity planning for the affected deployment.
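The replica-first rule in the resize procedure above can be turned into a small guard. This sketch reads the first line of `ROLE` output, which Redis reports as `master` or `slave`, and refuses to proceed on a primary. The function name `safe_to_resize` is hypothetical, and it assumes `gitlab-redis-cli` prints the raw reply the way `redis-cli` does when its output is piped.

```shell
#!/usr/bin/env bash
# Hypothetical guard: reads `ROLE` output on stdin and succeeds only for a replica.
safe_to_resize() {
  local role
  # The first line of the reply is the role; strip CRs and any quoting.
  role="$(head -n 1 | tr -d '\r"')"
  case "$role" in
    slave)  echo "replica: safe to resize" ;;
    master) echo "primary: fail over via Sentinel first" >&2; return 1 ;;
    *)      echo "unrecognized role: ${role:-<empty>}" >&2; return 1 ;;
  esac
}

# Usage (built on the runbook's own role check):
#   ssh <vm-fqdn> -- sudo gitlab-redis-cli ROLE | safe_to_resize
```

Chaining the guard before the stop command, for example with `&&`, keeps a mistyped hostname from stopping the current primary.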
Dependencies
Internal and external dependencies which could potentially cause this alert:
- Sidekiq application / workers: the most common cause is a single worker flooding a queue, especially when its jobs are deferred via run_sidekiq_jobs_* ChatOps flags and accumulate in Redis. Large job payloads and unbounded queue backlogs also contribute.
- Redis internal memory behavior: RDB snapshot copy-on-write under a high rate-of-change, and memory fragmentation (RSS exceeding the logical dataset size).
- Underlying node/VM memory: total node memory is the saturation denominator; an undersized VM raises the ratio for the same Redis footprint.
Escalation
If the issue cannot be resolved quickly, escalate to the appropriate engineering or operations team for further investigation. Refer to the service catalogue for the service owners and escalation.
Definitions
The definition for this alert can be found at:
- saturation-monitoring/redis_memory.libsonnet
- redis-sidekiq saturation alerts (gprd)
- redis-sidekiq saturation alerts (gstg)
Related Links
- Redis Sidekiq service
- Sidekiq survival guide for SREs
- Disabling a worker: how to drop a misbehaving Sidekiq worker
- Sidekiq inspection: inspecting Sidekiq running state from the Rails console
- Memory space analysis with cupcake-rdb: offline key-space analysis of Redis RDB dumps