component_saturation_slo_out_of_bounds:redis_memory

Overview

This alert is triggered when the redis_memory saturation resource for the Redis Sidekiq service goes out of its SLO bounds. This resource measures how much of a node’s total memory Redis is using. Note that Redis Sidekiq is a Sentinel-managed deployment, not a Redis Cluster like Redis SharedState; the shard label in this case identifies the Redis deployment name, not a cluster shard.

As Redis memory saturates node memory, the likelihood of OOM kills (possibly of the Redis process itself) increases. Known causes include:

- A worker flooding a queue. Most often this is a worker whose jobs are deferred or throttled (to relieve pressure elsewhere) but then accumulate in Redis until memory is exhausted. This is the most common cause and the pattern behind past redis-sidekiq memory incidents.
- Large Sidekiq queue backlogs (e.g. an unbounded default queue).
- Very large job payloads delivered via Sidekiq.
- Memory fragmentation increasing RSS beyond the logical dataset size.

The threshold is kept deliberately low to leave headroom for short-term spikes from Redis snapshotting (copy-on-write), which grows with the rate of change in Redis.

Services

Refer to the service catalogue for the service owners and escalation.

Metrics

This alert is based on the redis_memory saturation point, which compares Redis memory usage (from redis_memory_used_rss_bytes and redis_memory_used_bytes) against total node memory (node_memory_MemTotal_bytes), per node.

This resource has a soft and a hard SLO threshold. The thresholds are kept intentionally low to leave headroom: Redis RDB snapshots can put short-term memory pressure on a node via copy-on-write (which grows with the rate of change in Redis), so alerting well before the node is full gives due warning before that pressure can tip the node over.

The exact query and threshold values are defined in code and may change over time; see the Definitions section for the authoritative values.
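
As a rough illustration of the ratio described above, the sketch below computes saturation from three sampled values. The helper name redis_memory_saturation is hypothetical, and taking the larger of the RSS and used figures as the numerator is an assumption made for illustration; the authoritative query is the one defined in code.

```shell
# Hypothetical helper: Redis memory as a fraction of node memory.
# Assumption: the numerator is the larger of RSS and logical used bytes.
# $1 = redis_memory_used_rss_bytes
# $2 = redis_memory_used_bytes
# $3 = node_memory_MemTotal_bytes
redis_memory_saturation() {
  awk -v rss="$1" -v used="$2" -v total="$3" 'BEGIN {
    if (total + 0 <= 0) { print "total must be positive" > "/dev/stderr"; exit 1 }
    # "+ 0" forces a numeric comparison rather than a string one.
    top = (rss + 0 > used + 0) ? rss : used
    printf "%.3f\n", top / total
  }'
}

# Usage:
#   redis_memory_saturation 48000000000 40000000000 64000000000   # prints 0.750
```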

Alert Behavior

This alert fires when saturation stays above the hard SLO threshold for the duration set in the rule’s for clause (gitlab_component_saturation:ratio{component="redis_memory"} > slo:max:hard:...). At that point the resource is close to its capacity limit and at risk of OOM kills.

The soft SLO is lower and is used for capacity-planning warnings and dashboard banding rather than this paging alert.

There are no automated silencing rules. Silence only if the reported value is confirmed inaccurate and a corrective change is due for deployment soon.

The alert expression and for duration are defined in code and may change over time; see the Definitions section for the authoritative values.
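
The for condition can be pictured as requiring the ratio to exceed the hard threshold on every evaluation across the window. A minimal sketch, assuming one saturation sample per rule evaluation on stdin; the function name would_fire, the threshold 0.85, and the sample count 3 are hypothetical and are not the values defined in code.

```shell
# would_fire HARD_THRESHOLD REQUIRED_SAMPLES
# Reads one saturation ratio per line (oldest first) and reports whether the
# most recent REQUIRED_SAMPLES evaluations were all above HARD_THRESHOLD.
would_fire() {
  awk -v hard="$1" -v need="$2" '
    # Extend the run while above the threshold; any dip resets it to zero.
    { run = ($1 + 0 > hard + 0) ? run + 1 : 0 }
    END { print ((run >= need + 0) ? "firing" : "not firing") }
  '
}

# Usage:
#   printf '0.80\n0.90\n0.92\n0.91\n' | would_fire 0.85 3   # prints firing
```

A single sample below the threshold resets the run, which is why a short dip delays the page even when saturation is otherwise high.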

Severities

s2, pages (alert_type: cause, pager: pagerduty). The pager: pagerduty label marks the alert as paging; since the GitLab.com EOC/IMOC cutover, gprd paging alerts route to Incident.io.

Saturation past the hard threshold risks OOM kills of the Redis process, a significant availability event for Sidekiq.

Verification

Identify the saturated node/deployment (start here):
- redis_memory saturation dashboard (uid: alerts-sat_redis_memory): find which node or Redis deployment is over the SLO bound.
- redis-sidekiq: Overview
Sidekiq workload (attribute growth to a specific deployment/queue/worker):
- Queue length: per-worker queue length is useful for identifying a worker with a large set of enqueued jobs.
- sidekiq: Worker Detail: per-worker enqueue rate, queue size, and execution time; useful for spotting a worker flooding a queue and for checking its job processing rate.
- sidekiq: Shard Detail: drill down from a saturated deployment to its queues and workers. The Job Argument Size by Worker panel links out to a Kibana table of per-worker argument byte size for the selected shard (measured at job execution time, args only); useful as a reference for per-worker volume and for mapping destination_shard_redis to worker.
- sidekiq: Queue Detail: per-queue enqueue/execution rates and backlog.
- Skipped jobs: the sidekiq_jobs_skipped_total counter tracks jobs that are deferred or dropped rather than executed (labels: worker, action = deferred/dropped, reason = feature_flag/database_health_check). A worker with a high deferred rate is accumulating its backlog in redis-sidekiq and is the prime suspect for this saturation alert.
- Enqueue rate: the sidekiq_enqueued_jobs_total counter is incremented once per job enqueued by a Sidekiq client (labels include worker, queue, feature_category, scheduling = immediate/delayed, and destination_shard_redis). Pair it with the skipped-jobs counter to spot a worker flooding a queue: a high enqueue rate combined with deferred execution is the classic pattern behind this alert. The destination_shard_redis label maps a worker directly to the saturated deployment via the Troubleshooting deployment-to-shard table.
Rule out other causes:
- Compare redis_memory_used_bytes vs redis_memory_used_rss_bytes to identify fragmentation.
- host stats (per-host): drill into CPU/memory/IO for the affected node once identified.
Logs: system logs
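
The skipped-jobs and enqueue counters above can be turned into per-worker rates. The sketch below only prints PromQL; it assumes a 5m rate window and that the counters carry the labels listed above. The function names, and the PROM_URL variable in the usage comment, are hypothetical.

```shell
# Print a PromQL query for the per-worker rate of deferred jobs.
deferred_rate_query() {
  printf '%s\n' 'sum by (worker) (rate(sidekiq_jobs_skipped_total{action="deferred"}[5m]))'
}

# Print a PromQL query for the per-worker enqueue rate into one Redis shard.
# $1 = destination_shard_redis value, for example main
enqueue_rate_query() {
  printf 'sum by (worker) (rate(sidekiq_enqueued_jobs_total{destination_shard_redis="%s"}[5m]))\n' "$1"
}

# Usage against a Prometheus-compatible HTTP API (PROM_URL is a placeholder):
#   curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(deferred_rate_query)"
```

A worker that ranks high in both results is the pattern the list above describes: a high enqueue rate combined with deferred execution.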

Recent changes

Redis Sidekiq runs on VMs managed via Chef. Recent configuration changes are found in the chef-repo and in the service’s Chef roles, e.g. gprd-base-db-redis-server-sidekiq.json.

Troubleshooting

Under normal operation Redis Sidekiq sits well within its memory budget. Saturation almost always traces back to a single worker flooding a queue, most often because its jobs are being deferred rather than executed (the pattern behind past redis-sidekiq memory incidents). Prioritize identifying that worker.

Identify which node / Redis deployment is saturated:
- redis_memory saturation dashboard.
Map the saturated Redis deployment to its destination_shard_redis shard:

| Redis deployment | Redis destination shard |
| --- | --- |
| `redis-sidekiq-NN-db-gprd` (plain) | `main` |
| `redis-sidekiq-catchall-a-NN-db-gprd` | `queues_shard_catchall_a` |
| `redis-sidekiq-catchall-b-NN-db-gprd` | `queues_shard_catchall_b` |

The Job Argument Size by Worker panel on the sidekiq: Shard Detail dashboard links to a Kibana table that shows both destination_shard_redis and the actual sidekiq_shard per worker, a handy reference for this mapping.
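
The deployment-to-shard mapping can also be expressed as a small lookup for use in scripts. This is a sketch only: destination_shard_for is a hypothetical helper, and the trailing wildcard (to tolerate a fully qualified hostname) is an assumption about how node names are passed in.

```shell
# Map a redis-sidekiq node or deployment name to its destination_shard_redis value.
destination_shard_for() {
  case "$1" in
    redis-sidekiq-catchall-a-*-db-gprd*) echo queues_shard_catchall_a ;;
    redis-sidekiq-catchall-b-*-db-gprd*) echo queues_shard_catchall_b ;;
    redis-sidekiq-[0-9]*-db-gprd*)       echo main ;;
    *) echo "unknown deployment: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   destination_shard_for redis-sidekiq-catchall-a-02-db-gprd   # prints queues_shard_catchall_a
```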
Find if a worker is driving the growth. Precise attribution is inherently limited: there is no full point-in-time snapshot of queue contents, and job count is not a reliable proxy for memory because per-job payload size varies. Treat the signals below as ways to identify a suspect, not conclusive proof:
- Check for deferred jobs first. List currently deferred workers. A worker with run_sidekiq_jobs_{WorkerName} disabled will accumulate backlog in the redis-sidekiq queue.
```
/chatops gitlab run feature list --match run_sidekiq_jobs
```
  The sidekiq_jobs_skipped_total counter (labels worker, action = deferred/dropped, reason = feature_flag/database_health_check) confirms which workers are actually being deferred or dropped and at what rate; a non-zero deferred rate identifies the worker filling the queue. The SidekiqJobsSkippedTooLong alert fires when a worker is deferred via feature flag for over 3 hours, a common sign of a forgotten run_sidekiq_jobs_{WorkerName} flag.
- Use the sidekiq: Worker Detail dashboard to check a worker’s enqueue rate, queue size, and execution time. Drill down from sidekiq: Shard Detail (left-click a job line to open Worker Detail) to find which worker is flooding the saturated shard’s queues.
- Inspect what is currently running with the Ruby snippets in Sidekiq inspection (sudo gitlab-rails console + Sidekiq::Workers.new).
If no worker is implicated, check other causes:
- A sharp, short-lived spike that is not matched by dataset growth points to a Redis RDB snapshot, where a high rate-of-change drives copy-on-write memory overhead. Correlate the spike timing against snapshot activity rather than assuming dataset growth.
- Memory fragmentation: compare redis_memory_used_rss_bytes to redis_memory_used_bytes. Their ratio (redis_mem_fragmentation_ratio) baselines differently per instance, so judge it against that node’s own history rather than a fixed value. A high but flat ratio is usually benign. Fragmentation only matters when absolute RSS trends toward node memory. Dataset growth is the more common cause here.
- Stuck or oversized keys driving dataset growth.
- For sustained or recurring saturation, run offline key-space analysis of a Redis RDB dump (see Memory space analysis with cupcake-rdb).
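
For the fragmentation check above, it helps to look at the ratio and the absolute RSS share together. A sketch with a hypothetical helper, fragmentation_report, that takes sampled byte values as arguments; it only reports the two numbers and leaves the comparison against the node's own history to the reader.

```shell
# fragmentation_report RSS_BYTES USED_BYTES NODE_TOTAL_BYTES
fragmentation_report() {
  awk -v rss="$1" -v used="$2" -v total="$3" 'BEGIN {
    if (used + 0 <= 0 || total + 0 <= 0) exit 1
    # ratio: RSS relative to the logical dataset.
    # rss_share: RSS relative to total node memory.
    printf "ratio=%.2f rss_share=%.2f\n", rss / used, rss / total
  }'
}

# Usage:
#   fragmentation_report 48000000000 32000000000 64000000000
#   # prints ratio=1.50 rss_share=0.75
```

A high ratio with a low rss_share matches the benign case described above; the concerning case is rss_share trending upward over time.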
Confirm whether a kernel OOM kill has already occurred on the affected node: search the system logs in Kibana for oom-killer or Out of memory: Killed process messages, scoped to the affected node. For events older than the index retention, check dmesg or persisted kernel logs on the node directly. This is distinct from Redis hitting its own maxmemory limit.
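
On the node itself, the kernel-log check above amounts to a pattern match. A small sketch, with find_oom_kills as a hypothetical helper name; it reads log text on stdin so it can be fed from either dmesg or journalctl -k.

```shell
# Keep only kernel log lines carrying one of the two OOM signatures named above.
find_oom_kills() {
  grep -E 'oom-killer|Out of memory: Killed process'
}

# Usage (run on the affected node):
#   sudo dmesg -T | find_oom_kills
#   sudo journalctl -k | find_oom_kills
```

grep exits non-zero when nothing matches, which here simply means no OOM kill was found in the logs supplied.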

Possible Resolutions

If a single worker is flooding a queue, the lever that actually relieves memory pressure is to drop its jobs via ChatOps. Deferring (run_sidekiq_jobs_{WorkerName}) keeps the backlog in Redis, so it does not help here and can cause a thundering herd when the worker is re-enabled; dropping (drop_sidekiq_jobs_{WorkerName}) discards the jobs and frees the memory. Only drop jobs that can be safely discarded (see Disabling a worker for the full drop-vs-defer flow):
```
# drop: discard jobs entirely (use only when the backlog can be safely thrown away)
/chatops gitlab run feature set drop_sidekiq_jobs_WorkerName true --ignore-feature-flag-consistency-check --ignore-production-check
```
Scale up (resize) node memory when the working set has legitimately grown, or as a short-term mitigation to restore memory headroom while the underlying cause is investigated. Redis Sidekiq runs on VMs, so this is a node/VM resize, not a Redis Cluster scaling operation. The resize is a manual operation done under a change request, because Terraform will not apply a machine_type change on its own: the redis-sidekiq instances are created from the external generic-stor module (Terraform source ops.gitlab.net/gitlab-com/generic-stor/google), whose lifecycle block sets ignore_changes = [..., machine_type], and each module instantiation in config-mgmt (environments/gprd/main.tf) passes allow_stopping_for_update = false.
- Pick the right deployment: each redis-sidekiq deployment has its own key (redis-sidekiq, redis-sidekiq-catchall-a, redis-sidekiq-catchall-b) and they can differ in size. Resize the deployment you identified in the Troubleshooting section (the deployment-to-shard mapping table), not necessarily the plain redis-sidekiq key.
- Before operating on any node, check its role with ssh <vm-fqdn> -- sudo gitlab-redis-cli ROLE. Resize replicas first. For the primary, perform a Sentinel failover to a healthy replica, then resize the demoted (now-replica) old primary. This is the established pattern for redis-sentinel VM operations (see the Redis survival guide for SREs). Track each production node’s resize under its own change issue in gitlab-com/gl-infra/production.
- Resize the VM directly with the gcloud CLI: stop the instance, set the machine type, then start it.
```
gcloud compute instances stop <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID>
gcloud compute instances set-machine-type <VM_NAME> --machine-type=<NEW_MACHINE_TYPE> --zone=<ZONE> --project=<PROJECT_ID>
gcloud compute instances start <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID>
```
- After the VM is back up, verify it: tail the startup script for chef errors with gcloud compute instances tail-serial-port-output <VM_NAME> --project=<PROJECT_ID> --zone=<ZONE> | grep startup-script, confirm the node’s role with gitlab-redis-cli ROLE, and check service Apdex and error rates on the redis-sidekiq: Overview dashboard.
- Backport the change to config-mgmt to keep the declared state in sync: update the machine_types entry for the affected deployment in environments/gprd/variables.tf. Because Terraform ignores machine_type, an atlantis plan after the backport should be a no-op. See config-mgmt!14110 (bump size of redis-sidekiq-catchall-b) as an example.
- The node’s Redis software/config is managed separately via the chef-repo gprd-base-db-redis-server-sidekiq.json role, which does not control VM size.
Address the upstream cause of growth (queue backlog, large job payloads, fragmentation).
For sustained growth, revisit capacity planning for the affected deployment.
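
The role check that precedes any resize can be scripted as a guard. A sketch, assuming gitlab-redis-cli prints ROLE output the way redis-cli does, with the role name (master or slave) on the first line; is_replica is a hypothetical helper.

```shell
# Succeeds only when the ROLE output on stdin reports a replica.
# Redis names the replica role "slave" in ROLE output.
is_replica() {
  head -n 1 | grep -q 'slave'
}

# Usage: refuse to continue on a primary.
#   ssh <vm-fqdn> -- sudo gitlab-redis-cli ROLE | is_replica \
#     || { echo "node is not a replica; fail over first" >&2; exit 1; }
```

Only the first line is inspected, so replication details on later lines of the ROLE output cannot cause a false match.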

Dependencies

Internal and external dependencies which could potentially cause this alert:

- Sidekiq application / workers: the most common cause is a single worker flooding a queue, especially when its jobs are deferred via run_sidekiq_jobs_* ChatOps flags and accumulate in Redis. Large job payloads and unbounded queue backlogs also contribute.
- Redis internal memory behavior: RDB snapshot copy-on-write under a high rate-of-change, and memory fragmentation (RSS exceeding the logical dataset size).
- Underlying node/VM memory: total node memory is the saturation denominator; an undersized VM raises the ratio for the same Redis footprint.

Escalation

If the issue cannot be resolved quickly, escalate to the appropriate engineering or operations team for further investigation. Refer to the service catalogue for the service owners and escalation.

Definitions

The definition for this alert and the related documentation can be found at:

- Redis Sidekiq service
- Sidekiq survival guide for SREs
- Disabling a worker: how to drop a misbehaving Sidekiq worker
- Sidekiq inspection: inspecting Sidekiq running state from the Rails console
- Memory space analysis with cupcake-rdb: offline key-space analysis of Redis RDB dumps