Pipeline Processing Sidekiq Worker SLO Violations
Covers:
- CiOrchestrationServicePipelineProcessingSidekiqQueueingApdexSLOViolation
- CiOrchestrationServicePipelineProcessingSidekiqExecutionApdexSLOViolation
- CiOrchestrationServicePipelineProcessingSidekiqExecutionErrorSLOViolation
Overview
These alerts fire when pipeline processing Sidekiq workers violate their SLO burn rate thresholds, indicating that pipeline state transitions are either slow (apdex) or failing (error rate).
Pipeline processing workers (Ci::InitialPipelineProcessWorker, PipelineProcessWorker, Ci::BuildFinishedWorker, BuildQueueWorker) handle the core pipeline state machine: transitioning jobs between states, processing build completions, and queuing the next set of jobs. Degradation here impacts how quickly a pipeline progresses from one stage to the next.
Impact
- Pipelines appear to “hang” between stages
- Job completion events are delayed (runner finishes but status doesn’t update in UI)
- Cascading delays across multi-stage pipelines
- If error rate is elevated: pipeline state transitions silently failing
Contributing Factors
- Sidekiq concurrency limits being hit (jobs deferred)
- Database contention on CI tables during heavy pipeline activity
- High volume of concurrent pipeline state transitions (e.g., after a mass retry)
- Application errors in pipeline processing logic
- Redis latency affecting Sidekiq job dispatch
Services
- ci-orchestration service overview
- Pipeline Observability dashboard
- Team: Verify
- Slack: #s_verify_alerts
Metrics
Queue Duration Apdex
Uses gitlab_sli_sidekiq_queueing_apdex_success_total / gitlab_sli_sidekiq_queueing_apdex_total filtered to worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker".
- SLO: 99% apdex
- MWMBR fires at: < 94% (6h window) / < 85.6% (1h window)
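As a quick manual check, this apdex ratio can be evaluated directly from the metrics above (a sketch; the 30m window and environment label are illustrative):

# Overall queueing apdex for pipeline processing workers (30m)
sum(rate(gitlab_sli_sidekiq_queueing_apdex_success_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker", environment="gprd"}[30m]))
/
sum(rate(gitlab_sli_sidekiq_queueing_apdex_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker", environment="gprd"}[30m]))

Values below 0.99 mean the SLO is being missed. The execution apdex below can be checked the same way with the gitlab_sli_sidekiq_execution_apdex_* counters.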
Execution Apdex
Uses gitlab_sli_sidekiq_execution_apdex_success_total / gitlab_sli_sidekiq_execution_apdex_total.
- SLO: 99% apdex
- MWMBR fires at: < 94% (6h window) / < 85.6% (1h window)
Execution Error Rate
Uses gitlab_sli_sidekiq_execution_error_total / gitlab_sli_sidekiq_execution_total.
- SLO: 99.95% success rate (errorRatio: 0.9995)
- MWMBR fires at: > 0.3% error rate (6h window) / > 0.72% (1h window)
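To compare the current error ratio against the 6h-window threshold above (a sketch using the same metrics; the environment label is illustrative):

# Execution error ratio over 6h; the long-window condition corresponds to > 0.003
sum(rate(gitlab_sli_sidekiq_execution_error_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker", environment="gprd"}[6h]))
/
sum(rate(gitlab_sli_sidekiq_execution_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker", environment="gprd"}[6h]))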
Alert Behavior
- Severity: S3 (Slack-only, no paging)
- Routes to: #s_verify_alerts
- MWMBR (multi-window, multi-burn-rate): requires both the short window (5m/1h) and the long window (30m/6h) to breach simultaneously
- Silencing: Safe to silence during known Sidekiq maintenance windows or planned deployments. Use an Alertmanager silence with matchers type="ci-orchestration", component=~"pipeline_processing_sidekiq.*"
- Expected frequency: Rare under normal conditions. Most likely to fire during database contention or when concurrency limits are hit
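To illustrate the multi-window condition, the 5m/1h pair for the queueing apdex looks roughly like this (a sketch; it assumes a gitlab_component_apdex:ratio_1h recording rule analogous to the ratio_5m rule shown in the Verification section):

# Short (5m) and long (1h) windows must both breach the 85.6% threshold
(gitlab_component_apdex:ratio_5m{component="pipeline_processing_sidekiq_queueing", type="ci-orchestration", environment="gprd"} < 0.856)
and
(gitlab_component_apdex:ratio_1h{component="pipeline_processing_sidekiq_queueing", type="ci-orchestration", environment="gprd"} < 0.856)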
Severities
Default severity is S3. Consider upgrading to S2 if:
- Pipeline processing is completely stalled
- Ci::BuildFinishedWorker error rate > 5% (job completions not being processed)
- Multiple worker types affected simultaneously
Verification
# Queue duration apdex (5m)
gitlab_component_apdex:ratio_5m{component="pipeline_processing_sidekiq_queueing", type="ci-orchestration", environment="gprd"}

# Execution apdex (5m)
gitlab_component_apdex:ratio_5m{component="pipeline_processing_sidekiq_execution", type="ci-orchestration", environment="gprd"}

# Execution error rate (5m)
gitlab_component_errors:ratio_5m{component="pipeline_processing_sidekiq_execution", type="ci-orchestration", environment="gprd"}

# Concurrency limit deferred jobs (potential cause)
sum by (worker) (rate(sidekiq_concurrency_limit_deferred_jobs_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[5m]))

- ci-orchestration service overview dashboard — burn rate panels
- Pipeline Observability dashboard — Pipeline Processing section
- Kibana — Sidekiq pipeline processing worker logs
Recent Changes
- Production change requests
- Rollback: If a recent deploy is suspected, follow the upgrade and rollback runbook
Troubleshooting
Section titled “Troubleshooting”1. Identify the Affected Worker
Check the Pipeline Observability dashboard “Pipeline Processing” section to see which specific worker is degraded.
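If the dashboard is unavailable, the same breakdown can be pulled from the SLI metrics directly (a sketch; the 30m window is illustrative):

# Execution apdex per worker (30m); the lowest ratio points to the degraded worker
sum by (worker) (rate(gitlab_sli_sidekiq_execution_apdex_success_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[30m]))
/
sum by (worker) (rate(gitlab_sli_sidekiq_execution_apdex_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[30m]))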
2. Check Concurrency Pressure
Pipeline processing workers have concurrency limits. Check the “Pipeline processing workers concurrency” panel on the Pipeline Observability dashboard. If workers are hitting their limit, jobs are deferred instead of executed.
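Deferrals can also be confirmed from the counter used in the Verification section; a sustained nonzero rate means the limit is being hit:

# Workers currently deferring jobs due to concurrency limits
sum by (worker) (rate(sidekiq_concurrency_limit_deferred_jobs_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[5m])) > 0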
3. Check Database Duration
The “Pipeline processing avg DB duration” panel shows average DB time per worker. If DB duration is elevated (> 5s), the root cause is likely database contention:
- Check Patroni CI dashboard for lock contention and replication lag
- Check for long-running transactions on CI tables (see the query sketch below)
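If postgres_exporter metrics are available for the CI database (an assumption; the datname value is illustrative), the longest-running transaction can be spot-checked with:

# Longest-running transaction on the CI database, in seconds
max(pg_stat_activity_max_tx_duration{datname="gitlabhq_production_ci"})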
4. Check for Mass Retries
A large number of pipeline retries (e.g., from a flaky test suite fix) can cause a sudden spike in processing load. Check pipelines_created_total for unusual volume.
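One way to spot a spike is to compare the current pipeline creation rate against the same measurement an hour earlier (a sketch; the offset and window are illustrative):

# Ratio of current pipeline creation rate to one hour ago; a large value suggests a mass retry
sum(rate(pipelines_created_total{environment="gprd"}[5m]))
/
sum(rate(pipelines_created_total{environment="gprd"}[5m] offset 1h))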
5. Check Recent Deployments
See the Recent Changes section above for production change requests and the rollback runbook.
Possible Resolutions
No past incidents have been recorded yet for this alert. This section will be updated as incidents occur.
Dependencies
- Sidekiq: Worker execution environment
- PostgreSQL (CI): Pipeline and job state transitions
- Redis: Sidekiq job queuing, dequeuing, and concurrency limit tracking
Escalation
When to Escalate
- Alert persists > 1 hour with no improvement
- Ci::BuildFinishedWorker failures (critical: job completions not processed)
- Correlated with customer reports of stuck pipelines
Support Channels
- #s_verify_alerts (primary)
- #g_pipeline-execution (Pipeline Execution team)
- #production (if S2+ severity)