Pipeline Processing Sidekiq Worker SLO Violations

Covers:

  • CiOrchestrationServicePipelineProcessingSidekiqQueueingApdexSLOViolation
  • CiOrchestrationServicePipelineProcessingSidekiqExecutionApdexSLOViolation
  • CiOrchestrationServicePipelineProcessingSidekiqExecutionErrorSLOViolation

These alerts fire when pipeline processing Sidekiq workers violate their SLO burn rate thresholds, indicating that pipeline state transitions are either slow (apdex) or failing (error rate).

Pipeline processing workers (Ci::InitialPipelineProcessWorker, PipelineProcessWorker, Ci::BuildFinishedWorker, BuildQueueWorker) handle the core pipeline state machine: transitioning jobs between states, processing build completions, and queuing the next set of jobs. Degradation here impacts how quickly a pipeline progresses from one stage to the next.

Symptoms

  • Pipelines appear to “hang” between stages
  • Job completion events are delayed (runner finishes but status doesn’t update in the UI)
  • Cascading delays across multi-stage pipelines
  • If the error rate is elevated: pipeline state transitions silently failing

Possible causes

  • Sidekiq concurrency limits being hit (jobs deferred)
  • Database contention on CI tables during heavy pipeline activity
  • High volume of concurrent pipeline state transitions (e.g., after a mass retry)
  • Application errors in pipeline processing logic
  • Redis latency affecting Sidekiq job dispatch

Queueing apdex

Uses gitlab_sli_sidekiq_queueing_apdex_success_total / gitlab_sli_sidekiq_queueing_apdex_total, filtered to worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker".

  • SLO: 99% apdex
  • MWMBR (multi-window, multi-burn-rate) fires at: < 94% (6h window) / < 85.6% (1h window)
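
As a rough ad-hoc check (the alert itself evaluates recorded rules like those listed under Useful queries below, so this raw-counter sketch is an approximation), the 1h fast-burn condition corresponds to:

# 1h queueing apdex across the pipeline processing workers; fast burn fires below 0.856
sum(rate(gitlab_sli_sidekiq_queueing_apdex_success_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[1h]))
/
sum(rate(gitlab_sli_sidekiq_queueing_apdex_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[1h]))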

Execution apdex

Uses gitlab_sli_sidekiq_execution_apdex_success_total / gitlab_sli_sidekiq_execution_apdex_total.

  • SLO: 99% apdex
  • MWMBR fires at: < 94% (6h window) / < 85.6% (1h window)
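
The same raw-counter sketch applies with the execution counters (assuming the same worker filter as the queueing SLI); for example, the 6h slow-burn condition:

# 6h execution apdex; slow burn fires below 0.94
sum(rate(gitlab_sli_sidekiq_execution_apdex_success_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[6h]))
/
sum(rate(gitlab_sli_sidekiq_execution_apdex_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[6h]))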

Execution error rate

Uses gitlab_sli_sidekiq_execution_error_total / gitlab_sli_sidekiq_execution_total.

  • SLO: 99.95% success rate (errorRatio: 0.9995)
  • MWMBR fires at: > 0.3% error rate (6h window) / > 0.72% (1h window)
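
The equivalent raw-counter sketch for the error ratio (swap [1h] for [6h] and 0.0072 for 0.003 to get the slow-burn condition):

# 1h execution error ratio; fast burn fires above 0.0072 (0.72%)
sum(rate(gitlab_sli_sidekiq_execution_error_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[1h]))
/
sum(rate(gitlab_sli_sidekiq_execution_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[1h]))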

Alert behavior

  • Severity: S3 (Slack-only, no paging)
  • Routes to: #s_verify_alerts
  • MWMBR requires both the short window (5m/1h) and the long window (30m/6h) to breach simultaneously
  • Silencing: safe to silence during known Sidekiq maintenance windows or planned deployments. Use an Alertmanager silence with matchers type=ci-orchestration, component=~pipeline_processing_sidekiq.* (see the amtool sketch after this list)
  • Expected frequency: rare under normal conditions; most likely to fire during database contention or when concurrency limits are hit
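
A minimal silencing sketch using amtool (the Alertmanager URL is a placeholder; duration and comment are examples, adjust to the change window):

# Assumes amtool is installed and can reach your Alertmanager
amtool silence add \
  --alertmanager.url=https://alertmanager.example.com \
  --duration=2h \
  --comment='Planned Sidekiq maintenance' \
  'type=ci-orchestration' \
  'component=~pipeline_processing_sidekiq.*'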

Default severity is S3. Consider upgrading to S2 if:

  • Pipeline processing is completely stalled
  • Ci::BuildFinishedWorker error rate > 5% (job completions not being processed)
  • Multiple worker types affected simultaneously

Useful queries

# Queue duration apdex (5m)
gitlab_component_apdex:ratio_5m{component="pipeline_processing_sidekiq_queueing", type="ci-orchestration", environment="gprd"}
# Execution apdex (5m)
gitlab_component_apdex:ratio_5m{component="pipeline_processing_sidekiq_execution", type="ci-orchestration", environment="gprd"}
# Execution error rate (5m)
gitlab_component_errors:ratio_5m{component="pipeline_processing_sidekiq_execution", type="ci-orchestration", environment="gprd"}
# Concurrency limit deferred jobs (potential cause)
sum by (worker) (rate(sidekiq_concurrency_limit_deferred_jobs_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[5m]))

Check the Pipeline Observability dashboard “Pipeline Processing” section to see which specific worker is degraded.
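
If the dashboard is unavailable, a per-worker breakdown can be approximated from the raw SLI counters (same metric names as in the SLI definitions above); this also shows whether Ci::BuildFinishedWorker is past the 5% escalation threshold:

# Per-worker execution error ratio (5m)
sum by (worker) (rate(gitlab_sli_sidekiq_execution_error_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[5m]))
/
sum by (worker) (rate(gitlab_sli_sidekiq_execution_total{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"}[5m]))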

Pipeline processing workers have concurrency limits. Check the “Pipeline processing workers concurrency” panel on the Pipeline Observability dashboard. If workers are hitting their limit, jobs are deferred instead of executed.
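
For a PromQL view instead of the panel, the deferral counter under Useful queries above gives deferral rates; the backlog gauge below is an assumption (verify the metric name exists in your environment before relying on it):

# Jobs currently held back by the concurrency limit middleware (metric name assumed)
sum by (worker) (sidekiq_concurrency_limit_queue_jobs{worker=~"Ci::InitialPipelineProcessWorker|PipelineProcessWorker|Ci::BuildFinishedWorker|BuildQueueWorker"})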

The “Pipeline processing avg DB duration” panel shows average DB time per worker. If DB duration is elevated (> 5s), the root cause is likely database contention:

  • Check Patroni CI dashboard for lock contention and replication lag
  • Check for long-running transactions on CI tables
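
A quick proxy for long-running transactions, assuming standard postgres_exporter metrics are scraped from the CI database fleet (the type label value is an assumption; adjust to your labelling):

# Age in seconds of the longest-running transaction on the CI database hosts
max(pg_stat_activity_max_tx_duration{type="patroni-ci"})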

A large burst of pipeline retries (e.g., users mass-retrying jobs after a flaky test suite is fixed) can cause a sudden spike in processing load. Check pipelines_created_total for unusual volume.
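
To check for a creation spike (metric name as given above; the environment label mirrors the earlier queries):

# Pipeline creation rate; compare against a typical baseline for the time of day
sum(rate(pipelines_created_total{environment="gprd"}[5m]))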

No past incidents have been recorded yet for this alert. This section will be updated as incidents occur.

Dependencies

  • Sidekiq: Worker execution environment
  • PostgreSQL (CI): Pipeline and job state transitions
  • Redis: Sidekiq job queuing, dequeuing, and concurrency limit tracking

Escalate if

  • The alert persists > 1 hour with no improvement
  • Ci::BuildFinishedWorker is failing (critical — job completions not processed)
  • The alert correlates with customer reports of stuck pipelines

Channels

  • #s_verify_alerts (primary)
  • #g_pipeline-execution (Pipeline Execution team)
  • #production (if S2+ severity)