CI Orchestration Service

ci-orchestration is a virtual service that monitors CI/CD pipeline orchestration metrics emitted by Rails (Sidekiq workers and API endpoints). It has no dedicated infrastructure — it aggregates signals from existing services (sidekiq, api, ci-jobs-api, web) to provide a unified view of pipeline health.

The SLIs are organized around three UX-oriented service boundaries (see the original proposal for details):

| Boundary | User-facing question | SLIs |
| --- | --- | --- |
| ci-job start | “How quickly does my job start after I push?” | pipeline_creation_sidekiq_*, pipeline_processing_sidekiq_* |
| ci-job execution | “How long does my job wait for a runner?” | shared_runner_job_queue_duration, non_shared_runner_job_queue_duration |
| ci-pipeline execution | “Are pipelines failing for infra reasons?” | job_infra_failure_ratio |

| Dashboard | UID | Purpose |
| --- | --- | --- |
| ci-orchestration service overview | ci-orchestration-main | Auto-generated SLI burn rates and error budgets |
| Pipeline Observability | ci-orchestration-pipeline-observability | Operational view: segmented system-vs-customer failures |
| CI Pipeline Reliability SLIs | mgzzp76 | Leadership view: total customer impact |

Since ci-orchestration is a virtual service, troubleshooting typically involves investigating the underlying services:

  1. Check the service overview dashboard for which SLI is degraded
  2. Identify the emitting service — worker SLIs come from Sidekiq, job queue duration from API, failure reasons from all four service types
  3. Follow the relevant service’s runbook for the underlying issue (e.g., Sidekiq queue depth, API latency)
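
Step 2 above can be sketched as a small lookup. This is a minimal illustration, not part of any existing tooling: the function name `emitting_services` and the name-matching rules are assumptions based on the SLI names listed on this page, and "API" is assumed to mean the `api` service.

```python
# Hypothetical helper: maps an SLI name from this page to the service(s)
# to investigate, following the rule of thumb in step 2.
ALL_SERVICES = ("sidekiq", "api", "ci-jobs-api", "web")


def emitting_services(sli: str) -> tuple:
    """Return the underlying services to investigate for a degraded SLI."""
    if "_sidekiq_" in sli:
        # Worker SLIs (pipeline creation and processing) come from Sidekiq.
        return ("sidekiq",)
    if sli.endswith("_job_queue_duration"):
        # Job queue duration comes from the API. Assumption: this means the
        # `api` service; the page does not say whether `ci-jobs-api` is included.
        return ("api",)
    if sli == "job_infra_failure_ratio":
        # Failure reasons are emitted by all four service types.
        return ALL_SERVICES
    raise ValueError(f"SLI not covered by this sketch: {sli}")
```

For example, `emitting_services("pipeline_processing_sidekiq_queueing")` points at the Sidekiq runbook, while `job_infra_failure_ratio` means checking all four services.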

When a customer or SRE reports a symptom, this table maps it to the SLIs and dashboard sections to check first.

| Customer report | What to check first | Likely SLIs |
| --- | --- | --- |
| “Pipelines stuck in created state” | Pipeline Observability dashboard, Pipeline Processing section. The state-machine workers (Ci::InitialPipelineProcessWorker, PipelineProcessWorker, Ci::BuildFinishedWorker, BuildQueueWorker) advance pipelines from created onward; if they are degraded, jobs sit in created. Also check the pipelines_created traffic-cessation alert (zero pipelines created means upstream creation is broken). | pipelines_created, pipeline_processing_sidekiq_queueing, pipeline_processing_sidekiq_execution |
| “Pipelines slow to start after I push” | Pipeline Observability dashboard, Pipeline Creation section. These are the workers that build the pipeline from .gitlab-ci.yml. | pipeline_creation_sidekiq_queue_duration, pipeline_creation_sidekiq_execution |
| “Jobs not picking up / waiting for a runner” | Pipeline Observability dashboard, Job Queueing section. | shared_runner_job_queue_duration, non_shared_runner_job_queue_duration |
| “Pipelines failing for infra reasons” | Pipeline Observability dashboard, Job Execution section. Shows a per-reason breakdown of system-caused failures. | job_infra_failure_ratio |

ci-orchestration SLO violation alerts route to #s_verify_alerts at S3 severity (Slack-only, no paging). The pipelines_created traffic cessation/absent alerts page via PagerDuty at S2 — a drop to zero pipeline creations is a strong outage signal.
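
The routing rule above can be summarized as a short sketch. It is illustrative only and does not reflect how the alerting configuration is actually implemented: the `Route` type and `route_alert` function are hypothetical, and matching on the alert-name prefix and suffix is an assumption based on the alert names on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    severity: str  # "S2" or "S3"
    target: str    # where the alert is delivered
    pages: bool    # True if the alert pages on-call


def route_alert(alert_name: str) -> Route:
    """Classify a ci-orchestration alert by name (hypothetical helper)."""
    if alert_name.startswith("CiOrchestrationServicePipelinesCreatedTraffic"):
        # Traffic cessation/absent alerts page via PagerDuty at S2.
        return Route("S2", "PagerDuty", True)
    if alert_name.endswith("SLOViolation"):
        # SLO violations are Slack-only at S3.
        return Route("S3", "#s_verify_alerts", False)
    raise ValueError(f"not a ci-orchestration alert covered here: {alert_name}")
```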

For triage of worker-health alerts, start with the Pipeline Observability dashboard (segmented system-vs-customer view) rather than the service overview — it surfaces the underlying degradation patterns directly.

| Alert | SLI | Type | Runbook |
| --- | --- | --- | --- |
| CiOrchestrationServicePipelineCreationSidekiqQueueDurationApdexSLOViolation | pipeline_creation_sidekiq_queue_duration | Apdex | Runbook |
| CiOrchestrationServicePipelineCreationSidekiqExecutionApdexSLOViolation | pipeline_creation_sidekiq_execution | Apdex | Runbook |
| CiOrchestrationServicePipelineCreationSidekiqExecutionErrorSLOViolation | pipeline_creation_sidekiq_execution | Error | Runbook |
| CiOrchestrationServicePipelineProcessingSidekiqQueueingApdexSLOViolation | pipeline_processing_sidekiq_queueing | Apdex | Runbook |
| CiOrchestrationServicePipelineProcessingSidekiqExecutionApdexSLOViolation | pipeline_processing_sidekiq_execution | Apdex | Runbook |
| CiOrchestrationServicePipelineProcessingSidekiqExecutionErrorSLOViolation | pipeline_processing_sidekiq_execution | Error | Runbook |
| CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation | job_infra_failure_ratio | Error | Runbook |
| CiOrchestrationServiceSharedRunnerJobQueueDurationApdexSLOViolation | shared_runner_job_queue_duration | Apdex | Runbook |
| CiOrchestrationServiceNonSharedRunnerJobQueueDurationApdexSLOViolation | non_shared_runner_job_queue_duration | Apdex | Runbook |
| CiOrchestrationServicePipelinesCreatedTrafficCessation | pipelines_created | Traffic cessation | Fires when the pipeline creation rate is zero for 30m (with a 1h-prior baseline of ≥ 0.167 ops/s). A drop to zero is a strong signal of a platform-wide pipeline-creation outage. Customers may report pipelines stuck in created state as a downstream symptom; investigate PipelineCreationMetricsWorker health, then the pipeline creation chain in Rails. |
| CiOrchestrationServicePipelinesCreatedTrafficAbsent | pipelines_created | Traffic absent | Fires when the pipelines_created_total signal disappears entirely for 30m. Usually indicates a metrics-pipeline issue (Sidekiq down, Prometheus scrape broken) rather than the underlying service being down, but verify by querying pipelines_created_total directly in Mimir. |
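
The difference between the two pipelines_created alerts can be sketched as follows. This is a simplified model under stated assumptions, not the actual alert rule: the function names are hypothetical, the "1h-prior baseline" is read as the rate measured one hour earlier, and a missing signal is represented as `None`. The threshold of 0.167 ops/s is roughly 10 pipelines per minute.

```python
from typing import Optional

BASELINE_OPS_PER_SEC = 0.167  # threshold from the table; about 10 per minute


def traffic_cessation(rate_last_30m: float, rate_1h_prior: float) -> bool:
    """True when the rate is zero now but was at or above baseline before."""
    return rate_last_30m == 0 and rate_1h_prior >= BASELINE_OPS_PER_SEC


def traffic_absent(rate_last_30m: Optional[float]) -> bool:
    """True when the signal itself is missing, as opposed to reporting zero."""
    return rate_last_30m is None
```

The baseline condition keeps the cessation alert from firing when traffic was already negligible, and a reported zero is deliberately not treated as an absent signal: the first suggests an outage in pipeline creation, the second a gap in metrics collection.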