CI Orchestration Service

Service Overview
Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22ci-orchestration%22%2C%20tier%3D%22sv%22%7D
Label: gitlab-com/gl-infra/production~"Service::CI Orchestration"

Summary

ci-orchestration is a virtual service that monitors CI/CD pipeline orchestration metrics emitted by Rails (Sidekiq workers and API endpoints). It has no dedicated infrastructure — it aggregates signals from existing services (sidekiq, api, ci-jobs-api, web) to provide a unified view of pipeline health.

The SLIs are organized around three UX-oriented service boundaries (see the original proposal for details):

Boundary	User-facing question	SLIs
ci-job start	"How quickly does my job start after I push?"	`pipeline_creation_sidekiq_`, `pipeline_processing_sidekiq_`
ci-job execution	"How long does my job wait for a runner?"	`shared_runner_job_queue_duration`, `non_shared_runner_job_queue_duration`
ci-pipeline execution	"Are pipelines failing for infra reasons?"	`job_infra_failure_ratio`

Observability

Dashboard	UID	Purpose
ci-orchestration service overview	`ci-orchestration-main`	Auto-generated SLI burn rates and error budgets
Pipeline Observability	`ci-orchestration-pipeline-observability`	Operational view — segmented system-vs-customer failures
CI Pipeline Reliability SLIs	`mgzzp76`	Leadership view — total customer impact

Troubleshooting

Since ci-orchestration is a virtual service, troubleshooting typically involves investigating the underlying services:

1. Check the service overview dashboard to see which SLI is degraded.
2. Identify the emitting service: worker SLIs come from Sidekiq, job queue duration comes from API, and failure reasons come from all four service types.
3. Follow the relevant service’s runbook for the underlying issue (e.g., Sidekiq queue depth, API latency).
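
As a rough illustration of step 2, the sketch below maps an SLI name to the services to investigate. The function name `emitting_services` and the prefix/suffix matching are hypothetical, and the text above says only "API" for job queue duration, so the `"api"` string is an assumption (it may equally mean `ci-jobs-api`).

```python
# Hypothetical helper: map a degraded SLI to the underlying services that emit it.
# The grouping follows the troubleshooting steps; names and matching rules are
# illustrative assumptions, not part of any real tooling.

ALL_SERVICES = ("sidekiq", "api", "ci-jobs-api", "web")


def emitting_services(sli: str) -> tuple[str, ...]:
    """Return the services whose runbooks to consult for the given SLI."""
    if sli.startswith(("pipeline_creation_sidekiq_", "pipeline_processing_sidekiq_")):
        return ("sidekiq",)  # worker SLIs come from Sidekiq
    if sli.endswith("_job_queue_duration"):
        return ("api",)  # assumption: "API" in the text maps to the `api` service
    if sli == "job_infra_failure_ratio":
        return ALL_SERVICES  # failure reasons come from all four service types
    raise ValueError(f"unknown SLI: {sli}")


if __name__ == "__main__":
    for name in ("pipeline_processing_sidekiq_execution", "job_infra_failure_ratio"):
        print(name, "->", ", ".join(emitting_services(name)))
```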

Common customer-facing symptoms

When a customer or SRE reports a symptom, this table maps it to the SLIs and dashboard sections to check first.

Customer report	What to check first	Likely SLIs
"Pipelines stuck in `created` state"	Pipeline Observability dashboard, Pipeline Processing section. The state-machine workers (`Ci::InitialPipelineProcessWorker`, `PipelineProcessWorker`, `Ci::BuildFinishedWorker`, `BuildQueueWorker`) advance pipelines from `created` onward; if they are degraded, jobs sit in `created`. Also check the `pipelines_created` traffic-cessation alert (zero pipelines created means upstream creation is broken).	`pipelines_created`, `pipeline_processing_sidekiq_queueing`, `pipeline_processing_sidekiq_execution`
"Pipelines slow to start after I push"	Pipeline Observability dashboard, Pipeline Creation section. These are the workers that build the pipeline from `.gitlab-ci.yml`.	`pipeline_creation_sidekiq_queue_duration`, `pipeline_creation_sidekiq_execution`
"Jobs not picking up / waiting for a runner"	Pipeline Observability dashboard, Job Queueing section.	`shared_runner_job_queue_duration`, `non_shared_runner_job_queue_duration`
"Pipelines failing for infra reasons"	Pipeline Observability dashboard, Job Execution section. Shows a per-reason breakdown of system-caused failures.	`job_infra_failure_ratio`
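
When the starting point is a firing SLI rather than a customer report, the same table can be read in reverse. The following is a minimal sketch of that lookup; the section and SLI names are copied from the table, while `SECTION_SLIS` and `dashboard_section` are hypothetical names used only for illustration.

```python
# Hypothetical reverse lookup: which Pipeline Observability dashboard section
# to open first for a given SLI. Data is transcribed from the symptom table.

SECTION_SLIS = {
    "Pipeline Processing": [
        "pipelines_created",
        "pipeline_processing_sidekiq_queueing",
        "pipeline_processing_sidekiq_execution",
    ],
    "Pipeline Creation": [
        "pipeline_creation_sidekiq_queue_duration",
        "pipeline_creation_sidekiq_execution",
    ],
    "Job Queueing": [
        "shared_runner_job_queue_duration",
        "non_shared_runner_job_queue_duration",
    ],
    "Job Execution": ["job_infra_failure_ratio"],
}


def dashboard_section(sli: str) -> str:
    """Return the dashboard section that the symptom table associates with an SLI."""
    for section, slis in SECTION_SLIS.items():
        if sli in slis:
            return section
    raise KeyError(f"SLI not listed in the symptom table: {sli}")


if __name__ == "__main__":
    print(dashboard_section("shared_runner_job_queue_duration"))
```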

Alerts

ci-orchestration SLO violation alerts route to #s_verify_alerts at S3 severity (Slack-only, no paging). The `pipelines_created` traffic cessation and traffic absent alerts page via PagerDuty at S2, because a drop to zero pipeline creations is a strong outage signal.
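
The routing described above can be summarised as a small decision function. This is only a mental model keyed on the alert-name suffixes that appear in the alert table; it is not the actual Alertmanager configuration, and `Route` and `route_for` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    severity: str
    destination: str
    pages: bool


def route_for(alert_name: str) -> Route:
    """Approximate where a ci-orchestration alert ends up, based on its name suffix."""
    if alert_name.endswith("SLOViolation"):
        # SLO violations: Slack-only, no paging.
        return Route("S3", "#s_verify_alerts", pages=False)
    if alert_name.endswith(("TrafficCessation", "TrafficAbsent")):
        # pipelines_created traffic alerts: page via PagerDuty.
        return Route("S2", "PagerDuty", pages=True)
    raise ValueError(f"not a recognised ci-orchestration alert: {alert_name}")


if __name__ == "__main__":
    print(route_for("CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation"))
    print(route_for("CiOrchestrationServicePipelinesCreatedTrafficCessation"))
```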

For triage of worker-health alerts, start with the Pipeline Observability dashboard (segmented system-vs-customer view) rather than the service overview — it surfaces the underlying degradation patterns directly.

Alert	SLI	Type	Runbook
`CiOrchestrationServicePipelineCreationSidekiqQueueDurationApdexSLOViolation`	`pipeline_creation_sidekiq_queue_duration`	Apdex	Runbook
`CiOrchestrationServicePipelineCreationSidekiqExecutionApdexSLOViolation`	`pipeline_creation_sidekiq_execution`	Apdex	Runbook
`CiOrchestrationServicePipelineCreationSidekiqExecutionErrorSLOViolation`	`pipeline_creation_sidekiq_execution`	Error	Runbook
`CiOrchestrationServicePipelineProcessingSidekiqQueueingApdexSLOViolation`	`pipeline_processing_sidekiq_queueing`	Apdex	Runbook
`CiOrchestrationServicePipelineProcessingSidekiqExecutionApdexSLOViolation`	`pipeline_processing_sidekiq_execution`	Apdex	Runbook
`CiOrchestrationServicePipelineProcessingSidekiqExecutionErrorSLOViolation`	`pipeline_processing_sidekiq_execution`	Error	Runbook
`CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation`	`job_infra_failure_ratio`	Error	Runbook
`CiOrchestrationServiceSharedRunnerJobQueueDurationApdexSLOViolation`	`shared_runner_job_queue_duration`	Apdex	Runbook
`CiOrchestrationServiceNonSharedRunnerJobQueueDurationApdexSLOViolation`	`non_shared_runner_job_queue_duration`	Apdex	Runbook
`CiOrchestrationServicePipelinesCreatedTrafficCessation`	`pipelines_created`	Traffic cessation	Fires when the pipeline creation rate is zero for 30m (with a 1h-prior baseline of ≥ 0.167 ops/s, roughly 10 pipelines per minute). A drop to zero is a strong signal of a platform-wide pipeline-creation outage. Customers may report pipelines stuck in `created` state as a downstream symptom; investigate `PipelineCreationMetricsWorker` health, then the pipeline creation chain in Rails.
`CiOrchestrationServicePipelinesCreatedTrafficAbsent`	`pipelines_created`	Traffic absent	Fires when the `pipelines_created_total` signal disappears entirely for 30m. This usually indicates a metrics-pipeline issue (Sidekiq down, Prometheus scrape broken) rather than the underlying service being down, but verify by querying `pipelines_created_total` directly in Mimir.
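
The difference between the two traffic alerts comes down to "the rate is zero" versus "there is no rate at all". The sketch below illustrates that distinction under simplifying assumptions: the inputs are taken to be rates that have already held for the full 30m window, `None` stands for a missing series, and `classify_traffic` is a hypothetical name. The real alert rules live in the alerting configuration and are not reproduced here.

```python
from typing import Optional

# Baseline threshold from the alert description: 0.167 ops/s, roughly 10 per minute.
BASELINE_MIN_OPS_PER_S = 0.167


def classify_traffic(current_rate: Optional[float], rate_1h_ago: Optional[float]) -> str:
    """Classify the pipelines_created signal as 'absent', 'cessation', or 'ok'.

    current_rate: pipeline creation rate in ops/s, or None if the series is missing.
    rate_1h_ago:  the same rate one hour earlier, or None if it was missing then.
    """
    if current_rate is None:
        # No samples at all: more likely a metrics-pipeline problem than an outage.
        return "absent"
    if (
        current_rate == 0
        and rate_1h_ago is not None
        and rate_1h_ago >= BASELINE_MIN_OPS_PER_S
    ):
        # Traffic existed an hour ago and has dropped to zero.
        return "cessation"
    return "ok"


if __name__ == "__main__":
    print(classify_traffic(0.0, 0.5))
    print(classify_traffic(None, 0.5))
```

The baseline check is what keeps the cessation case from triggering on a signal that was already near zero an hour earlier.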