CI Orchestration Service
- Service Overview
- Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22ci-orchestration%22%2C%20tier%3D%22sv%22%7D
- Label: gitlab-com/gl-infra/production~"Service::CI Orchestration"
Summary
ci-orchestration is a virtual service that monitors CI/CD pipeline orchestration metrics emitted by Rails (Sidekiq workers and API endpoints). It has no dedicated infrastructure; instead, it aggregates signals from existing services (sidekiq, api, ci-jobs-api, web) to provide a unified view of pipeline health.
The SLIs are organized around three UX-oriented service boundaries (see the original proposal for details):
| Boundary | User-facing question | SLIs |
|---|---|---|
| ci-job start | “How quickly does my job start after I push?” | pipeline_creation_sidekiq_*, pipeline_processing_sidekiq_* |
| ci-job execution | “How long does my job wait for a runner?” | shared_runner_job_queue_duration, non_shared_runner_job_queue_duration |
| ci-pipeline execution | “Are pipelines failing for infra reasons?” | job_infra_failure_ratio |
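The wildcard entries in the table can be resolved mechanically. The following is a minimal sketch, not part of the service's tooling: the names `BOUNDARY_PATTERNS` and `boundary_for` are hypothetical, and the patterns are copied from the table above. It returns the service boundary an SLI belongs to, or `None` for SLIs the table does not list (such as `pipelines_created`).

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the boundary table; the dict itself is illustrative.
BOUNDARY_PATTERNS = {
    "ci-job start": [
        "pipeline_creation_sidekiq_*",
        "pipeline_processing_sidekiq_*",
    ],
    "ci-job execution": [
        "shared_runner_job_queue_duration",
        "non_shared_runner_job_queue_duration",
    ],
    "ci-pipeline execution": ["job_infra_failure_ratio"],
}


def boundary_for(sli: str) -> Optional[str]:
    """Return the UX boundary whose patterns match the SLI name, if any."""
    for boundary, patterns in BOUNDARY_PATTERNS.items():
        # fnmatchcase matches the whole name, case-sensitively.
        if any(fnmatchcase(sli, pattern) for pattern in patterns):
            return boundary
    return None
```

For example, `boundary_for("pipeline_processing_sidekiq_queueing")` resolves to the ci-job start boundary.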
Observability
| Dashboard | UID | Purpose |
|---|---|---|
| ci-orchestration service overview | ci-orchestration-main | Auto-generated SLI burn rates and error budgets |
| Pipeline Observability | ci-orchestration-pipeline-observability | Operational view — segmented system-vs-customer failures |
| CI Pipeline Reliability SLIs | mgzzp76 | Leadership view — total customer impact |
Troubleshooting
Since ci-orchestration is a virtual service, troubleshooting typically involves investigating the underlying services:
- Check the service overview dashboard to see which SLI is degraded
- Identify the emitting service — worker SLIs come from Sidekiq, job queue duration from API, failure reasons from all four service types
- Follow the relevant service’s runbook for the underlying issue (e.g., Sidekiq queue depth, API latency)
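The second step above (identifying the emitting service) can be sketched as a small lookup. This is an illustration only: `emitting_services` is a hypothetical helper, the matching rules are inferred from the SLI names in this runbook, and it assumes that "all four service types" means the four services named in the Summary (sidekiq, api, ci-jobs-api, web).

```python
from typing import List

# Assumption: these are the "four service types" referred to above.
ALL_SERVICES = ["sidekiq", "api", "ci-jobs-api", "web"]


def emitting_services(sli: str) -> List[str]:
    """Guess which underlying service(s) emit a ci-orchestration SLI."""
    if "_sidekiq_" in sli:
        # Worker SLIs come from Sidekiq.
        return ["sidekiq"]
    if sli.endswith("_job_queue_duration"):
        # Job queue duration is emitted by the API.
        return ["api"]
    if sli == "job_infra_failure_ratio":
        # Failure reasons are emitted by all four service types.
        return list(ALL_SERVICES)
    # Unknown SLI: this sketch makes no guess.
    return []
```

An empty result means the SLI is not covered by the rules above and the emitting service has to be found another way.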
Common customer-facing symptoms
When a customer or SRE reports a symptom, this table maps it to the SLIs and dashboard sections to check first.
| Customer report | What to check first | Likely SLIs |
|---|---|---|
| “Pipelines stuck in created state” | Pipeline Observability dashboard, Pipeline Processing section. The state-machine workers (Ci::InitialPipelineProcessWorker, PipelineProcessWorker, Ci::BuildFinishedWorker, BuildQueueWorker) advance pipelines from created onward; if they are degraded, jobs sit in created. Also check the pipelines_created traffic-cessation alert (zero pipelines created means upstream creation is broken). | pipelines_created, pipeline_processing_sidekiq_queueing, pipeline_processing_sidekiq_execution |
| “Pipelines slow to start after I push” | Pipeline Observability dashboard, Pipeline Creation section. These are the workers that build the pipeline from .gitlab-ci.yml. | pipeline_creation_sidekiq_queue_duration, pipeline_creation_sidekiq_execution |
| “Jobs not picking up / waiting for a runner” | Pipeline Observability dashboard, Job Queueing section. | shared_runner_job_queue_duration, non_shared_runner_job_queue_duration |
| “Pipelines failing for infra reasons” | Pipeline Observability dashboard, Job Execution section. Shows a per-reason breakdown of system-caused failures. | job_infra_failure_ratio |
Alerts
ci-orchestration SLO violation alerts route to #s_verify_alerts at S3 severity (Slack-only, no paging). The pipelines_created traffic cessation/absent alerts page via PagerDuty at S2, because a drop to zero pipeline creations is a strong outage signal.
For triage of worker-health alerts, start with the Pipeline Observability dashboard (segmented system-vs-customer view) rather than the service overview — it surfaces the underlying degradation patterns directly.
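The routing rule described above can be summarized in a few lines. This is a sketch for orientation only and not the actual alert-routing configuration: `route` and `PAGING_ALERTS` are hypothetical names, and the alert names are taken from the alert table in this section.

```python
# The two pipelines_created alerts that page, per the alert table.
PAGING_ALERTS = {
    "CiOrchestrationServicePipelinesCreatedTrafficCessation",
    "CiOrchestrationServicePipelinesCreatedTrafficAbsent",
}


def route(alert_name: str) -> dict:
    """Return severity and destination for a ci-orchestration alert."""
    if alert_name in PAGING_ALERTS:
        # Traffic cessation/absent alerts page at S2.
        return {"severity": "S2", "destination": "PagerDuty", "pages": True}
    if alert_name.startswith("CiOrchestrationService") and alert_name.endswith(
        "SLOViolation"
    ):
        # SLO violations are Slack-only at S3.
        return {"severity": "S3", "destination": "#s_verify_alerts", "pages": False}
    raise ValueError(f"not a known ci-orchestration alert: {alert_name}")
```

In practice this means a message in #s_verify_alerts never implies that someone was paged; only the two pipelines_created alerts do.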
| Alert | SLI | Type | Runbook |
|---|---|---|---|
| CiOrchestrationServicePipelineCreationSidekiqQueueDurationApdexSLOViolation | pipeline_creation_sidekiq_queue_duration | Apdex | Runbook |
| CiOrchestrationServicePipelineCreationSidekiqExecutionApdexSLOViolation | pipeline_creation_sidekiq_execution | Apdex | Runbook |
| CiOrchestrationServicePipelineCreationSidekiqExecutionErrorSLOViolation | pipeline_creation_sidekiq_execution | Error | Runbook |
| CiOrchestrationServicePipelineProcessingSidekiqQueueingApdexSLOViolation | pipeline_processing_sidekiq_queueing | Apdex | Runbook |
| CiOrchestrationServicePipelineProcessingSidekiqExecutionApdexSLOViolation | pipeline_processing_sidekiq_execution | Apdex | Runbook |
| CiOrchestrationServicePipelineProcessingSidekiqExecutionErrorSLOViolation | pipeline_processing_sidekiq_execution | Error | Runbook |
| CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation | job_infra_failure_ratio | Error | Runbook |
| CiOrchestrationServiceSharedRunnerJobQueueDurationApdexSLOViolation | shared_runner_job_queue_duration | Apdex | Runbook |
| CiOrchestrationServiceNonSharedRunnerJobQueueDurationApdexSLOViolation | non_shared_runner_job_queue_duration | Apdex | Runbook |
| CiOrchestrationServicePipelinesCreatedTrafficCessation | pipelines_created | Traffic cessation | Fires when the pipeline creation rate is zero for 30m (with a 1h-prior baseline of ≥ 0.167 ops/s). A drop to zero is a strong signal of a platform-wide pipeline-creation outage. Customers may report pipelines stuck in created state as a downstream symptom; investigate PipelineCreationMetricsWorker health, then the pipeline creation chain in Rails. |
| CiOrchestrationServicePipelinesCreatedTrafficAbsent | pipelines_created | Traffic absent | Fires when the pipelines_created_total signal disappears entirely for 30m. Usually indicates a metrics-pipeline issue (Sidekiq down, Prometheus scrape broken) rather than the underlying service being down, but verify by querying pipelines_created_total directly in Mimir. |
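The difference between the two pipelines_created alerts is easy to confuse, so here is a minimal sketch of the distinction. It is a simplified model rather than the real alert expressions: `classify_pipelines_created` is a hypothetical function, the 30m window is represented as a list of rate samples in which `None` stands for a missing sample, and only the 0.167 ops/s baseline threshold (about 10 pipelines per minute) comes from the table above.

```python
from typing import Optional, Sequence

# Baseline threshold quoted in the alert table, in operations per second.
BASELINE_MIN_OPS_PER_SEC = 0.167


def classify_pipelines_created(
    window_rates: Sequence[Optional[float]],
    baseline_rate: Optional[float],
) -> str:
    """Classify a 30m window of pipeline creation rates.

    window_rates: one rate sample per evaluation step; None = no sample.
    baseline_rate: the creation rate observed one hour earlier, if known.
    """
    present = [rate for rate in window_rates if rate is not None]
    if not present:
        # The signal itself is gone: likely a metrics-pipeline problem.
        return "traffic_absent"
    signal_complete = len(present) == len(window_rates)
    all_zero = all(rate == 0 for rate in present)
    had_traffic = (
        baseline_rate is not None and baseline_rate >= BASELINE_MIN_OPS_PER_SEC
    )
    if signal_complete and all_zero and had_traffic:
        # The signal exists but reports zero after real traffic: likely an outage.
        return "traffic_cessation"
    return "ok"
```

The baseline check is what keeps the cessation case from triggering when there was not enough traffic to begin with, while the absent case does not depend on the baseline at all.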