CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation
Overview
This alert fires when the infra-attributable share of CI job failures exceeds its SLO burn rate threshold, indicating that an unusually high proportion of job failures are caused by infrastructure issues rather than user errors.
The job_infra_failure_ratio SLI uses gitlab_ci_job_failure_reasons as both numerator and denominator: the error rate is the fraction of total job failures whose reason label matches an explicit positive list of infra-attributable reasons (e.g., runner_system_failure, scheduler_failure, data_integrity_failure). All other reasons — user/external errors like script_failure, and ambiguous reasons like unknown_failure — only contribute to the denominator. They are tracked separately in the Pipeline Observability dashboard’s “Job failures - others” panel for anomaly detection.
The positive list lives in metrics-catalog/services/lib/ci-job-failure-reasons.libsonnet (systemCausedReasons) and is the single source of truth used by both the SLI and the dashboard.
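As a rough illustration of the ratio described above, the following sketch computes the infra-attributable share from per-reason failure counts. The reason set is copied from this runbook; the function name, the zero-total handling, and the sample counts are assumptions made for the example, not part of the metrics catalog.

```python
# Hypothetical helper: reason names are copied from this runbook's
# systemCausedReasons list; everything else is illustrative.
SYSTEM_CAUSED_REASONS = frozenset({
    "runner_system_failure",
    "scheduler_failure",
    "stuck_pending_with_matching_runners",
    "no_updates_running",
    "no_updates_canceling",
    "data_integrity_failure",
    "environment_creation_failure",
    "job_router_failure",
    "stale_schedule",
    "stuck_or_timeout_failure",  # legacy
})


def infra_failure_share(failures_by_reason: dict[str, float]) -> float:
    """Return infra-attributable failures divided by all job failures."""
    total = sum(failures_by_reason.values())
    if total == 0:
        return 0.0  # assumption: with no failures at all, report a zero share
    infra = sum(
        count
        for reason, count in failures_by_reason.items()
        if reason in SYSTEM_CAUSED_REASONS
    )
    return infra / total


# Made-up counts, for illustration only.
sample = {
    "script_failure": 900,        # denominator only
    "unknown_failure": 40,        # denominator only
    "runner_system_failure": 50,  # numerator and denominator
    "scheduler_failure": 10,      # numerator and denominator
}
share = infra_failure_share(sample)
```

With these sample counts, 60 of 1000 failures are infra-attributable, so the share sits just above the 5% SLO ceiling.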
Impact
- CI jobs failing for reasons outside user control
- Reduced pipeline success rates across the platform
- User frustration and wasted compute (retries)
- Potential data integrity issues if data_integrity_failure is elevated
Contributing Factors
Each reason in systemCausedReasons is listed with a typical underlying cause. The last two entries are upstream causes that surface as runner_system_failure:
- runner_system_failure: Runner panics, Kubernetes pod disruptions, trace patch failures, network egress issues
- scheduler_failure: Sidekiq job-scheduling failures
- data_integrity_failure: Internal consistency errors
- environment_creation_failure: Failure to create a deployment environment
- job_router_failure: Internal Job Router service failures
- stuck_pending_with_matching_runners: Jobs stuck pending despite available matching runners
- no_updates_running / no_updates_canceling: Job state machine not receiving heartbeats
- stale_schedule: Delayed jobs (when: delayed) left in scheduled state > 1h past their scheduled_at, dropped by Ci::StuckBuilds::DropScheduledService
- stuck_or_timeout_failure (legacy): retained for backward compatibility, covering pre-19.0 in-flight jobs and historical data (gitlab#595752)
- Gitaly overload causing clone/fetch failures (surfaces as runner_system_failure)
- Registry/Dependency Proxy issues causing image-pull failures (surfaces as runner_system_failure)
Services
- ci-orchestration service overview
- Pipeline Observability dashboard
- Team: Verify
- Slack: #s_verify_alerts
Metrics
The SLI uses the same metric for both request rate and error rate:
- Request rate: rate(gitlab_ci_job_failure_reasons[5m]), which counts all job failures
- Error rate: rate(gitlab_ci_job_failure_reasons{reason=~"<10 infra-attributable reasons>"}[5m]), which counts failures whose reason matches the positive include list (source of truth: services/lib/ci-job-failure-reasons.libsonnet)
- Emitted by: api, sidekiq, web
- SLO: 95% success rate (errorRatio: 0.95, meaning max 5% infra failure share)
- Multi-window, multi-burn-rate (MWMBR) alert fires at: > 30% infra share (6h window) / > 72% (1h window; in practice dominated by the 6h window)
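The two firing thresholds follow from the 5% ceiling multiplied by a burn-rate factor. The sketch below reproduces that arithmetic. The factors 14.4 and 6 are an assumption: they are the values implied by the 72% and 30% figures quoted here, not read from the alert rule itself, and the names are hypothetical.

```python
ERROR_BUDGET = 0.05  # from this runbook: max 5% infra failure share

# Assumed burn-rate factors, inferred from the 72% (1h) and 30% (6h)
# thresholds quoted in this runbook rather than from the rule definition.
BURN_RATE_FACTORS = {"1h": 14.4, "6h": 6.0}


def firing_threshold(window: str) -> float:
    """Infra failure share above which the given long window breaches."""
    return BURN_RATE_FACTORS[window] * ERROR_BUDGET


# Rounded to absorb floating-point noise in the multiplication.
thresholds = {w: round(firing_threshold(w), 4) for w in BURN_RATE_FACTORS}
```

Under these assumptions the 1h window breaches above a 72% infra share and the 6h window above 30%, which is why the lower 6h threshold tends to be the one reached first during a sustained problem.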
Counted reasons (infra-attributable)
The error rate counter matches these reasons, defined in services/lib/ci-job-failure-reasons.libsonnet as systemCausedReasons:
runner_system_failure, scheduler_failure, stuck_pending_with_matching_runners, no_updates_running, no_updates_canceling, data_integrity_failure, environment_creation_failure, job_router_failure, stale_schedule, stuck_or_timeout_failure (legacy — kept for backward compatibility)
Reasons NOT counted
Every other reason value is treated as non-infra and contributes only to the denominator. Notable examples:
- User/external-attributable: script_failure, ci_quota_exceeded, no_matching_runner, runner_unsupported, job_execution_timeout, missing_dependency_failure, etc.
- Ambiguous: unknown_failure (catch-all for unrecognised runner-side failure reasons)
- Runner-classified (introduced via gitlab#595703): runner_configuration_error, runner_external_dependency_failure, runner_interrupted
These appear in the Pipeline Observability dashboard’s “Job failures - others” panel for separate anomaly detection.
Alert Behavior
- Severity: S3 (Slack-only, no paging)
- Routes to: #s_verify_alerts
- MWMBR requires both the short window (5m/1h) and long window (30m/6h) to breach simultaneously
- The ratio can spike during incidents affecting runners or Gitaly
- Silencing: Safe to silence during known runner fleet maintenance or Gitaly incidents where the root cause is already being addressed. Use Alertmanager silence with matchers type=ci-orchestration, component=job_infra_failure_ratio
- Expected frequency: May fire during infrastructure incidents. Under normal conditions, the infra failure share is well below the 5% SLO ceiling
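The window requirement can be sketched as a small predicate. This is one reading, assuming the usual multi-window, multi-burn-rate pattern: the 1h window is paired with the 5m window at the 72% threshold, the 6h window is paired with the 30m window at the 30% threshold, both windows in a pair must breach together, and either pair is enough to fire. The class and function names are hypothetical, and the actual rule definition remains authoritative.

```python
from dataclasses import dataclass

# Thresholds quoted in this runbook for the 1h and 6h windows.
FAST_THRESHOLD = 0.72  # 1h window, assumed to be paired with 5m
SLOW_THRESHOLD = 0.30  # 6h window, assumed to be paired with 30m


@dataclass(frozen=True)
class WindowRatios:
    """Infra failure share (0.0 to 1.0) observed over each window."""
    m5: float
    m30: float
    h1: float
    h6: float


def alert_fires(r: WindowRatios) -> bool:
    """Assumed logic: both windows of a pair must breach; either pair fires."""
    fast_pair = r.h1 > FAST_THRESHOLD and r.m5 > FAST_THRESHOLD
    slow_pair = r.h6 > SLOW_THRESHOLD and r.m30 > SLOW_THRESHOLD
    return fast_pair or slow_pair


# Illustrative values only.
sustained = WindowRatios(m5=0.35, m30=0.34, h1=0.33, h6=0.31)
recovered = WindowRatios(m5=0.02, m30=0.04, h1=0.20, h6=0.31)
```

In this sketch `sustained` fires through the 6h/30m pair, while `recovered` does not: its 6h window is still above 30%, but the 30m window has already dropped, which is the situation the paired short window is meant to filter out.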
Severities
Default severity is S3. Consider upgrading to S2 if:
- Infra failure ratio > 5% sustained for > 30 minutes
- A single failure reason dominates (e.g., runner_system_failure spike indicating fleet-wide runner issue)
- Correlated with multiple customer reports
Verification
# Current infra failure ratio (pre-aggregated)
gitlab_component_errors:ratio_5m{component="job_infra_failure_ratio", type="ci-orchestration", environment="gprd"}

# Breakdown by infra failure reason (matches the SLI's positive include list)
sum by (reason) (sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m{environment="gprd", reason=~"runner_system_failure|scheduler_failure|stuck_pending_with_matching_runners|no_updates_running|no_updates_canceling|data_integrity_failure|environment_creation_failure|job_router_failure|stale_schedule|stuck_or_timeout_failure"})

# All other reasons (non-infra), for context
sum by (reason) (sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m{environment="gprd", reason!~"runner_system_failure|scheduler_failure|stuck_pending_with_matching_runners|no_updates_running|no_updates_canceling|data_integrity_failure|environment_creation_failure|job_router_failure|stale_schedule|stuck_or_timeout_failure"})

- ci-orchestration service overview dashboard: burn rate panels
- Pipeline Observability dashboard, Job Execution section: per-reason breakdown
Recent Changes
Troubleshooting
1. Identify the Dominant Failure Reason
Check the “Job failures - system-caused” panel on the Pipeline Observability dashboard. The top reason by volume tells you where to investigate:
| Reason | Investigate |
|---|---|
| runner_system_failure | Runner fleet health, Kubernetes node issues, runner manager logs, network egress |
| scheduler_failure | Sidekiq scheduling issues |
| data_integrity_failure | Database consistency, recent migrations |
| environment_creation_failure | Deployment-environment APIs, infrastructure provisioning |
| job_router_failure | Internal Job Router service |
| stuck_pending_with_matching_runners / no_updates_running / no_updates_canceling | Runner heartbeat / job-state-machine issues |
| stale_schedule | Delayed jobs (when: delayed) left in scheduled state > 1h past their scheduled_at, dropped by Ci::StuckBuilds::DropScheduledService. Check the stuck-builds cron and BuildScheduleWorker processing |
| stuck_or_timeout_failure (legacy) | Stuck builds cron, runner tag mismatches, plan-gating issues (pre-19.0 jobs only) |

unknown_failure is not counted by this SLI. If it dominates on the “Job failures - others” panel, investigate runner-side failure reasons not recognised by Rails (often pointing to a new Runner version emitting an unmapped string).
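The lookup above can be sketched as a small helper that picks the highest-rate infra reason and returns the matching investigation hint. The hints are abbreviated from the table; the function name and the sample rates are hypothetical and stand in for whatever the per-reason query returns.

```python
from typing import Optional

# Hints abbreviated from the table in this runbook.
INVESTIGATE = {
    "runner_system_failure": "Runner fleet health, Kubernetes node issues, runner manager logs, network egress",
    "scheduler_failure": "Sidekiq scheduling issues",
    "data_integrity_failure": "Database consistency, recent migrations",
    "environment_creation_failure": "Deployment-environment APIs, infrastructure provisioning",
    "job_router_failure": "Internal Job Router service",
    "stuck_pending_with_matching_runners": "Runner heartbeat / job-state-machine issues",
    "no_updates_running": "Runner heartbeat / job-state-machine issues",
    "no_updates_canceling": "Runner heartbeat / job-state-machine issues",
    "stale_schedule": "Stuck-builds cron and BuildScheduleWorker processing",
    "stuck_or_timeout_failure": "Stuck builds cron, runner tag mismatches, plan-gating issues (pre-19.0 jobs only)",
}


def dominant_infra_reason(rates_by_reason: dict[str, float]) -> Optional[tuple[str, str]]:
    """Return (reason, hint) for the highest-rate infra reason, or None."""
    # Reasons outside the positive list (script_failure, unknown_failure, ...)
    # are ignored here because the SLI does not count them as errors.
    infra = {r: v for r, v in rates_by_reason.items() if r in INVESTIGATE}
    if not infra:
        return None
    top = max(infra, key=infra.get)
    return top, INVESTIGATE[top]


# Made-up per-second rates, for illustration only.
rates = {
    "script_failure": 12.0,
    "unknown_failure": 3.0,
    "runner_system_failure": 1.5,
    "scheduler_failure": 0.2,
}
result = dominant_infra_reason(rates)
```

Here `script_failure` has the highest rate overall, but it is skipped because it is not infra-attributable, so the helper points at `runner_system_failure`.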
2. Check Runner Fleet Health
- CI Runners dashboard for runner availability
- Look for node-level issues, pod evictions, or autoscaling problems
3. Check Gitaly Health
Gitaly overload causes job clone/fetch failures that surface as runner_system_failure:
- Gitaly dashboard for latency and error rates
4. Check Registry/Dependency Proxy
Image pull failures also surface as runner_system_failure:
- Registry dashboard for error rates
5. Check for Incident Correlation
Infrastructure incidents (Gitaly DDoS, runner fleet issues, database problems) often cause spikes in this SLI. Check #production and #incident-management for ongoing incidents.
Possible Resolutions
No past incidents have been recorded yet for this alert. This section will be updated as incidents occur.
Dependencies
- CI Runners: Runner fleet availability and health
- Gitaly: Git clone/fetch operations within jobs
- Container Registry: Image pulls for job containers
- PostgreSQL (CI): Job state recording
Escalation
When to Escalate
- Single failure reason > 20% of total failures
- runner_system_failure spike correlated with runner fleet degradation
- Alert persists > 1 hour with no identified cause
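The first criterion can be checked mechanically from per-reason counts. The sketch below is a hedged illustration: it assumes the 20% test applies only to infra-attributable reasons (a user-side reason such as script_failure could otherwise exceed 20% routinely), and the function name and sample counts are hypothetical.

```python
# Reason names copied from this runbook's systemCausedReasons list.
SYSTEM_CAUSED_REASONS = frozenset({
    "runner_system_failure", "scheduler_failure",
    "stuck_pending_with_matching_runners", "no_updates_running",
    "no_updates_canceling", "data_integrity_failure",
    "environment_creation_failure", "job_router_failure",
    "stale_schedule", "stuck_or_timeout_failure",
})


def reasons_over_escalation_share(
    failures_by_reason: dict[str, float], limit: float = 0.20
) -> list[str]:
    """List infra reasons whose share of ALL failures exceeds `limit`."""
    total = sum(failures_by_reason.values())
    if total == 0:
        return []
    return sorted(
        reason
        for reason, count in failures_by_reason.items()
        # Assumption: only infra-attributable reasons trigger escalation.
        if reason in SYSTEM_CAUSED_REASONS and count / total > limit
    )


# Made-up counts, for illustration only.
counts = {
    "script_failure": 700,
    "runner_system_failure": 250,
    "scheduler_failure": 50,
}
over_limit = reasons_over_escalation_share(counts)
```

With these counts, runner_system_failure accounts for 25% of all failures and is flagged, while scheduler_failure at 5% is not.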
Support Channels
- #s_verify_alerts (primary)
- #g_runner (Runner team, for runner_system_failure)
- #g_pipeline-execution (Pipeline Execution team)
- #production (if S2+ severity)