CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation
Overview
This alert fires when the infra-attributable share of CI job failures exceeds its SLO burn rate threshold, indicating that an unusually high proportion of job failures is caused by infrastructure issues rather than user error.
The `job_infra_failure_ratio` SLI uses `gitlab_ci_job_failure_reasons` as both numerator and denominator: the error rate is the fraction of total job failures where the `reason` label is not in the user/external-attributable exclusion list (e.g., `script_failure`, `ci_quota_exceeded`, `no_matching_runner`). The remaining reasons (`runner_system_failure`, `stuck_or_timeout_failure`, `unknown_failure`, etc.) are counted as infra-attributable.
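For orientation, here is a minimal sketch of the ratio computed from the raw counter. This is not the production rule: the real SLI uses pre-aggregated recording rules (see Verification), and the exclusion regex is abbreviated to three reasons here; the full 27-reason list appears under "Excluded reasons" below.

```promql
# Sketch: infra-attributable share of all job failures.
# The exclusion regex is abbreviated; use the full 27-reason list in practice.
sum(rate(gitlab_ci_job_failure_reasons{reason!~"script_failure|ci_quota_exceeded|no_matching_runner"}[5m]))
/
sum(rate(gitlab_ci_job_failure_reasons[5m]))
```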
Impact
- CI jobs failing for reasons outside user control
- Reduced pipeline success rates across the platform
- User frustration and wasted compute (retries)
- Potential data integrity issues if `data_integrity_failure` is elevated
Contributing Factors
- `runner_system_failure`: Runner panics, Kubernetes pod disruptions, trace patch failures
- `unknown_failure`: Catch-all for unrecognized failure reasons (often runner-side)
- `stuck_or_timeout_failure`: Jobs pending > 24h (mostly user misconfiguration, but includes genuinely stuck jobs)
- `data_integrity_failure`: Internal consistency errors
- `scheduler_failure`: Job scheduling errors
- Gitaly overload causing clone/fetch failures in jobs
- Registry/Dependency Proxy issues causing `failed_to_pull_image` (classified as `runner_system_failure`)
Services
- ci-orchestration service overview
- Pipeline Observability dashboard
- Team: Verify
- Slack: `#s_verify_alerts`
Metrics
The SLI uses the same metric for both request rate and error rate:
- Request rate: `rate(gitlab_ci_job_failure_reasons[5m])` — all job failures
- Error rate: `rate(gitlab_ci_job_failure_reasons{reason!~"<27 excluded reasons>"}[5m])` — infra-attributable failures only
- Emitted by: `api`, `ci-jobs-api`, `sidekiq`, `web`
- SLO: 99.9% success rate (errorRatio: 0.999, meaning a max 0.1% infra failure share)
- MWMBR fires at: > 0.6% infra share (6h window) / > 1.44% (1h window); see the sketch below
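These thresholds follow the standard multiwindow multi-burn-rate arithmetic: with a 0.1% error budget, a 14.4x burn rate over 1h gives 14.4 × 0.1% = 1.44%, and a 6x burn rate over 6h gives 6 × 0.1% = 0.6%. A sketch of the combined condition, assuming the 30m/1h/6h recording rules follow the same naming as the `ratio_5m` rule shown under Verification:

```promql
# Multiwindow multi-burn-rate condition (sketch).
# 99.9% SLO => 0.1% error budget; 14.4 * 0.1% = 1.44%, 6 * 0.1% = 0.6%.
# The ratio_30m/ratio_1h/ratio_6h rule names are assumptions mirroring ratio_5m.
(
    gitlab_component_errors:ratio_1h{component="job_infra_failure_ratio", type="ci-orchestration"} > 14.4 * 0.001
  and
    gitlab_component_errors:ratio_5m{component="job_infra_failure_ratio", type="ci-orchestration"} > 14.4 * 0.001
)
or
(
    gitlab_component_errors:ratio_6h{component="job_infra_failure_ratio", type="ci-orchestration"} > 6 * 0.001
  and
    gitlab_component_errors:ratio_30m{component="job_infra_failure_ratio", type="ci-orchestration"} > 6 * 0.001
)
```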
Excluded reasons (user/external-attributable)
script_failure, ci_quota_exceeded, builds_disabled, user_blocked, stale_schedule, forward_deployment_failure, failed_outdated_deployment_job, api_failure, downstream_pipeline_creation_failed, downstream_bridge_project_not_found, insufficient_bridge_permissions, protected_environment_failure, no_matching_runner, runner_unsupported, secrets_provider_not_found, ip_restriction_failure, deployment_rejected, duo_workflow_not_allowed, invalid_bridge_trigger, job_token_expired, pipeline_loop_detected, reached_max_descendant_pipelines_depth, trace_size_exceeded, unmet_prerequisites, upstream_bridge_project_not_found, job_execution_timeout, missing_dependency_failure
Alert Behavior
- Severity: S3 (Slack-only, no paging)
- Routes to: `#s_verify_alerts`
- MWMBR requires both the short window (5m/1h) and the long window (30m/6h) to breach simultaneously
- The ratio can spike during incidents affecting runners or Gitaly
- Silencing: Safe to silence during known runner fleet maintenance or Gitaly incidents where the root cause is already being addressed. Use an Alertmanager silence with matchers `type=ci-orchestration, component=job_infra_failure_ratio`
- Expected frequency: May fire during infrastructure incidents. Under normal conditions, the infra failure share is well below 0.1%
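To confirm whether the alert is currently pending or firing (for example, before applying or after expiring a silence), you can query Prometheus's built-in `ALERTS` series; the alertname below is assumed to match this runbook's title:

```promql
# Is this alert pending or firing right now?
# ALERTS is a built-in series Prometheus exposes for every alerting rule.
ALERTS{alertname="CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation"}
```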
Severities
Default severity is S3. Consider upgrading to S2 if:
- Infra failure ratio > 5% sustained for > 30 minutes
- A single failure reason dominates (e.g., a `runner_system_failure` spike indicating a fleet-wide runner issue)
- Correlated with multiple customer reports
Verification
```promql
# Current infra failure ratio (pre-aggregated)
gitlab_component_errors:ratio_5m{component="job_infra_failure_ratio", type="ci-orchestration", environment="gprd"}
```

```promql
# Breakdown by infra failure reason
sum by (reason) (sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m{environment="gprd", reason!~"script_failure|ci_quota_exceeded|builds_disabled|user_blocked|stale_schedule|forward_deployment_failure|failed_outdated_deployment_job|api_failure|downstream_pipeline_creation_failed|downstream_bridge_project_not_found|insufficient_bridge_permissions|protected_environment_failure|no_matching_runner|runner_unsupported|secrets_provider_not_found|ip_restriction_failure|deployment_rejected|duo_workflow_not_allowed|invalid_bridge_trigger|job_token_expired|pipeline_loop_detected|reached_max_descendant_pipelines_depth|trace_size_exceeded|unmet_prerequisites|upstream_bridge_project_not_found|job_execution_timeout|missing_dependency_failure"})
```

- ci-orchestration service overview dashboard — burn rate panels
- Pipeline Observability dashboard — Job Execution section — per-reason breakdown
Recent Changes
Troubleshooting
1. Identify the Dominant Failure Reason
Check the "Job failures - system-caused" panel on the Pipeline Observability dashboard. The top reason by volume tells you where to investigate (a query sketch follows the table):
| Reason | Investigate |
|---|---|
| `runner_system_failure` | Runner fleet health, Kubernetes node issues, runner manager logs |
| `unknown_failure` | Runner-side issues (failure reason not recognized by the server) |
| `stuck_or_timeout_failure` | Stuck builds cron, runner tag mismatches, plan-gating issues |
| `data_integrity_failure` | Database consistency, recent migrations |
| `scheduler_failure` | Sidekiq scheduling issues |
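If the dashboard is unavailable, a rough PromQL equivalent (the exclusion regex is abbreviated here; substitute the full list from the Verification query):

```promql
# Top infra-attributable failure reasons by current rate (sketch;
# the exclusion regex is abbreviated, use the full 27-reason list).
topk(5,
  sum by (reason) (
    sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m{environment="gprd", reason!~"script_failure|ci_quota_exceeded|no_matching_runner"}
  )
)
```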
2. Check Runner Fleet Health
- CI Runners dashboard for runner availability
- Look for node-level issues, pod evictions, or autoscaling problems
3. Check Gitaly Health
Gitaly overload causes job clone/fetch failures that surface as `runner_system_failure`:
- Gitaly dashboard for latency and error rates
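As a rough supplementary check, assuming Gitaly exposes the standard go-grpc-prometheus metrics (the `type="gitaly"` selector and label names here are assumptions and may differ in your environment):

```promql
# Rough Gitaly error-rate check by RPC method.
# Assumes go-grpc-prometheus instrumentation; selectors are assumptions.
sum by (grpc_method) (
  rate(grpc_server_handled_total{type="gitaly", grpc_code!~"OK|NotFound"}[5m])
)
```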
4. Check Registry/Dependency Proxy
Image pull failures also surface as `runner_system_failure`:
- Registry dashboard for error rates
5. Check for Incident Correlation
Infrastructure incidents (Gitaly DDoS, runner fleet issues, database problems) often cause spikes in this SLI. Check `#production` and `#incident-management` for ongoing incidents.
Possible Resolutions
No past incidents have been recorded yet for this alert. This section will be updated as incidents occur.
Dependencies
- CI Runners: Runner fleet availability and health
- Gitaly: Git clone/fetch operations within jobs
- Container Registry: Image pulls for job containers
- PostgreSQL (CI): Job state recording
Escalation
Section titled “Escalation”When to Escalate
- Single failure reason > 20% of total failures (see the query sketch after this list)
- `runner_system_failure` spike correlated with runner fleet degradation
- Alert persists > 1 hour with no identified cause
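A sketch for the first criterion, computing each reason's share of all job failures; to focus on infra-attributable reasons only, add the exclusion regex from the Verification query to the numerator:

```promql
# Share of total job failures per reason; escalation-worthy if any single
# infra-attributable reason exceeds 0.2 (20%).
  sum by (reason) (rate(gitlab_ci_job_failure_reasons[5m]))
/ ignoring(reason) group_left
  sum(rate(gitlab_ci_job_failure_reasons[5m]))
> 0.2
```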
Support Channels
- `#s_verify_alerts` (primary)
- `#g_runner` (Runner team — for `runner_system_failure`)
- `#g_pipeline-execution` (Pipeline Execution team)
- `#production` (if S2+ severity)