CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation

Overview

This alert fires when the infra-attributable share of CI job failures exceeds its SLO burn rate threshold, indicating that an unusually high proportion of job failures are caused by infrastructure issues rather than user errors.

The job_infra_failure_ratio SLI uses gitlab_ci_job_failure_reasons as both numerator and denominator: the error rate is the fraction of total job failures where the reason label is not in the user/external-attributable exclusion list (e.g., script_failure, ci_quota_exceeded, no_matching_runner). The remaining reasons (runner_system_failure, stuck_or_timeout_failure, unknown_failure, etc.) are counted as infra-attributable.
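As a rough illustration of the ratio described above, the following Python sketch computes the infra-attributable share from per-reason failure counts. The counts are invented for the example, the function name `infra_failure_ratio` is hypothetical, and the exclusion set is abbreviated to the three reasons named in the paragraph; the real SLI is evaluated in PromQL against `gitlab_ci_job_failure_reasons`.

```python
# Abbreviated exclusion set (assumption: only the three reasons named above;
# the production list has 27 entries).
EXCLUDED_REASONS = {"script_failure", "ci_quota_exceeded", "no_matching_runner"}


def infra_failure_ratio(failures_by_reason):
    """Return the share of all job failures whose reason is not excluded."""
    total = sum(failures_by_reason.values())
    if total == 0:
        return 0.0  # no failures at all, so nothing to attribute
    infra = sum(
        count
        for reason, count in failures_by_reason.items()
        if reason not in EXCLUDED_REASONS
    )
    return infra / total


# Hypothetical failure counts over some window.
sample = {
    "script_failure": 9900,
    "ci_quota_exceeded": 80,
    "runner_system_failure": 15,
    "unknown_failure": 5,
}
ratio = infra_failure_ratio(sample)
print(f"infra failure share: {ratio:.2%}")
```

With these made-up numbers, 20 of 10,000 failures are infra-attributable, which gives a share of 0.2%.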

Impact

CI jobs failing for reasons outside user control
Reduced pipeline success rates across the platform
User frustration and wasted compute (retries)
Potential data integrity issues if data_integrity_failure is elevated

Contributing Factors

runner_system_failure: Runner panics, Kubernetes pod disruptions, trace patch failures
unknown_failure: Catch-all for unrecognized failure reasons (often runner-side)
stuck_or_timeout_failure: Jobs pending > 24h (mostly user misconfiguration, but includes genuinely stuck jobs)
data_integrity_failure: Internal consistency errors
scheduler_failure: Job scheduling errors
Gitaly overload causing clone/fetch failures in jobs
Registry/Dependency Proxy issues causing failed_to_pull_image (classified as runner_system_failure)

Services

Metrics

The SLI uses the same metric for both request rate and error rate:

Request rate: rate(gitlab_ci_job_failure_reasons[5m]) — all job failures
Error rate: rate(gitlab_ci_job_failure_reasons{reason!~"<27 excluded reasons>"}[5m]) — infra-attributable failures only
Emitted by: api, ci-jobs-api, sidekiq, web
SLO: 99.9% success rate (configured as errorRatio: 0.999, which allows an infra failure share of at most 0.1%)
MWMBR fires at: > 0.6% infra share (6h window) / > 1.44% (1h window)
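The thresholds above follow from the error budget multiplied by a burn-rate factor. The sketch below reproduces that arithmetic; the factors 14.4 and 6 are not stated in this runbook but are implied by the quoted figures (1.44% / 0.1% and 0.6% / 0.1%), and the names are illustrative only.

```python
# SLO target from this runbook; the error budget is what remains.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # roughly 0.001, i.e. 0.1%

# Burn-rate factors implied by the thresholds quoted above (assumption).
BURN_FACTORS = {"1h": 14.4, "6h": 6.0}


def burn_threshold(window):
    """Infra failure share above which the given window is breaching."""
    return BURN_FACTORS[window] * ERROR_BUDGET


for window in ("1h", "6h"):
    print(f"{window}: {burn_threshold(window):.2%}")
# 1h: 1.44%
# 6h: 0.60%
```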

Excluded reasons (user/external-attributable)

script_failure, ci_quota_exceeded, builds_disabled, user_blocked, stale_schedule, forward_deployment_failure, failed_outdated_deployment_job, api_failure, downstream_pipeline_creation_failed, downstream_bridge_project_not_found, insufficient_bridge_permissions, protected_environment_failure, no_matching_runner, runner_unsupported, secrets_provider_not_found, ip_restriction_failure, deployment_rejected, duo_workflow_not_allowed, invalid_bridge_trigger, job_token_expired, pipeline_loop_detected, reached_max_descendant_pipelines_depth, trace_size_exceeded, unmet_prerequisites, upstream_bridge_project_not_found, job_execution_timeout, missing_dependency_failure
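When the exclusion list changes, the `reason!~"..."` matcher used in ad hoc queries has to be kept in sync by hand. A minimal sketch of generating the matcher from the list above, assuming the reasons are plain identifiers that need no regex escaping; the helper name `build_matcher` is hypothetical and not part of any GitLab tooling.

```python
# The 27 excluded reasons, copied from the list above in the same order.
EXCLUDED_REASONS = [
    "script_failure", "ci_quota_exceeded", "builds_disabled", "user_blocked",
    "stale_schedule", "forward_deployment_failure",
    "failed_outdated_deployment_job", "api_failure",
    "downstream_pipeline_creation_failed",
    "downstream_bridge_project_not_found", "insufficient_bridge_permissions",
    "protected_environment_failure", "no_matching_runner",
    "runner_unsupported", "secrets_provider_not_found",
    "ip_restriction_failure", "deployment_rejected",
    "duo_workflow_not_allowed", "invalid_bridge_trigger", "job_token_expired",
    "pipeline_loop_detected", "reached_max_descendant_pipelines_depth",
    "trace_size_exceeded", "unmet_prerequisites",
    "upstream_bridge_project_not_found", "job_execution_timeout",
    "missing_dependency_failure",
]


def build_matcher(reasons):
    """Join the reasons into a PromQL negative regex matcher on `reason`."""
    return 'reason!~"' + "|".join(reasons) + '"'


matcher = build_matcher(EXCLUDED_REASONS)
print(len(EXCLUDED_REASONS), "excluded reasons")
print(matcher)
```

The printed matcher can be pasted into a selector on `gitlab_ci_job_failure_reasons` to restrict a query to infra-attributable reasons.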

Alert Behavior

Severity: S3 (Slack-only, no paging)
Routes to: #s_verify_alerts
MWMBR pairs each long window with a short confirmation window (1h with 5m, 6h with 30m); both windows in a pair must breach at the same time for the alert to fire
The ratio can spike during incidents affecting runners or Gitaly
Silencing: Safe to silence during known runner fleet maintenance or Gitaly incidents where the root cause is already being addressed. Use Alertmanager silence with matchers type=ci-orchestration, component=job_infra_failure_ratio
Expected frequency: May fire during infrastructure incidents. Under normal conditions, the infra failure share is well below 0.1%
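The multi-window behavior can be sketched as follows. This assumes the usual multi-window, multi-burn-rate scheme, in which the 1h window is confirmed by the 5m window and the 6h window by the 30m window, and either pair is enough to fire. That pairing, the burn factors, and the function name are assumptions for illustration; the generated alert rule is the authority.

```python
ERROR_BUDGET = 0.001  # 1 - 0.999 SLO target

# (long window, short confirmation window, burn-rate factor).
# Assumed pairing and factors; check the generated alert rule.
WINDOW_PAIRS = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def mwmbr_fires(ratios):
    """ratios maps a window name to the infra failure share over that window."""
    for long_window, short_window, factor in WINDOW_PAIRS:
        threshold = factor * ERROR_BUDGET
        # Both windows of a pair must be above the same threshold.
        if ratios[long_window] > threshold and ratios[short_window] > threshold:
            return True
    return False


# Hypothetical readings: a brief spike that the long windows have not absorbed.
brief_spike = {"5m": 0.05, "30m": 0.02, "1h": 0.004, "6h": 0.002}
# Hypothetical readings: an elevated share sustained across the last hour.
sustained = {"5m": 0.03, "30m": 0.025, "1h": 0.02, "6h": 0.004}

print(mwmbr_fires(brief_spike))  # False
print(mwmbr_fires(sustained))    # True
```

Requiring the short window to agree with the long one is what keeps a short spike from alerting and lets the alert clear soon after recovery.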

Severities

Default severity is S3. Consider upgrading to S2 if:

Infra failure ratio > 5% sustained for > 30 minutes
A single failure reason dominates (e.g., runner_system_failure spike indicating fleet-wide runner issue)
Correlated with multiple customer reports
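The first criterion above (ratio above 5% sustained for more than 30 minutes) can be checked mechanically against a series of ratio samples. The following is a minimal sketch with invented sample data; the function name and the `(minute, ratio)` sample shape are assumptions, and the other two criteria still need human judgment.

```python
def sustained_above(samples, threshold=0.05, min_minutes=30):
    """Return True if the ratio stays above threshold for more than min_minutes.

    samples is a list of (minute, ratio) tuples sorted by time.
    """
    run_start = None
    for minute, ratio in samples:
        if ratio > threshold:
            if run_start is None:
                run_start = minute  # start of the current breach run
            if minute - run_start > min_minutes:
                return True
        else:
            run_start = None  # the run is broken, start over
    return False


# Hypothetical samples at 5-minute intervals.
steady_breach = [(m, 0.06) for m in range(0, 40, 5)]          # minutes 0..35
with_dip = [(m, 0.06 if m != 20 else 0.01) for m in range(0, 40, 5)]

print(sustained_above(steady_breach))  # True
print(sustained_above(with_dip))       # False
```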

Verification

# Current infra failure ratio (pre-aggregated)
gitlab_component_errors:ratio_5m{component="job_infra_failure_ratio", type="ci-orchestration", environment="gprd"}

# Breakdown by infra failure reason
sum by (reason) (sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m{environment="gprd", reason!~"script_failure|ci_quota_exceeded|builds_disabled|user_blocked|stale_schedule|forward_deployment_failure|failed_outdated_deployment_job|api_failure|downstream_pipeline_creation_failed|downstream_bridge_project_not_found|insufficient_bridge_permissions|protected_environment_failure|no_matching_runner|runner_unsupported|secrets_provider_not_found|ip_restriction_failure|deployment_rejected|duo_workflow_not_allowed|invalid_bridge_trigger|job_token_expired|pipeline_loop_detected|reached_max_descendant_pipelines_depth|trace_size_exceeded|unmet_prerequisites|upstream_bridge_project_not_found|job_execution_timeout|missing_dependency_failure"})

ci-orchestration service overview dashboard — burn rate panels
Pipeline Observability dashboard — Job Execution section — per-reason breakdown

Recent Changes

Troubleshooting

1. Identify the Dominant Failure Reason

Check the “Job failures - system-caused” panel on the Pipeline Observability dashboard. The top reason by volume tells you where to investigate:

Reason	Investigate
`runner_system_failure`	Runner fleet health, Kubernetes node issues, runner manager logs
`unknown_failure`	Runner-side issues (failure reason not recognized by server)
`stuck_or_timeout_failure`	Stuck builds cron, runner tag mismatches, plan-gating issues
`data_integrity_failure`	Database consistency, recent migrations
`scheduler_failure`	Sidekiq scheduling issues
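The lookup in the table above can be sketched as a small helper that picks the dominant reason from per-reason failure rates and returns the matching investigation hint. The rates below are invented, and `dominant_reason` is a hypothetical name; in practice the rates would come from the per-reason breakdown query or dashboard panel.

```python
# Investigation hints copied from the table above.
INVESTIGATE = {
    "runner_system_failure": "Runner fleet health, Kubernetes node issues, runner manager logs",
    "unknown_failure": "Runner-side issues (failure reason not recognized by server)",
    "stuck_or_timeout_failure": "Stuck builds cron, runner tag mismatches, plan-gating issues",
    "data_integrity_failure": "Database consistency, recent migrations",
    "scheduler_failure": "Sidekiq scheduling issues",
}


def dominant_reason(rates):
    """Return (reason, share of infra failures, investigation hint)."""
    total = sum(rates.values())
    if total == 0:
        return None, 0.0, "no infra-attributable failures in this sample"
    reason = max(rates, key=rates.get)
    hint = INVESTIGATE.get(reason, "not covered by the table above")
    return reason, rates[reason] / total, hint


# Hypothetical per-reason failure rates (failures per second).
rates = {
    "runner_system_failure": 0.75,
    "unknown_failure": 0.125,
    "stuck_or_timeout_failure": 0.125,
}
reason, share, hint = dominant_reason(rates)
print(f"{reason} ({share:.0%}): {hint}")
```

A single reason holding most of the share, as in this example, also matches one of the S2 upgrade criteria listed under Severities.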

2. Check Runner Fleet Health

CI Runners dashboard for runner availability
Look for node-level issues, pod evictions, or autoscaling problems

3. Check Gitaly Health

Gitaly overload causes job clone/fetch failures that surface as runner_system_failure:

Gitaly dashboard for latency and error rates

4. Check Registry/Dependency Proxy

Image pull failures also surface as runner_system_failure:

Registry dashboard for error rates

5. Check for Incident Correlation

Infrastructure incidents (Gitaly DDoS, runner fleet issues, database problems) often cause spikes in this SLI. Check #production and #incident-management for ongoing incidents.

Possible Resolutions

No past incidents have been recorded yet for this alert. This section will be updated as incidents occur.
