
CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation

This alert fires when the infra-attributable share of CI job failures exceeds its SLO burn rate threshold, indicating that an unusually high proportion of job failures are caused by infrastructure issues rather than user errors.

The job_infra_failure_ratio SLI uses gitlab_ci_job_failure_reasons as both numerator and denominator: the error rate is the fraction of total job failures where the reason label is not in the user/external-attributable exclusion list (e.g., script_failure, ci_quota_exceeded, no_matching_runner). The remaining reasons (runner_system_failure, stuck_or_timeout_failure, unknown_failure, etc.) are counted as infra-attributable.
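
Conceptually the ratio has the following raw-metric shape (an illustrative sketch only; the production SLO evaluates the pre-aggregated recording rules shown in the queries further down, and the reason regex stands for the exclusion list in the “Excluded reasons” section):

# Illustrative sketch: infra-attributable share of all job failures
sum(rate(gitlab_ci_job_failure_reasons{reason!~"<excluded reasons>"}[5m]))
/
sum(rate(gitlab_ci_job_failure_reasons[5m]))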

  • CI jobs failing for reasons outside user control
  • Reduced pipeline success rates across the platform
  • User frustration and wasted compute (retries)
  • Potential data integrity issues if data_integrity_failure is elevated

  • runner_system_failure: Runner panics, Kubernetes pod disruptions, trace patch failures
  • unknown_failure: Catch-all for unrecognized failure reasons (often runner-side)
  • stuck_or_timeout_failure: Jobs pending > 24h (mostly user misconfiguration, but includes genuinely stuck jobs)
  • data_integrity_failure: Internal consistency errors
  • scheduler_failure: Job scheduling errors
  • Gitaly overload causing clone/fetch failures in jobs
  • Registry/Dependency Proxy issues causing failed_to_pull_image (classified as runner_system_failure)

The SLI uses the same metric for both request rate and error rate:

  • Request rate: rate(gitlab_ci_job_failure_reasons[5m]) — all job failures
  • Error rate: rate(gitlab_ci_job_failure_reasons{reason!~"<27 excluded reasons>"}[5m]) — infra-attributable failures only
  • Emitted by: api, ci-jobs-api, sidekiq, web
  • SLO: 99.9% success rate (errorRatio: 0.999, meaning max 0.1% infra failure share)
  • MWMBR (multi-window, multi-burn-rate) fires at: > 0.6% infra share (6h window) / > 1.44% (1h window); see the burn-rate arithmetic after this list
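
These thresholds are burn-rate multiples of the 0.1% error budget: the 6h window fires at a 6x burn rate (6 * 0.1% = 0.6%) and the 1h window at a 14.4x burn rate (14.4 * 0.1% = 1.44%).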

Excluded reasons (user/external-attributable)


script_failure, ci_quota_exceeded, builds_disabled, user_blocked, stale_schedule, forward_deployment_failure, failed_outdated_deployment_job, api_failure, downstream_pipeline_creation_failed, downstream_bridge_project_not_found, insufficient_bridge_permissions, protected_environment_failure, no_matching_runner, runner_unsupported, secrets_provider_not_found, ip_restriction_failure, deployment_rejected, duo_workflow_not_allowed, invalid_bridge_trigger, job_token_expired, pipeline_loop_detected, reached_max_descendant_pipelines_depth, trace_size_exceeded, unmet_prerequisites, upstream_bridge_project_not_found, job_execution_timeout, missing_dependency_failure

  • Severity: S3 (Slack-only, no paging)
  • Routes to: #s_verify_alerts
  • MWMBR requires both the short window (5m/1h) and long window (30m/6h) to breach simultaneously
  • The ratio can spike during incidents affecting runners or Gitaly
  • Silencing: Safe to silence during known runner fleet maintenance or Gitaly incidents where the root cause is already being addressed. Use Alertmanager silence with matchers type=ci-orchestration, component=job_infra_failure_ratio
  • Expected frequency: May fire during infrastructure incidents. Under normal conditions, the infra failure share is well below 0.1%

Default severity is S3. Consider upgrading to S2 if:

  • Infra failure ratio > 5% sustained for > 30 minutes (see the sustained-ratio check after this list)
  • A single failure reason dominates (e.g., runner_system_failure spike indicating fleet-wide runner issue)
  • Correlated with multiple customer reports
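
A quick check for the first criterion, reusing the pre-aggregated ratio from the queries below (illustrative; the 0.05 threshold mirrors the 5% criterion):

# Sustained infra failure ratio over the last 30 minutes (> 5% suggests S2)
avg_over_time(gitlab_component_errors:ratio_5m{component="job_infra_failure_ratio", type="ci-orchestration", environment="gprd"}[30m]) > 0.05
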
# Current infra failure ratio (pre-aggregated)
gitlab_component_errors:ratio_5m{component="job_infra_failure_ratio", type="ci-orchestration", environment="gprd"}

# Breakdown by infra failure reason
sum by (reason) (sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m{environment="gprd", reason!~"script_failure|ci_quota_exceeded|builds_disabled|user_blocked|stale_schedule|forward_deployment_failure|failed_outdated_deployment_job|api_failure|downstream_pipeline_creation_failed|downstream_bridge_project_not_found|insufficient_bridge_permissions|protected_environment_failure|no_matching_runner|runner_unsupported|secrets_provider_not_found|ip_restriction_failure|deployment_rejected|duo_workflow_not_allowed|invalid_bridge_trigger|job_token_expired|pipeline_loop_detected|reached_max_descendant_pipelines_depth|trace_size_exceeded|unmet_prerequisites|upstream_bridge_project_not_found|job_execution_timeout|missing_dependency_failure"})

Check the “Job failures - system-caused” panel on the Pipeline Observability dashboard. The top reason by volume tells you where to investigate:

Reason                      Investigate
runner_system_failure       Runner fleet health, Kubernetes node issues, runner manager logs
unknown_failure             Runner-side issues (failure reason not recognized by server)
stuck_or_timeout_failure    Stuck builds cron, runner tag mismatches, plan-gating issues
data_integrity_failure      Database consistency, recent migrations
scheduler_failure           Sidekiq scheduling issues
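
To confirm numerically which reason dominates, a topk over the same recording rule works (illustrative; substitute the exclusion regex from the breakdown query above):

# Top infra-attributable failure reason by volume
topk(1, sum by (reason) (sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m{environment="gprd", reason!~"<exclusion regex from the breakdown query above>"}))
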
  • CI Runners dashboard for runner availability
  • Look for node-level issues, pod evictions, or autoscaling problems

Gitaly overload causes job clone/fetch failures that surface as runner_system_failure:
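
A rough correlation check (a sketch; it assumes Gitaly’s service type label in the pre-aggregated rules is gitaly):

# Gitaly error ratio; compare its spike timing against the job infra failure ratio
gitlab_component_errors:ratio_5m{type="gitaly", environment="gprd"}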

Image pull failures also surface as runner_system_failure:
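
A similar sketch for image pulls (it assumes the Container Registry’s type label is registry):

# Container Registry error ratio; image pull problems tend to show up here
gitlab_component_errors:ratio_5m{type="registry", environment="gprd"}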

Infrastructure incidents (Gitaly DDoS, runner fleet issues, database problems) often cause spikes in this SLI. Check #production and #incident-management for ongoing incidents.

No past incidents have been recorded yet for this alert. This section will be updated as incidents occur.

  • CI Runners: Runner fleet availability and health
  • Gitaly: Git clone/fetch operations within jobs
  • Container Registry: Image pulls for job containers
  • PostgreSQL (CI): Job state recording

  • Single failure reason > 20% of total failures
  • runner_system_failure spike correlated with runner fleet degradation
  • Alert persists > 1 hour with no identified cause

  • #s_verify_alerts (primary)
  • #g_runner (Runner team — for runner_system_failure)
  • #g_pipeline-execution (Pipeline Execution team)
  • #production (if S2+ severity)