CiOrchestrationServiceJobInfraFailureRatioErrorSLOViolation

This alert fires when the infra-attributable share of CI job failures exceeds its SLO burn rate threshold, indicating that an unusually high proportion of job failures are caused by infrastructure issues rather than user errors.

The job_infra_failure_ratio SLI uses gitlab_ci_job_failure_reasons as both numerator and denominator: the error rate is the fraction of total job failures whose reason label matches an explicit positive list of infra-attributable reasons (e.g., runner_system_failure, scheduler_failure, data_integrity_failure). All other reasons — user/external errors like script_failure, and ambiguous reasons like unknown_failure — only contribute to the denominator. They are tracked separately in the Pipeline Observability dashboard’s “Job failures - others” panel for anomaly detection.

The positive list lives in metrics-catalog/services/lib/ci-job-failure-reasons.libsonnet (systemCausedReasons) and is the single source of truth used by both the SLI and the dashboard.
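As a rough illustration of how the ratio described above behaves, the sketch below computes the infra-attributable share from a snapshot of failure counts. The counts are made-up illustrative numbers, and `INFRA_REASONS` is a hand-copied subset for the example; the real list is `systemCausedReasons` in the libsonnet file.

```python
# Minimal sketch of the SLI arithmetic: infra failures / all failures.
# INFRA_REASONS is an illustrative subset, not the authoritative list.
INFRA_REASONS = {
    "runner_system_failure",
    "scheduler_failure",
    "data_integrity_failure",
}


def infra_failure_share(failures_by_reason, infra_reasons):
    """Return the fraction of all job failures whose reason is infra-attributable."""
    total = sum(failures_by_reason.values())
    if total == 0:
        return 0.0  # no failures at all, so nothing to attribute
    infra = sum(
        count for reason, count in failures_by_reason.items()
        if reason in infra_reasons
    )
    return infra / total


# Hypothetical counts: user and ambiguous reasons only grow the denominator.
sample = {
    "script_failure": 900,
    "unknown_failure": 60,
    "runner_system_failure": 30,
    "scheduler_failure": 10,
}
share = infra_failure_share(sample, INFRA_REASONS)
print(f"infra share: {share:.1%}")  # 40 of 1000 failures
```

Note that a surge in `script_failure` lowers the ratio even if the absolute number of infra failures is unchanged, which is why the non-infra reasons are watched separately.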

  • CI jobs failing for reasons outside user control
  • Reduced pipeline success rates across the platform
  • User frustration and wasted compute (retries)
  • Potential data integrity issues if data_integrity_failure is elevated

Each entry is a reason in systemCausedReasons and a typical underlying cause:

  • runner_system_failure: Runner panics, Kubernetes pod disruptions, trace patch failures, network egress issues
  • scheduler_failure: Sidekiq job-scheduling failures
  • data_integrity_failure: Internal consistency errors
  • environment_creation_failure: Failure to create a deployment environment
  • job_router_failure: Internal Job Router service failures
  • stuck_pending_with_matching_runners: Jobs stuck pending despite available matching runners
  • no_updates_running / no_updates_canceling: Job state machine not receiving heartbeats
  • stale_schedule: Delayed jobs (when: delayed) left in scheduled state > 1h past their scheduled_at, dropped by Ci::StuckBuilds::DropScheduledService
  • stuck_or_timeout_failure (legacy): retained for backward compatibility — pre-19.0 in-flight jobs and historical data (gitlab#595752)
  • Gitaly overload causing clone/fetch failures (surfaces as runner_system_failure)
  • Registry/Dependency Proxy issues causing image-pull failures (surfaces as runner_system_failure)

The SLI uses the same metric for both request rate and error rate:

  • Request rate: rate(gitlab_ci_job_failure_reasons[5m]) — all job failures
  • Error rate: rate(gitlab_ci_job_failure_reasons{reason=~"<10 infra-attributable reasons>"}[5m]) — failures whose reason matches the positive include list (source of truth: services/lib/ci-job-failure-reasons.libsonnet)
  • Emitted by: api, sidekiq, web
  • SLO: 95% success rate (errorRatio: 0.95, meaning max 5% infra failure share)
  • MWMBR (multi-window, multi-burn-rate) alert fires at: > 30% infra share (6h window) / > 72% (1h window; in practice the 6h window dominates)
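The 30% and 72% figures above are consistent with multiplying the 5% error budget by burn-rate factors of 6 (6h window) and 14.4 (1h window). Those factors are an assumption here: they are the commonly used multi-window burn-rate values and are not stated in this runbook. A short sketch of the arithmetic:

```python
# Sketch of the burn-rate threshold arithmetic, under stated assumptions.
SLO_SUCCESS = 0.95              # from the SLO above
ERROR_BUDGET = 1 - SLO_SUCCESS  # maximum tolerated infra failure share (5%)

# Assumed burn-rate factors per long window (not taken from this runbook).
BURN_RATES = {"1h": 14.4, "6h": 6.0}

thresholds = {window: factor * ERROR_BUDGET for window, factor in BURN_RATES.items()}

for window, threshold in thresholds.items():
    print(f"{window} window: alert when infra share > {threshold:.0%}")
```

Under these assumptions the 1h threshold works out to 72% and the 6h threshold to 30%, matching the values listed.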

The error rate counter matches these reasons, defined as systemCausedReasons in metrics-catalog/services/lib/ci-job-failure-reasons.libsonnet:

runner_system_failure, scheduler_failure, stuck_pending_with_matching_runners, no_updates_running, no_updates_canceling, data_integrity_failure, environment_creation_failure, job_router_failure, stale_schedule, stuck_or_timeout_failure (legacy — kept for backward compatibility)
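When writing ad-hoc queries, the regex matcher can be generated from the list rather than typed by hand. The sketch below is a convenience helper, not part of the metrics catalog: the Python list is a manual copy of the reasons above (the libsonnet file remains the source of truth), and `breakdown_query` is a hypothetical function name.

```python
# Manual copy of the reasons listed above; keep in sync with systemCausedReasons.
SYSTEM_CAUSED_REASONS = [
    "runner_system_failure",
    "scheduler_failure",
    "stuck_pending_with_matching_runners",
    "no_updates_running",
    "no_updates_canceling",
    "data_integrity_failure",
    "environment_creation_failure",
    "job_router_failure",
    "stale_schedule",
    "stuck_or_timeout_failure",  # legacy
]

# Recording-rule name as used in the verification queries in this runbook.
METRIC = "sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m"

# PromQL regex alternation over all infra-attributable reasons.
matcher = "|".join(SYSTEM_CAUSED_REASONS)


def breakdown_query(operator, environment="gprd"):
    """Build a per-reason breakdown query; operator is '=~' (infra) or '!~' (others)."""
    return (
        f'sum by (reason) ({METRIC}'
        f'{{environment="{environment}", reason{operator}"{matcher}"}})'
    )


infra_query = breakdown_query("=~")
other_query = breakdown_query("!~")
print(infra_query)
print(other_query)
```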

Every other reason value is treated as non-infra and contributes only to the denominator. Notable examples:

  • User/external-attributable: script_failure, ci_quota_exceeded, no_matching_runner, runner_unsupported, job_execution_timeout, missing_dependency_failure, etc.
  • Ambiguous: unknown_failure (catch-all for unrecognised runner-side failure reasons)
  • Runner-classified (introduced via gitlab#595703): runner_configuration_error, runner_external_dependency_failure, runner_interrupted

These appear in the Pipeline Observability dashboard’s “Job failures - others” panel for separate anomaly detection.

  • Severity: S3 (Slack-only, no paging)
  • Routes to: #s_verify_alerts
  • MWMBR requires both the short window (5m/1h) and long window (30m/6h) to breach simultaneously
  • The ratio can spike during incidents affecting runners or Gitaly
  • Silencing: Safe to silence during known runner fleet maintenance or Gitaly incidents where the root cause is already being addressed. Use Alertmanager silence with matchers type=ci-orchestration, component=job_infra_failure_ratio
  • Expected frequency: May fire during infrastructure incidents. Under normal conditions, the infra failure share is well below the 5% SLO ceiling
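For the silencing step above, a silence can also be created through the Alertmanager v2 API by POSTing a JSON body to `/api/v2/silences`. The sketch below only builds that body with the two matchers named above; the Alertmanager URL, authentication, creator, and comment are deployment-specific and the values shown are placeholders.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(duration_hours, created_by, comment, now=None):
    """Build an Alertmanager v2 silence body for this alert's matchers."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=duration_hours)
    return {
        "matchers": [
            {"name": "type", "value": "ci-orchestration",
             "isRegex": False, "isEqual": True},
            {"name": "component", "value": "job_infra_failure_ratio",
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": start.isoformat(),  # RFC 3339 timestamp with UTC offset
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Placeholder creator and comment; link the incident in real use.
payload = build_silence(2, "oncall@example.com",
                        "Runner fleet maintenance, root cause known")
print(json.dumps(payload, indent=2))
```

Keeping the duration short and stating the root cause in the comment makes it easier to tell later whether the silence is still justified.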

Default severity is S3. Consider upgrading to S2 if:

  • Infra failure ratio > 5% sustained for > 30 minutes
  • A single failure reason dominates (e.g., runner_system_failure spike indicating fleet-wide runner issue)
  • Correlated with multiple customer reports

# Current infra failure ratio (pre-aggregated)
gitlab_component_errors:ratio_5m{component="job_infra_failure_ratio", type="ci-orchestration", environment="gprd"}

# Breakdown by infra failure reason (matches the SLI's positive include list)
sum by (reason) (sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m{environment="gprd", reason=~"runner_system_failure|scheduler_failure|stuck_pending_with_matching_runners|no_updates_running|no_updates_canceling|data_integrity_failure|environment_creation_failure|job_router_failure|stale_schedule|stuck_or_timeout_failure"})

# All other reasons (non-infra), for context
sum by (reason) (sli_aggregations:gitlab_ci_job_failure_reasons:rate_5m{environment="gprd", reason!~"runner_system_failure|scheduler_failure|stuck_pending_with_matching_runners|no_updates_running|no_updates_canceling|data_integrity_failure|environment_creation_failure|job_router_failure|stale_schedule|stuck_or_timeout_failure"})

Check the “Job failures - system-caused” panel on the Pipeline Observability dashboard. The top reason by volume tells you where to investigate:

| Reason | Investigate |
| --- | --- |
| runner_system_failure | Runner fleet health, Kubernetes node issues, runner manager logs, network egress |
| scheduler_failure | Sidekiq scheduling issues |
| data_integrity_failure | Database consistency, recent migrations |
| environment_creation_failure | Deployment-environment APIs, infrastructure provisioning |
| job_router_failure | Internal Job Router service |
| stuck_pending_with_matching_runners / no_updates_running / no_updates_canceling | Runner heartbeat / job-state-machine issues |
| stale_schedule | Delayed jobs (when: delayed) left in scheduled state > 1h past their scheduled_at, dropped by Ci::StuckBuilds::DropScheduledService. Check the stuck-builds cron and BuildScheduleWorker processing |
| stuck_or_timeout_failure (legacy) | Stuck builds cron, runner tag mismatches, plan-gating issues (pre-19.0 jobs only) |

unknown_failure is not counted by this SLI. If it dominates on the “Job failures - others” panel, investigate runner-side failure reasons not recognised by Rails (often pointing to a new Runner version emitting an unmapped string).
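If the per-reason breakdown has been exported from Prometheus, ranking it is a quick way to pick the row to start from. The sketch below assumes the rates are already available as a plain mapping; the numbers are invented for illustration and `rank_reasons` is a hypothetical helper.

```python
def rank_reasons(rates_by_reason):
    """Sort reasons by rate (highest first) and attach each reason's share of the total."""
    total = sum(rates_by_reason.values())
    ranked = sorted(rates_by_reason.items(), key=lambda item: item[1], reverse=True)
    return [
        (reason, rate, rate / total if total else 0.0)
        for reason, rate in ranked
    ]


# Hypothetical per-reason failure rates (failures per second).
sample_rates = {
    "scheduler_failure": 0.25,
    "runner_system_failure": 1.5,
    "stale_schedule": 0.25,
}

ranking = rank_reasons(sample_rates)
for reason, rate, share in ranking:
    print(f"{reason}: {rate:.2f}/s ({share:.0%} of infra failures)")
```

In this made-up sample `runner_system_failure` accounts for three quarters of the infra failures, so the runner fleet would be the first place to look.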

  • CI Runners dashboard for runner availability
  • Look for node-level issues, pod evictions, or autoscaling problems

Gitaly overload causes job clone/fetch failures that surface as runner_system_failure.

Image pull failures caused by Registry or Dependency Proxy issues also surface as runner_system_failure.

Infrastructure incidents (Gitaly DDoS, runner fleet issues, database problems) often cause spikes in this SLI. Check #production and #incident-management for ongoing incidents.

No past incidents have been recorded yet for this alert. This section will be updated as incidents occur.

Dependencies:

  • CI Runners: Runner fleet availability and health
  • Gitaly: Git clone/fetch operations within jobs
  • Container Registry: Image pulls for job containers
  • PostgreSQL (CI): Job state recording

Escalate when:

  • Single failure reason > 20% of total failures
  • runner_system_failure spike correlated with runner fleet degradation
  • Alert persists > 1 hour with no identified cause

Escalation channels:

  • #s_verify_alerts (primary)
  • #g_runner (Runner team, for runner_system_failure)
  • #g_pipeline-execution (Pipeline Execution team)
  • #production (if S2+ severity)