
CiRunnersServicePollingErrorSLOViolation

This alert indicates that CI Runners are experiencing elevated error rates when requesting jobs from GitLab. The runners make API requests to check for available work, and this alert fires when these requests fail at a rate exceeding our SLO.
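As a rough sketch, the alerting condition boils down to the failure-to-request ratio exceeding a threshold over the evaluation window. The query below uses the counters referenced later on this page; the 0.5% threshold is an illustrative assumption, not the production SLO value:

sum(rate(gitlab_runner_request_failures_total{environment="gprd"}[5m])) /
sum(rate(gitlab_runner_requests_total{environment="gprd"}[5m])) > 0.005  # threshold is an assumption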

Impact:

  • Delayed job execution
  • Increased pipeline duration
  • Potential runner scaling issues
  • Service degradation for CI/CD

Contributing factors:

  • Network connectivity issues
  • Database performance problems
  • Runner manager resource saturation
  • API endpoint availability issues
  • GCP quota limitations
  • Bugs introduced by a recent deployment

CI runner error ratios

Error ratio across all runner API requests, as a percentage:

sum(rate(gitlab_runner_request_failures_total{environment="gprd"}[5m])) /
sum(rate(gitlab_runner_requests_total{environment="gprd"}[5m])) * 100

API request 5xx statuses, broken down by status and endpoint:

sum(rate(gitlab_runner_api_request_statuses_total{status=~"5.."}[5m])) by (status, endpoint)
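When the ratio is elevated, narrowing the failures down to individual runner managers helps decide whether a single shard or the whole fleet is affected. A per-instance breakdown of the same failure counter is a reasonable starting point (this sketch assumes the instance label is present on the metric):

sum(rate(gitlab_runner_request_failures_total{environment="gprd"}[5m])) by (instance)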

SQL time spent serving runner job requests (POST /api/jobs/request):

controller_action:gitlab_sql_duration_seconds_sum:rate1m{env="gprd",type="api",action="POST /api/jobs/request",controller="Grape"}

Jobs currently pulling Docker images, per runner manager:

sum(gitlab_runner_jobs{executor_stage="docker_pulling_image"}) by (instance)

Jobs currently running (docker_run stage), per runner manager:

sum(gitlab_runner_jobs{executor_stage="docker_run"}) by (instance)

Jobs by executor stage:

sum(gitlab_runner_jobs) by (executor_stage)
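For a quick overview of where jobs are piling up across the fleet, the same gauge can be grouped by both runner manager and executor stage (a sketch assuming both labels are available on gitlab_runner_jobs):

sum(gitlab_runner_jobs) by (instance, executor_stage)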

This alert:

  • Triggers on sustained polling errors
  • May auto-resolve if temporary
  • Often correlates with application issues
  • Can indicate broader network problems

Common patterns from incidents:

  • Network routing changes
  • Database performance issues
  • Runner manager scaling events
  • API endpoint availability
  • GCP quota limitations
  • Recent deployments

Default severity is ~severity::3 if there is limited pipeline impact, but it should be upgraded to ~severity::2 if multiple runner managers are affected.

Dependencies:

  • PostgreSQL database
  • Runner manager VMs
  • Internal load balancers
  • GCP infrastructure

Escalate if:

  • The alert persists for >30 minutes.
  • Multiple runner shards are affected.
  • There is significant impact on pipeline completion times.

Relevant Slack channels:

  • #production
  • #g_hosted_runners
  • #g_runner
  • #f_hosted_runners_on_linux

  • Alert Definition
  • Tuning Considerations: Adjust thresholds based on peak usage patterns and user feedback.
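When revisiting the threshold, the historical error ratio over a longer window gives a sense of normal peak behaviour. The subquery below is an illustrative sketch, not part of the alert definition:

max_over_time(
  (
    sum(rate(gitlab_runner_request_failures_total{environment="gprd"}[5m])) /
    sum(rate(gitlab_runner_requests_total{environment="gprd"}[5m]))
  )[7d:5m]
)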