CiRunnersServicePollingErrorSLOViolation
Overview
This alert indicates that CI Runners are experiencing elevated error rates when requesting jobs from GitLab. The runners make API requests to check for available work, and this alert fires when these requests fail at a rate exceeding our SLO.
Impact:
- Delayed job execution
- Increased pipeline duration
- Potential runner scaling issues
- Service degradation for CI/CD
Contributing factors:
- Network connectivity issues
- Database performance problems
- Runner manager resource saturation
- API endpoint availability issues
- GCP quota limitations
- Bugs introduced by a recent deployment
Services
Key Dashboards
Metrics
Primary Alert Metrics
sum(rate(gitlab_runner_request_failures_total{environment="gprd"}[5m])) / sum(rate(gitlab_runner_requests_total{environment="gprd"}[5m])) * 100
sum(rate(gitlab_runner_api_request_statuses_total{status=~"5.."}[5m])) by (status, endpoint)
controller_action:gitlab_sql_duration_seconds_sum:rate1m{env="gprd",type="api",action="POST /api/jobs/request",controller="Grape"}
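To localize which runner managers are contributing to the failures, the same error ratio can be broken down per manager. This is a minimal sketch, assuming the instance label is present on both counters (verify the label set in Thanos/Grafana before relying on it):
sum(rate(gitlab_runner_request_failures_total{environment="gprd"}[5m])) by (instance) / sum(rate(gitlab_runner_requests_total{environment="gprd"}[5m])) by (instance) * 100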
Job Processing Status
- An increase in the number of jobs in the Pending state leads to a build-up of the pending jobs queue.
- Jobs in the pulling stage: check for an accumulation of Docker image pulls.
sum(gitlab_runner_jobs{executor_stage="docker_pulling_image"}) by (instance)
sum(gitlab_runner_jobs{executor_stage="docker_run"}) by (instance)
sum(gitlab_runner_jobs) by (executor_stage)
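To see which executor stage is accumulating jobs over time rather than at a single point, a trend query on the same gauge can help. A minimal sketch using PromQL's deriv() over a 10-minute window (adjust the window to the incident timeline):
sum(deriv(gitlab_runner_jobs[10m])) by (executor_stage)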
Alert Behavior
This alert:
- Triggers on sustained polling errors
- May auto-resolve if the underlying cause is temporary
- Often correlates with application issues
- Can indicate broader network problems
Common patterns from incidents:
- Network routing changes
- Database performance issues
- Runner manager scaling events
- API endpoint availability
- GCP quota limitations
- Recent deployments
Severities
The default severity is ~severity::3 if there is limited pipeline impact, but it should be upgraded to ~severity::2 if multiple runner managers are affected.
Recent Incidents
- CI Runners increase in error ratios
- Polling SLI of the ci-runners service has an error rate violating SLO (ops)
- The polling SLI of the ci-runners service (main stage) has an error rate violating SLO
Recent changes
- Recent CI runners Production Change/Incident Issues
- Recent chef-repo Changes
- Recent k8s-workloads Changes
Dependencies
- PostgreSQL database
- Runner manager VMs
- Internal load balancers
- GCP infrastructure
Escalation
When to Escalate
- Alert persists for >30 minutes.
- Multiple runner shards affected.
- Significant impact on pipeline completion times.
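Before escalating for multiple affected shards, a quick per-shard breakdown of the failure rate can confirm the blast radius. A hedged sketch, assuming the failure counter carries a shard label (group by instance instead if it does not):
sum(rate(gitlab_runner_request_failures_total{environment="gprd"}[5m])) by (shard)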
Support Channels
- #production Slack channel
- #g_hosted_runners Slack channel
- #g_runner Slack channel
- #f_hosted_runners_on_linux Slack channel
Definitions
- Alert Definition
- Tuning Considerations: Adjust thresholds based on peak usage patterns and user feedback.
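When tuning, it can help to preview how a candidate threshold behaves against recent data. A minimal sketch of the SLO comparison with a hypothetical 5% error-ratio threshold (the authoritative value lives in the alert definition):
(sum(rate(gitlab_runner_request_failures_total{environment="gprd"}[5m])) / sum(rate(gitlab_runner_requests_total{environment="gprd"}[5m]))) * 100 > 5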