ci-apdex-violating-slo
Runner Manager’s queues violating the SLI of the ci-runners service
To check the overall health of the runners:
- Check the CI-Runners standard SLI dashboard to gauge the impact of the degradation.
- Note that the job queue charts are inaccurate in the following ways, which are tracked in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/12850 and https://gitlab.com/gitlab-org/gitlab/-/merge_requests/19517:
  - They are outdated, because gitlab_exporter is pointed at the archive replica (which lags behind).
  - They are incomplete, because most of the time the Postgres queries that pull this data time out.
- Job queue duration histogram percentiles may also point to a degradation. Note that these only cover jobs that have already been picked up by a runner.
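To make the caveat about histogram percentiles concrete, here is a minimal sketch of how a percentile can be estimated from cumulative histogram buckets using linear interpolation within a bucket. The function name, bucket bounds, and counts are hypothetical and chosen only for illustration; they are not taken from the ci-runners dashboards. Because only jobs that a runner has picked up are counted in the buckets, jobs still waiting in the queue do not move the estimate at all.

```python
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound. The last bound may be float("inf").
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")

    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations, so no meaningful estimate

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf") or count == prev_count:
                # Cannot interpolate into an unbounded or empty bucket;
                # fall back to the lower edge.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical queue-duration buckets (seconds) for jobs already picked up.
picked_up_jobs = [(1.0, 50.0), (5.0, 90.0), (30.0, 100.0), (float("inf"), 100.0)]
p95 = histogram_quantile(0.95, picked_up_jobs)
```

If many jobs are stuck in the queue and never picked up, an estimate like `p95` can look healthy while the real wait time is much worse, which is why the percentiles should be read together with the SLI dashboard.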
This alert has several possible causes. In the first few minutes, it is important to determine the high-level cause before investigating further. The three common causes of this alert are:
GCP Quotas causing scaling issues
Look for quota-exceeded errors in the logs to determine whether we are hitting any GCP gitlab-ci project quotas that are causing scaling issues: https://log.gprd.gitlab.net/goto/8f65b43718b6e95ccf5f6972e7ca1887
Check the Quotas Runbook for more details.
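When working from exported log lines rather than the saved search, a quick tally of which quota is being hit can help narrow things down. The sketch below is only an illustration: the message shape (`Quota '<NAME>' exceeded`), the `QUOTA_EXCEEDED` fallback marker, the function name, and the sample lines are all assumptions, and the real runner manager log format may differ.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed message shape, e.g. "Quota 'CPUS' exceeded. Limit: ...".
QUOTA_RE = re.compile(r"Quota '(?P<name>[A-Za-z0-9_\-]+)' exceeded", re.IGNORECASE)


def count_quota_errors(lines: Iterable[str]) -> Counter:
    """Count quota-exceeded log lines, grouped by quota name when available."""
    counts: Counter = Counter()
    for line in lines:
        match = QUOTA_RE.search(line)
        if match:
            counts[match.group("name").upper()] += 1
        elif "quota_exceeded" in line.lower():
            # The error code is present but the quota name could not be parsed.
            counts["UNKNOWN"] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample_lines = [
    "ERROR creating machine: Quota 'CPUS' exceeded. Limit: 24.0 in region us-east1.",
    "ERROR creating machine: Quota 'CPUS' exceeded. Limit: 24.0 in region us-east1.",
    "ERROR creating machine: Quota 'IN_USE_ADDRESSES' exceeded.",
    "WARN request failed with code QUOTA_EXCEEDED",
    "INFO machine created",
]
quota_counts = count_quota_errors(sample_lines)
```

A tally dominated by a single quota name suggests which limit to look up in the Quotas Runbook first.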
If we believe there is a GCP scaling or quota issue:
- Contact the Runner team 24/7 using this contact sheet
Database issue or API Errors / Saturation
- Check the Patroni overview
- Check the API overview
- Check /api/job/request timings in Thanos
- Check API requests for 500 errors
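As a rough way to quantify "API requests with 500 errors" from an exported sample of request logs, the share of server errors for a single endpoint can be computed as below. This is a minimal sketch: the `(path, status)` record shape, the function name, and the sample data are hypothetical and do not reflect the actual API log schema.

```python
from typing import Iterable, Tuple


def server_error_ratio(requests: Iterable[Tuple[str, int]], path: str) -> float:
    """Return the fraction of requests to `path` that ended in a 5xx status."""
    total = 0
    errors = 0
    for request_path, status in requests:
        if request_path != path:
            continue  # only consider the endpoint under investigation
        total += 1
        if 500 <= status <= 599:
            errors += 1
    return errors / total if total else 0.0


# Hypothetical sample of (path, status) pairs, for illustration only.
sample_requests = [
    ("/api/job/request", 201),
    ("/api/job/request", 204),
    ("/api/job/request", 500),
    ("/api/job/request", 204),
    ("/api/other", 500),
]
ratio = server_error_ratio(sample_requests, "/api/job/request")
```

A ratio well above the usual baseline for the job request endpoint points toward an API or database problem rather than a runner scaling problem.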
If we believe there is a problem with PostgreSQL:
- Notify the DBRE @Jose Finotto with a link to the incident channel
- Page Ongres support by creating an incident in PD
See https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/ci-runners/ci-abuse-handling.md