
HostedRunnersServiceCiRunnerJobsApdexSLOViolationSingleShard

This alert triggers when the Apdex score for GitLab Hosted Runners drops below the predefined threshold, signaling potential performance degradation. Because Apdex for runners is calculated primarily from job queue times, a violation usually means runners are not picking up jobs within the expected time, which degrades the user experience.

Possible Causes

  • Traffic spikes: Unexpected traffic can lead to resource exhaustion (e.g., CPU, memory).
  • Database issues: Slow queries, connection problems, or database performance degradation.
  • Recent deployments: New code releases could introduce bugs or performance problems.
  • Network or server problems: Performance impacted by underlying infrastructure issues.

General Troubleshooting Steps

  1. Identify slow requests via SLI metrics
     • Review Service Level Indicators (SLIs) to identify metrics with elevated request times.
     • Examine logs and metrics around these slow requests to understand the performance degradation.
     • Check API requests for 500 errors (an error-ratio variant is sketched below): sum(increase(gitlab_runner_api_request_statuses_total{status=~"5.."}[5m])) by (status, endpoint)
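
     To judge how widespread the errors are, it can help to look at the share of 5xx responses rather than the raw count. This is a minimal sketch using the same gitlab_runner_api_request_statuses_total metric and labels as the query above; adjust the label set to match what the exporter actually emits.

     # Fraction of runner API requests returning 5xx over the last 5 minutes, per endpoint
     sum by (endpoint) (increase(gitlab_runner_api_request_statuses_total{status=~"5.."}[5m]))
       /
     sum by (endpoint) (increase(gitlab_runner_api_request_statuses_total[5m]))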

  2. Job Queue
     • Pending job queue duration histogram percentiles may also point to a degradation (example queries are sketched at the end of this step). Note that this metric is only calculated when jobs are actually picked up, so during a total hosted runner outage the queue will appear empty until the runners start processing jobs again, at which point a large queue will suddenly appear.

    If there is an increase in the pending job queue and runner saturation of concurrent is higher than 80%, you may be nearing the concurrent job limit for that runner stack. If so, you can increase the number of concurrent jobs that can be processed at a time by raising that runner stack's scaleMax. scaleMax in a runner model is equivalent to concurrent and max_instances in the runner's config.toml.

    To increase scaleMax, go into Switchboard, open that runner's model, open the runner model overrides section, increase scaleMax, save, and redeploy that runner by running the provision, shutdown, and cleanup jobs for it. It is important that you make this change in the overrides section, not directly in the runner model.

    Then you should be able to go back into the metrics in Grafana and see an increased concurrent job limit, decreased runner saturation of concurrent, and a decrease in the pending job queue duration histogram percentiles. Note that the active runner will have switched from blue to green or vice versa, so you may need to select a different runner in the dashboard dropdown to see the changes.
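
    These sketches show one way to query the signals mentioned above. The metric names are assumptions based on the standard GitLab and gitlab-runner Prometheus exporters (job_queue_duration_seconds, gitlab_runner_jobs, gitlab_runner_concurrent); check the relevant Grafana panels for the exact series and labels used for hosted runners.

    # p95 pending job queue duration (assumed histogram name; verify against the dashboard panel)
    histogram_quantile(0.95, sum by (le) (rate(job_queue_duration_seconds_bucket[5m])))

    # Runner saturation of concurrent: jobs in flight vs. the configured concurrent limit
    sum(gitlab_runner_jobs) / sum(gitlab_runner_concurrent)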

  3. Review logs and metrics
     • Logs: Search for errors, timeouts, or slow queries related to the affected services.
     • Metrics: Use Prometheus/Grafana to observe CPU, memory, and network utilization metrics for anomalies (a sketch follows below).
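
     If node-level metrics are scraped with node_exporter (an assumption for this environment), host saturation can be checked with standard queries like these:

     # CPU utilization per runner-manager host (1 = fully busy)
     1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

     # Available memory as a fraction of total, per host
     node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes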

  4. Investigate recent deployments
     • Identify whether any recent code, configuration changes, or infrastructure updates have occurred.
     • Roll back or redeploy services if the issue is related to a faulty deployment.

  5. Examine traffic patterns and spikes
     • Analyze traffic logs and monitoring dashboards for unusual spikes (a week-over-week comparison is sketched below).
     • Assess whether traffic surges correlate with the Apdex violations and resource exhaustion.
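
     As a rough proxy for traffic, the API request metric from step 1 can be compared against the same time last week; treating it as a traffic signal is an assumption, so cross-check against the dashboards normally used for request volume.

     # Current runner API request rate relative to one week ago (values well above 1 suggest a spike)
     sum(rate(gitlab_runner_api_request_statuses_total[5m]))
       /
     sum(rate(gitlab_runner_api_request_statuses_total[5m] offset 1w))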