
HostedRunnersServiceCiRunnerJobsApdexSLOViolationSingleShard

This alert triggers when the Apdex score for GitLab Hosted Runners drops below the predefined threshold, signaling potential performance degradation. Because Apdex for runners is calculated primarily from job queue times, a violation usually means runners are not picking up jobs within the expected time, which degrades the user experience.

Possible Causes

  • Traffic spikes: Unexpected traffic can lead to resource exhaustion (e.g., CPU, memory).
  • Database issues: Slow queries, connection problems, or database performance degradation.
  • Recent deployments: New code releases could introduce bugs or performance problems.
  • Network or server problems: Performance impacted by underlying infrastructure issues.

General Troubleshooting Steps

  1. Identify slow requests via SLI metrics
     • Review Service Level Indicators (SLIs) to identify metrics with elevated request times.
     • Examine logs and metrics around these slow requests to understand the performance degradation.
     • Check API requests for 500 errors (an error-ratio variant is sketched below): sum(increase(gitlab_runner_api_request_statuses_total{status=~"5.."}[5m])) by (status, endpoint)
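
     To judge how widespread the errors are, it can help to look at the share of 5xx responses rather than the raw count. This is a minimal sketch using the same gitlab_runner_api_request_statuses_total metric and labels as the query above; adjust the label set to match what the exporter actually emits.

     # Fraction of runner API requests returning 5xx over the last 5 minutes, per endpoint
     sum by (endpoint) (increase(gitlab_runner_api_request_statuses_total{status=~"5.."}[5m]))
       /
     sum by (endpoint) (increase(gitlab_runner_api_request_statuses_total[5m]))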

  2. Job Queue
     • Pending job queue duration histogram percentiles may also point to a degradation (example queries are sketched at the end of this step). Note that this metric is only calculated when jobs are actually picked up, so during a total hosted runner outage the queue will appear empty until the runners start processing jobs again, at which point a large queue will suddenly appear.

    If there is an increase in the pending job queue and runner saturation of concurrent is higher than 80%, you may be nearing the concurrent job limit for that runner stack. If so, you can increase the number of concurrent jobs that can be processed at a time by raising that runner stack's scaleMax. scaleMax in a runner model is equivalent to concurrent and max_instances in the runner's config.toml.

    To increase scaleMax, go into Switchboard, open that runner's model, open the runner model overrides section, increase scaleMax, save, and redeploy that runner by running the provision, shutdown, and cleanup jobs for it. It is important that you make this change in the overrides section, not directly in the runner model.

    Then you should be able to go back into the metrics in Grafana and see an increased concurrent job limit, decreased runner saturation of concurrent, and a decrease in the pending job queue duration histogram percentiles. Note that the active runner will have switched from blue to green or vice versa, so you may need to select a different runner in the dashboard dropdown to see the changes.
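
    These sketches show one way to query the signals mentioned above. The metric names are assumptions based on the standard GitLab and gitlab-runner Prometheus exporters (job_queue_duration_seconds, gitlab_runner_jobs, gitlab_runner_concurrent); check the relevant Grafana panels for the exact series and labels used for hosted runners.

    # p95 pending job queue duration (assumed histogram name; verify against the dashboard panel)
    histogram_quantile(0.95, sum by (le) (rate(job_queue_duration_seconds_bucket[5m])))

    # Runner saturation of concurrent: jobs in flight vs. the configured concurrent limit
    sum(gitlab_runner_jobs) / sum(gitlab_runner_concurrent)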

  3. Review logs and metrics
     • Logs: Search for errors, timeouts, or slow queries related to the affected services.
     • Metrics: Use Prometheus/Grafana to observe CPU, memory, and network utilization metrics for anomalies (a sketch follows below).
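
     If node-level metrics are scraped with node_exporter (an assumption for this environment), host saturation can be checked with standard queries like these:

     # CPU utilization per runner-manager host (1 = fully busy)
     1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

     # Available memory as a fraction of total, per host
     node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes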

  4. Investigate recent deployments
     • Identify whether any recent code, configuration changes, or infrastructure updates have occurred.
     • Roll back or redeploy services if the issue is related to a faulty deployment.

  5. Examine traffic patterns and spikes
     • Analyze traffic logs and monitoring dashboards for unusual spikes (a week-over-week comparison is sketched below).
     • Assess whether traffic surges correlate with the Apdex violations and resource exhaustion.
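
     As a rough proxy for traffic, the API request metric from step 1 can be compared against the same time last week; treating it as a traffic signal is an assumption, so cross-check against the dashboards normally used for request volume.

     # Current runner API request rate relative to one week ago (values well above 1 suggest a spike)
     sum(rate(gitlab_runner_api_request_statuses_total[5m]))
       /
     sum(rate(gitlab_runner_api_request_statuses_total[5m] offset 1w))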