# HostedRunnersServiceApiRequestsErrorSLOViolationSingleShard
This alert fires when the Hosted runners API is returning too many errors (5xx status codes) on a single shard. Two alert rules feed this alert, and either of them can fire it:
- Rule 1 (1-hour window): fires if the error rate exceeds 1.44% over both the last hour and the last 5 minutes, with sustained traffic.
- Rule 2 (6-hour window): fires if the error rate exceeds 0.6% over both the last 6 hours and the last 30 minutes, with sustained traffic.
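For orientation, here is a minimal sketch of the shape of Rule 1, assuming it is built from the `gitlab_component_shard_errors:ratio_*` and `gitlab_component_shard_ops:rate_*` recording rules referenced below. The exact expression and the sustained-traffic threshold live in the alert definition, so the operation-rate guard shown here is an assumption:

```
# Hypothetical shape of Rule 1 (1-hour window); Rule 2 uses the 6h/30m ratios and a 0.006 threshold.
  gitlab_component_shard_errors:ratio_1h{component="api_requests", type="hosted-runners"} > 0.0144
and
  gitlab_component_shard_errors:ratio_5m{component="api_requests", type="hosted-runners"} > 0.0144
and
  # Sustained-traffic guard; the real minimum request rate is an assumption here.
  gitlab_component_shard_ops:rate_1h{component="api_requests", type="hosted-runners"} >= 1
```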
## General Troubleshooting Steps
- Review the Hosted Runner Dashboards on the Customers Grafana Instance and look for any anomalies in the metrics.
- Use the Explore function on the Customers Grafana Instance to inspect the data that triggered this alert. These queries might help get you started:
```
rate(gitlab_runner_api_request_statuses_total{job="hosted-runners-prometheus-agent",shard="{{ brokenshard }}",status=~"5.."}[5m])
gitlab_component_shard_errors:ratio_6h{component="api_requests",type="hosted-runners",shard="{{ brokenshard }}"}
gitlab_component_shard_errors:ratio_30m{component="api_requests",type="hosted-runners",shard="{{ brokenshard }}"}
gitlab_component_shard_errors:ratio_1h{component="api_requests",type="hosted-runners",shard="{{ brokenshard }}"}
gitlab_component_shard_errors:ratio_5m{component="api_requests",type="hosted-runners",shard="{{ brokenshard }}"}
gitlab_component_shard_ops:rate_6h{component="api_requests",type="hosted-runners",shard="{{ brokenshard }}"}
gitlab_component_shard_ops:rate_30m{component="api_requests",type="hosted-runners",shard="{{ brokenshard }}"}
gitlab_component_shard_ops:rate_1h{component="api_requests",type="hosted-runners",shard="{{ brokenshard }}"}
gitlab_component_shard_ops:rate_5m{component="api_requests",type="hosted-runners",shard="{{ brokenshard }}"}
```
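If the first query shows 5xx responses, breaking them down by endpoint and status code can point at the failing API call. This sketch assumes the runner metric carries `endpoint` and `status` labels; adjust the grouping if the shard exposes different ones:

```
# Hypothetical breakdown: 5xx request rate per endpoint and status code on the affected shard.
sum by (endpoint, status) (
  rate(gitlab_runner_api_request_statuses_total{job="hosted-runners-prometheus-agent",shard="{{ brokenshard }}",status=~"5.."}[5m])
)
```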
fluentd_tag: "cloudwatch.kinesis.{{ brokenshard }}-manager-logs" AND message: "level=error"
fluentd_tag: "cloudwatch.kinesis.{{ brokenshard }}-manager-logs" AND message: 500
fluentd_tag: "cloudwatch.kinesis.{{ brokenshard }}-fleeting-logs" AND message: "level=error"