HostedRunnersServiceCiRunnerJobsErrorSLOViolationSingleShard
This alert indicates that jobs are failing due to runner system failures. These failures are often related to the runner infrastructure, fleeting plugin, auto-scaling issues, or network problems. The Failed Job Errors chart can be used to confirm the issue.
Possible Causes
- Runner infrastructure issues
- Docker/fleeting auto-scaling problems
- Network-related failures
General Troubleshooting Steps
- Check AWS network status
- Check AWS auto-scaling activity status
  - Review the status of AWS fleeting nodes to ensure they are scaling correctly and not causing failures.
- Review GitLab Runner logs in OpenSearch
  - Use the OpenSearch dashboard to examine gitlab-runner logs for any system failures or errors.
  - If OpenSearch logging is not enabled (e.g., for customers without OpenSearch logging), SSM into the runner manager instance and check the logs directly with the command:
      sudo journalctl -u gitlab-runner
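When reading the logs directly on the runner manager, it can help to narrow the output to recent failures rather than paging through the full journal. The following is a minimal sketch, not an official procedure. The function names, the one-hour window, and the grep pattern are all assumptions made for illustration, and the pattern should be adjusted to match the log format actually seen on the instance.

```shell
#!/usr/bin/env bash
# Sketch only: function names, time window, and pattern are illustrative assumptions.

# Keep only log lines that look like errors or system failures.
# Reads log text on stdin, so it works with journalctl output or a saved log file.
# "|| true" keeps the exit status zero when nothing matches.
filter_runner_failures() {
  grep -iE 'system failure|level=(error|fatal)|ERROR:|FATAL:' || true
}

# Print the last hour of gitlab-runner unit logs, filtered to likely failures.
# Intended to be run on the runner manager instance (for example, after connecting via SSM).
recent_runner_failures() {
  sudo journalctl -u gitlab-runner --since "1 hour ago" --no-pager | filter_runner_failures
}
```

Calling `recent_runner_failures` on the runner manager prints only the matching lines. A saved log can be checked the same way with `filter_runner_failures < saved.log`, where `saved.log` is a hypothetical file name.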
If the logs contain relevant information, the GitLab Runner troubleshooting documentation may help you resolve specific issues.