HostedRunnersServiceRunnerManagerDownSingleShard
This alert indicates that the GitLab Runner Manager is down or unavailable, which may cause pipeline jobs to fail or remain pending for longer than usual. There are three primary causes for this alert:
Steps to Troubleshoot
-
Check if the Customer Deleted the Runner: The customer might have intentionally or unintentionally deleted the runner from the admin side. To confirm:
-
Access the GitLab Rails console.
-
Run the following command (replace $RUNNER_ID with the actual runner ID from the hosted runner dashboard):
Ci::RunnerManager.where(runner_id: $RUNNER_ID)
-
If the command returns an empty result ([]), meaning no runner manager records exist for that ID, the customer has deleted the runner. In this case:
- Action: Communicate with the customer to confirm the reason for deletion.
- Fix: The best option is to deprovision the runner and create a new one via the Switchboard UI, which will also generate a new token.
-
The EC2 node hosting the Runner Manager is down or absent: This is most likely related to the tenant maintenance window, during which a new VM for the Runner Manager is provisioned. If provisioning takes more than 5 minutes, the alert is triggered. If the issue isn’t related to the maintenance window, running another provision job will create a new Runner Manager.
-
The Runner Manager encountered issues: Check the logs in the tenant’s OpenSearch dashboard. Filter the logs using the following Fluentd tag, then analyze them for errors to determine the root cause.
fluentd_tag: cloudwatch.${RUNNER_NAME}-fleeting-logs
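The two lookups above can be sketched as small Ruby helpers. This is only an illustrative sketch: the helper names fleeting_logs_tag and runner_deleted? and the runner name example-runner are hypothetical and not part of GitLab. It assumes the Rails console query from the first step returns a collection that responds to empty?, as an ActiveRecord relation or a plain Array does.

```ruby
# Hypothetical triage helpers; the names are illustrative, not part of GitLab.

# Builds the Fluentd tag used to filter Runner Manager logs in OpenSearch,
# following the pattern cloudwatch.${RUNNER_NAME}-fleeting-logs.
def fleeting_logs_tag(runner_name)
  raise ArgumentError, "runner name must not be empty" if runner_name.to_s.strip.empty?

  "cloudwatch.#{runner_name}-fleeting-logs"
end

# Interprets the result of the Rails console query from the first step.
# `records` is anything that responds to `empty?`, for example the result of
# Ci::RunnerManager.where(runner_id: ...) or a plain Array.
# An empty result suggests the runner was deleted.
def runner_deleted?(records)
  records.empty?
end

# "example-runner" is a placeholder runner name.
puts "fluentd_tag: #{fleeting_logs_tag('example-runner')}"
```

In a Rails console, the second helper could be given the query result directly, for example runner_deleted?(Ci::RunnerManager.where(runner_id: 123)), where 123 stands in for the real runner ID.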