# hosted_runner_provision post deploy healthcheck failed

## Important Preliminary Understandings

First, know that it is very likely that only the inactive shard of the Dedicated Hosted Runner (DHR) is experiencing a problem, while the active shard is likely continuing to process jobs. You can verify this by looking at the Hosted Runners Overview dashboard and confirming that the active shard is still processing jobs. As a general rule of thumb, it is safe to re-run hosted_runner_provision or other failed jobs in a hosted runner deployment.
Once hosted_runner_provision succeeds, follow the instructions under "Once hosted_runner_provision succeeds" below to make sure you leave the deployment tidy.
## Overview of troubleshooting paths

- Open the job logs for the failed hosted_runner_provision job in Switchboard and read the error.
- Rerun hosted_runner_provision. If that doesn't work:
- Assuming that the Hosted Runner Provision post-deploy healthcheck failed, but no terraform errors are present in the logs, your first path is to trigger a recreation of the relevant terraform resources.
- If recreating the terraform resources and rerunning hosted_runner_provision does not cause the post-deploy healthcheck to pass, your second path is to breakglass in and troubleshoot why the gitlab-runner binary in the gitlab-runner container on the runner manager is not returning healthy.
## Recreating the relevant terraform resources

There are multiple methods for recreating the relevant terraform resources. They are presented here ordered from easiest to hardest. If any one of these methods successfully recreates the relevant terraform resources, as evidenced by the terraform logs in hosted_runner_provision, there is no need to try the other resource recreation methods.

NOTE: If you do not have a specific reason to believe otherwise, it is usually ok to proceed with the assumption that the resource which is most relevant to recreate is `{INACTIVE_SHARD}-{RUNNER_NAME}_runner-manager`.
- Rerunning hosted_runner_provision
- Running hosted_runner_cleanup, then rerunning hosted_runner_provision
- Manually deleting the relevant terraform resources via the AWS console, then rerunning hosted_runner_provision
- Breaking glass into an amp pod, tainting the relevant terraform resources in the terraform state, then rerunning hosted_runner_provision
### Rerunning hosted_runner_provision

NOTE: This method will only create the resources that the terraform state does not already believe exist, so you will not know whether the resources you are concerned with will be created without either a) trying it (good option, the job can be rerun idempotently) or b) breaking glass and checking the terraform state (bad option, unnecessary extra effort at this stage).

- Check the job logs for the failed hosted_runner_provision job in Switchboard.
- If the last attempt at the hosted_runner_provision job logged `Apply complete! Resources: 0 added, 0 changed, 0 destroyed.` and you have not taken any other action since that last job, then do not bother re-running hosted_runner_provision, as terraform will make no changes. Select a different method to recreate resources instead. Otherwise:
- Rerun hosted_runner_provision.
- Check the job logs to see if it created the relevant resources.
### Running hosted_runner_cleanup, then rerunning hosted_runner_provision

NOTE: This method will recreate all the terraform resources for the inactive shard in the provision stage.

NOTE: We never run hosted_runner_cleanup without first running hosted_runner_shutdown UNLESS we know that the inactive runner is in an unhealthy state, as evidenced by multiple post-deploy healthcheck failures in hosted_runner_provision on the same inactive shard.

- Check the job logs for the last two failed hosted_runner_provision jobs in Switchboard. Note that in both cases Transistor was attempting to deploy to the same inactive shard, which failed its post-deploy healthcheck both times. The fact that hosted_runner_provision continues to attempt to deploy to the same shard proves that shard is the one marked as inactive in the ssm parameter, and the fact that the post-deploy healthcheck continues to fail proves that the inactive shard is in an unhealthy state.
- Run the hosted_runner_cleanup job. This will target that same inactive shard and delete that shard's terraform resources. The job should log something like `Destroy complete! Resources: 13 destroyed.`
- Rerun the hosted_runner_provision job. The job should log something like `Apply complete! Resources: 13 added, 0 changed, 0 destroyed.`
### Manually deleting the relevant cloud resources via the AWS console, then rerunning hosted_runner_provision

NOTE: This method requires breakglass access into the DHR Production AWS Account.

NOTE: This method as written will only recreate `{INACTIVE_SHARD}-{RUNNER_NAME}_runner-manager`; however, you may adapt it by deleting other resources in the inactive shard in the provision stage as well if required.

- Breakglass into the DHR Production AWS Account.
- Get very clear about which runner stack and shard is going to be deleted, and declare it in the incident.
- Announce your intention in the incident to delete `{INACTIVE_SHARD}-{RUNNER_NAME}_runner-manager`.
- Navigate to EC2 Instances in AWS, click on `{INACTIVE_SHARD}-{RUNNER_NAME}_runner-manager`, and use Instance State > Terminate (delete) instance to delete it. (A CLI alternative is sketched after this list.)
- Rerun hosted_runner_provision, which should recreate `{INACTIVE_SHARD}-{RUNNER_NAME}_runner-manager`.
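If you prefer the CLI over the console, the same deletion can be done with the AWS CLI from your breakglass session. This is only a hedged sketch, not the documented procedure: the instance ID is hypothetical, and it assumes the instance's Name tag matches the runner-manager name.

```bash
# Hedged sketch: assumes the EC2 instance's Name tag matches the runner-manager name.
INSTANCE_NAME="pink-example-runner_runner-manager"   # hypothetical {INACTIVE_SHARD}-{RUNNER_NAME}_runner-manager

# Look up the instance and double-check it before touching anything.
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=${INSTANCE_NAME}" \
  --query 'Reservations[].Instances[].{Id:InstanceId,Name:Tags[?Key==`Name`]|[0].Value,State:State.Name}' \
  --output table

# Terminate only after announcing the exact instance ID in the incident.
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0   # hypothetical instance ID
```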
### Breaking glass into an amp pod, tainting the relevant terraform resources in the terraform state, then rerunning hosted_runner_provision

NOTE: This method requires breakglass access into the Hub Production AWS Account.

NOTE: This method as written will only recreate `{INACTIVE_SHARD}-{RUNNER_NAME}_runner-manager`; however, you may adapt it by tainting other resources in the inactive shard's terraform state in the provision stage as well if required.

- Get very clear about which runner stack and shard is going to be tainted, and declare it in the incident.
- Initialize the terraform state prior to manual state operations via amp.
- Run `terraform state show "module.runner_manager.module.ec2[0].aws_instance.runner_manager"` and confirm that it is the resource you want to taint.
- Run `terraform taint "module.runner_manager.module.ec2[0].aws_instance.runner_manager"`. (See the sketch after this list.)
- Rerun hosted_runner_provision, which should recreate `{INACTIVE_SHARD}-{RUNNER_NAME}_runner-manager`.
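For orientation, here is a minimal sketch of that state surgery once terraform has been initialized inside the amp pod. The init and workspace mechanics come from the amp tooling and are not reproduced here.

```bash
# Run inside the amp pod after terraform has been initialized against the
# provision-stage state for the inactive shard (init mechanics omitted).

# Inspect the resource first and confirm it is the runner manager you mean to replace.
terraform state show "module.runner_manager.module.ec2[0].aws_instance.runner_manager"

# Mark it for recreation; this only edits state metadata, nothing is destroyed yet.
terraform taint "module.runner_manager.module.ec2[0].aws_instance.runner_manager"

# Optional sanity check: the plan should show the runner manager being replaced.
terraform plan
```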
## Troubleshoot why the gitlab-runner binary in the gitlab-runner container on the runner manager is not returning healthy

- Get very clear about which runner stack and shard is going to be investigated, and declare it in the incident.
- Breakglass into the DHR Production AWS Account.
- Troubleshoot the gitlab-runner binary in the gitlab-runner container on the runner manager. (A starting point is sketched below.)
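There is no single prescribed procedure here, but a minimal inspection sketch might look like the following. It assumes the runner manager is reachable via SSM Session Manager and that the runner runs in a container literally named gitlab-runner; both are assumptions to verify on the host.

```bash
# Hedged sketch: the instance ID, the container runtime, and the container name are assumptions.
aws ssm start-session --target i-0123456789abcdef0   # hypothetical runner-manager instance ID

docker ps                                        # is the gitlab-runner container running at all?
docker logs --tail 100 gitlab-runner             # recent output from the runner process
docker exec gitlab-runner gitlab-runner verify   # can the runner authenticate its registrations with GitLab?
docker exec gitlab-runner gitlab-runner list     # which runner configurations are registered?
```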
### If you decide to manually start the gitlab-runner binary on the inactive shard

If you manually start the gitlab-runner binary on the inactive shard, these things will happen:

- The inactive shard will immediately begin to process jobs for the customer. Since the active shard is still processing jobs as well, both shards are now processing customer jobs.
- If you delete the runner manager or stop the gitlab-runner binary on the inactive shard after successfully starting it, the customer may experience this as downtime, failed jobs, errors, etc.

Consider these risks carefully and make sure other, less potentially disruptive options have been exhausted or are not feasible before manually starting the gitlab-runner binary on the inactive shard.
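If you do decide to proceed, the start mechanism depends on how the runner manager is provisioned, so treat the following as a hedged sketch rather than the documented command.

```bash
# Both lines are assumptions about the host layout; confirm which applies before running anything.
docker start gitlab-runner            # if the runner lives in a stopped container named gitlab-runner
sudo systemctl start gitlab-runner    # if it is instead managed as a systemd service on the host
```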
### If you attempt to manually start the gitlab-runner binary on the inactive shard and it fails

- Continue troubleshooting.
### If you attempt to manually start the gitlab-runner binary on the inactive shard and it succeeds

If you attempt to manually start the gitlab-runner binary on the inactive shard and it succeeds, you still need to achieve a clean deployment of hosted_runner_provision and mark the newly healthy shard as active. For ease of understanding, pretend the inactive shard you have been troubleshooting is the pink shard, and the active shard which has been processing jobs this whole time is the purple shard.

- Use Grafana to confirm that the pink shard is now picking up jobs.
- Edit the ssm parameter `"/gitlab/dedicated/runner/{RUNNER_NAME}/deployment_status"` to mark the pink shard as active and the purple shard as inactive. (A CLI sketch follows this list.)
- Run hosted_runner_shutdown and hosted_runner_cleanup, which will shut down and clean up the purple shard.
Now you are (theoretically) in the state you would have been in if the hosted runner maintenance had never failed. The shards have flipped: the pink shard is now active, and the purple shard is now inactive. However, you have not yet achieved a clean deployment of hosted_runner_provision. So to finalize, we suggest:

- Rerun the entire hosted_runner_deploy pipeline to get a clean execution of prepare, onboard, provision, shutdown and cleanup.
## Other Utility Steps

### Clarifying which runner stack and shard is to be edited

Before making any changes to infrastructure after breaking glass into a production environment, I recommend getting very clear about which runner stack and shard is going to be edited. This is assumed to be the inactive shard of the relevant runner stack.

- State clearly in the incident which runner stack you are troubleshooting. There may be multiple runner stacks in a single DHR AWS account, and it is important to only change the correct runner stack.
- Check the job logs for the last two failed hosted_runner_provision jobs for that runner stack in Switchboard. Confirm that in both cases Transistor was attempting to deploy to the same inactive shard, and that it failed its post-deploy healthcheck both times. The fact that hosted_runner_provision continues to attempt to deploy to the same shard proves that shard is the one marked as inactive in the ssm parameter, and the continuously failing post-deploy healthcheck proves that the shard is in an unhealthy state.
- Breakglass into the DHR Production AWS Account.
- Confirm, using the ssm parameter `"/gitlab/dedicated/runner/{RUNNER_NAME}/deployment_status"`, which shard is currently inactive (see the sketch after this list). If the hosted_runner_provision job logs and the ssm parameter do not agree on which shard is inactive, STOP HERE AND DO NOT PROCEED. You will need to use metrics to confirm which shard for that runner stack (if any) is currently processing jobs. Otherwise:
- State clearly in the incident which runner shard you have confirmed is inactive.
- Proceed with planned changes.
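A minimal read-only check of that parameter from the breakglass session might look like this, assuming the AWS CLI is available in your session; substitute the real runner name for {RUNNER_NAME}.

```bash
# Read-only: prints the deployment_status value so you can see which shard is marked inactive.
aws ssm get-parameter \
  --name "/gitlab/dedicated/runner/{RUNNER_NAME}/deployment_status" \
  --query 'Parameter.Value' --output text
```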
## Once hosted_runner_provision succeeds

After you rerun provision successfully, please also always run shutdown and cleanup so that we don't waste money and cause confusion by having both colours live at the same time.
If you had to do some serious shenanigans to get a successful run of hosted_runner_provision, I would advise re-running the entire hosted_runner_deploy pipeline for that runner stack and getting a clean successful maintenance before moving on. This is especially relevant if you had to fix an infrastructure issue on one shard, as the same problem may be present on the other shard.