provision_pre_deploy_healthcheck_failed
hosted_runner_provision pre deploy healthcheck failed with Inactive shard appears healthy, aborting to avoid unsafe deploy
Quick Fix: Try jumping straight to Graceful Shutdown and Redeploy and see if that resolves your problem. If not, return to the top of the runbook and work through it systematically.
Glossary
| Term | Meaning |
|---|---|
| SSM Parameter | /gitlab/dedicated/runner/{RUNNER_NAME}/deployment_status in Parameter Store in the DHR AWS Account |
| deployment_status | /gitlab/dedicated/runner/{RUNNER_NAME}/deployment_status in Parameter Store in the DHR AWS Account |
| healthy | The shard is returning healthy on a wait-healthy deployer health check. The shard is probably processing jobs for the customer. |
| active | The shard is marked as active_shard in the SSM parameter /gitlab/dedicated/runner/{RUNNER_NAME}/deployment_status |
Important Preliminary Understandings
If the hosted_runner_provision pre-deploy healthcheck returns Inactive shard appears healthy, aborting to avoid unsafe deploy, this means that a shard which is NOT marked as active_shard in the deployment_status SSM parameter is returning healthy to a wait-healthy deployer check, so hosted_runner_provision will fail. This is an important safeguard in Zero Downtime Deployments: it ensures we never deploy over a healthy shard which could be processing jobs.
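The safeguard above can be sketched as a small predicate. This is an illustrative sketch, not the actual hosted_runner_provision code; the active_shard field name follows the glossary above, and find_unsafe_shards is a hypothetical helper:

```python
# Sketch of the pre-deploy safeguard described above. NOT the real
# hosted_runner_provision implementation; shapes are assumed from this runbook.

def find_unsafe_shards(active_shard: str, healthy_shards: set[str]) -> set[str]:
    """Return healthy shards that are NOT marked as active_shard.

    Any such shard would trigger "Inactive shard appears healthy, aborting
    to avoid unsafe deploy": deploying over it could interrupt running jobs.
    """
    return healthy_shards - {active_shard}

# Example: shard-b is healthy but shard-a is the active shard,
# so the pre-deploy healthcheck would abort.
print(find_unsafe_shards("shard-a", {"shard-a", "shard-b"}))  # {'shard-b'}
```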
Regardless of the status of the inactive shard of the Dedicated Hosted Runner (DHR), there may also be an active shard which is likely healthy and continuing to process jobs. You can verify this by looking at the Hosted Runners Overview dashboard and confirming that the active shard is still actually processing jobs.
Our goal is to get you into a state where:
- you only have 1 healthy shard processing jobs for the customer,
- that shard is marked active in the SSM Parameter, and
- you have had a successful clean run of hosted_runner_deploy.
However, we want to reach that state without causing customer-facing errors or runner downtime. Ideally we gracefully shut down the inactive shard (do NOT suddenly delete or destroy it) and then rerun the whole hosted_runner_deploy pipeline.
Once hosted_runner_provision succeeds, follow the instructions under Once hosted_runner_provision succeeds to make sure you leave the deployment tidy.
Initial information gathering
You can verify which shards are healthy by looking at the Hosted Runners Overview dashboard. It is generally safe to assume that shards which are marked as Runner Manager Status Online OR are processing jobs will be returning healthy to a deployer wait-healthy check.
You can verify which shards are marked as active using active_shard in the deployment_status SSM parameter.
You can verify which shards are marked as having been previously deployed using deployed_shards in the deployment_status SSM parameter.
Post in the incident which shards are active, which have been previously deployed, and which are healthy.
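The information gathering above can be sketched as follows. The field names active_shard and deployed_shards come from this runbook, but the exact JSON shape of the parameter value is an assumption; in the real account you would fetch it (for example with the AWS CLI's aws ssm get-parameter) rather than hardcode it:

```python
import json

# Assumed payload shape for the deployment_status SSM parameter; in practice
# you would fetch it after breakglass, e.g. with the AWS CLI:
#   aws ssm get-parameter --name "/gitlab/dedicated/runner/<RUNNER_NAME>/deployment_status"
# (Illustrative invocation only; check the account's conventions.)
raw = '{"active_shard": "shard-a", "deployed_shards": ["shard-a", "shard-b"]}'

status = json.loads(raw)
active = status["active_shard"]
deployed = status["deployed_shards"]

# Post these in the incident alongside the healthy shards from the dashboard.
print(f"active shard:          {active}")
print(f"deployed shards:       {deployed}")
print(f"inactive but deployed: {[s for s in deployed if s != active]}")
```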
Overview of troubleshooting paths
- Graceful Shutdown and Redeploy
- Forceful provision with --ignore-shard-healthcheck true. Only try this if the Graceful Shutdown path is unsuccessful, as it may cause customer-facing errors or runner downtime.
Graceful Shutdown and Redeploy
- Run the hosted_runner_shutdown and then hosted_runner_cleanup jobs in Switchboard to gracefully shut down and then delete the resources for the inactive shard. The hosted_runner_shutdown and hosted_runner_cleanup jobs should accurately choose the inactive shard to shut down and clean up.
  - If the hosted_runner_shutdown and/or hosted_runner_cleanup jobs fail, troubleshoot. Otherwise, proceed.
- Read the logs to see if hosted_runner_shutdown and hosted_runner_cleanup actually performed a shutdown and cleanup. The logs should say Shard has been {shutdown/cleaned up}.
  - If hosted_runner_shutdown and hosted_runner_cleanup passed but returned Runner RUNNER_NAME has 1 deployed shards — skipping shutdown or only one shard deployed, skipping, then no graceful shutdown has actually occurred and the inactive shard is still healthy. In that case, there is likely a mismatch between the actual health of the infrastructure and the state of the deployment_status SSM parameter. Swap now to the troubleshooting guide Inaccuracies between deployment_status SSM Parameter and state of infrastructure instead.
  - If hosted_runner_shutdown and hosted_runner_cleanup actually performed a shutdown and cleanup, you can now rerun the hosted_runner_deploy pipeline and we would expect it to succeed.
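The log check in the steps above can be sketched as a tiny classifier. This is a hypothetical helper; the matched substrings are the log messages quoted in this runbook, and real job logs may vary:

```python
# Hypothetical check for the log-reading step above. The matched substrings
# are the messages quoted in this runbook; real job output may differ.
def shutdown_actually_happened(log_text: str) -> bool:
    skipped = ("skipping shutdown" in log_text
               or "only one shard deployed, skipping" in log_text)
    performed = ("Shard has been shutdown" in log_text
                 or "Shard has been cleaned up" in log_text)
    return performed and not skipped

print(shutdown_actually_happened("... Shard has been shutdown ..."))  # True
# A skip message means no graceful shutdown occurred: swap runbooks.
print(shutdown_actually_happened("Runner X has 1 deployed shards — skipping shutdown"))  # False
```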
Forceful provision with --ignore-shard-healthcheck true
- Read Manually Running Provision, Onboard, Shutdown or Cleanup.
- Read Explanation of each flag in ZDD, especially the sections Potential Pitfalls of flag use and --ignore-shard-health Provision. Only then proceed with awareness.
- Breakglass into the DHR Production AWS Account.
- Bring up an amp operator shell on a DHR account in the provision state.
- Run transistor provision --ignore-shard-health true
- Follow the instructions under Once hosted_runner_provision succeeds to make sure you leave the deployment tidy.