Inaccuracies between deployment_status SSM Parameter and state of infrastructure
Glossary
| Term | Meaning |
|---|---|
| SSM Parameter | /gitlab/dedicated/runner/{RUNNER_NAME}/deployment_status in Parameter Store in DHR AWS Account |
| deployment_status | /gitlab/dedicated/runner/{RUNNER_NAME}/deployment_status in Parameter Store in DHR AWS Account |
| healthy | shard is returning healthy on a wait-healthy deployer health check. Shard is probably processing jobs for the customer. |
| active | shard is marked as active_shard in the SSM parameter /gitlab/dedicated/runner/{RUNNER_NAME}/deployment_status |
Important Preliminary Understandings
The SSM Parameter deployment_status is an essential control for deciding which shard to provision, shut down and clean up during Zero Downtime Deployments.
The SSM Parameter deployment_status is updated as the last step in the hosted_runner_provision job. If any part of the provision job fails before the end, the SSM Parameter will not be updated and therefore may not accurately reflect the health of the shards.
We deliberately choose to only update the SSM Parameter as the very last step in a hosted_runner_provision job after a successful post-deploy health check. Given a choice between accidentally failing to mark a healthy shard as active_shard or deployed_shard, and accidentally marking an unhealthy shard or a never-deployed shard as active_shard or deployed_shard, we have a strong preference for the former.
Read deployment_status SSM Parameter for more.
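The ordering described above can be sketched as follows. This is a minimal illustration and not the deployer's actual code: `provision_then_record`, `deploy` and `is_healthy` are hypothetical names, and the shape of the returned value is assumed from the example parameter values later in this runbook. It shows that any failure before the final step leaves the recorded status untouched.

```python
from datetime import datetime, timezone
from typing import Callable


def provision_then_record(
    shard: str,
    status: dict,
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> dict:
    """Return the new status only if deploy and health check both succeed.

    Hypothetical sketch: `status` is the parsed deployment_status value and is
    never mutated, so a failure leaves the recorded state exactly as it was.
    """
    deploy(shard)  # may raise; nothing has been recorded yet
    if not is_healthy(shard):
        # Fail closed: a possibly healthy shard goes unrecorded rather than
        # risking an unhealthy shard being marked as active.
        raise RuntimeError(f"post-deploy health check failed for {shard}")

    deployed = list(status.get("deployed_shards", []))
    if shard not in deployed:
        deployed.append(shard)
    # Assumed field layout, taken from the examples in this runbook.
    return {
        "active_shard": shard,
        "deployment_ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "deployed_shards": deployed,
    }
```

The trade-off is visible in the `raise`: a shard that is actually healthy but fails the check is simply not recorded, which is the failure mode this runbook prefers.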
Initial Information Gathering
- You can verify which shards are healthy by looking at the Hosted Runners Overview dashboard in Grafana. It is generally safe to assume that shards marked as Runner Manager Status Online, or that are processing jobs, will return healthy to a deployer wait-healthy check.
- You can verify which shards are marked as active using active_shard in the deployment_status SSM Parameter.
- You can verify which shards are marked as having been previously deployed using deployed_shards in the deployment_status SSM Parameter.
- Post in the incident which shards are active, which have been previously deployed and which are healthy.
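To make that comparison concrete, here is a small sketch that parses the parameter value and labels each shard with its healthy / active / deployed combination. It rests on assumptions: `describe_shards` is a hypothetical helper, the shard names `blue` and `green` are taken from the examples in this runbook, and the set of healthy shards is supplied by hand from what you saw in Grafana. It does not call AWS; paste in the value you read from Parameter Store.

```python
import json

# Shard names are an assumption taken from the blue/green examples in this runbook.
SHARDS = ("blue", "green")


def describe_shards(raw_value, healthy_shards, shards=SHARDS):
    """Label each shard with its healthy / active / deployed combination.

    raw_value: the deployment_status parameter value as a JSON string, for
        example the output of `aws ssm get-parameter --name <parameter name>
        --query Parameter.Value --output text`.
    healthy_shards: shards you judged healthy from the Grafana dashboard.
    """
    status = json.loads(raw_value) if raw_value.strip() else {}
    active = status.get("active_shard")
    deployed = set(status.get("deployed_shards", []))

    report = {}
    for shard in shards:
        healthy = shard in healthy_shards
        is_active = shard == active
        is_deployed = shard in deployed
        if is_active:
            note = "expected" if healthy else "active but unhealthy"
        elif healthy:
            note = (
                "awaiting shutdown or provision check failed"
                if is_deployed
                else "healthy but not recorded"
            )
        else:
            note = "expected inactive" if is_deployed else "never deployed"
        report[shard] = {
            "healthy": healthy,
            "active": is_active,
            "deployed": is_deployed,
            "note": note,
        }
    return report
```

For example, `describe_shards("{}", {"blue"})` labels blue as healthy but not recorded and green as never deployed, which is the kind of summary worth posting in the incident.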
Common Inaccuracies between deployment_status SSM Parameter and state of infrastructure
A shard is healthy and is marked as active_shard
This is the expected state of the active shard.
A shard is unhealthy, and is marked as active_shard
Follow the runbook for Runners Manager is Down.
A shard is unhealthy and is NOT marked as active_shard
This is the expected state of the inactive shard. It may or may not be in deployed_shards, depending on whether it has been deployed previously. It may or may not have existing infrastructure in Terraform, depending on whether hosted_runner_cleanup has finished running.
A shard is healthy, but is NOT marked as active_shard, and is in deployed_shards
This is the normal behaviour of a shard after a successful hosted_runner_provision job, but before the graceful_shutdown has completed in the hosted_runner_shutdown job. Either run hosted_runner_shutdown or wait for an existing hosted_runner_shutdown job to complete. Then run hosted_runner_cleanup.
Alternatively, this is also what happens if the hosted_runner_provision job failed on the post-deploy healthcheck even though the shard is actually healthy. In that case, follow the runbook for Hosted_runner_provision post deploy healthcheck failed.
A shard is unhealthy and is NOT in deployed_shards
This is the expected state of a shard that has never been deployed.
A shard is healthy, but is NOT in deployed_shards
If the `hosted_runner_provision` job fails during the very first deployment of a new shard:
- but the job got far enough before failing that the shard is healthy,
- and then the second attempted execution of `hosted_runner_provision` on the same shard fails with “Inactive shard appears healthy, aborting to avoid unsafe deploy”,
- and the shard is not in `deployed_shards`,

then you may decide to manually edit the SSM parameter.
- Breakglass into the DHR Production AWS Account.
- Reconfirm the healthy, active and `deployed_shards` from Initial Information Gathering. Post this information in the incident.
- For example, if we’re trying to deploy the runner for the first time to the blue shard, the SSM parameter will look like this: `{}` (an empty JSON object). Change it to `{"active_shard":"blue","deployment_ts":"2026-03-10T19:19:53Z","deployed_shards":["blue"]}` (with an accurate timestamp). Run `hosted_runner_shutdown`, `hosted_runner_cleanup` (which will simply return `Runner RUNNER_NAME has 1 deployed shards — skipping shutdown` or `only one shard deployed, skipping`, but that’s fine and accurate at this stage), then re-run the entire `hosted_runner_deploy` pipeline (which will create the green shard and swap to `"active_shard":"green"`).
- For example, if we’re trying to deploy to the green shard for the first time (and the blue shard has been deployed to previously), the SSM parameter will look like this: `{"active_shard":"blue","deployment_ts":"2026-03-10T19:19:53Z","deployed_shards":["blue"]}`. Change it to `{"active_shard":"blue","deployment_ts":"2026-03-10T19:19:53Z","deployed_shards":["blue", "green"]}` (with an accurate timestamp). Run `hosted_runner_shutdown`, `hosted_runner_cleanup` (which will shut down the green shard) and then re-run the entire `hosted_runner_deploy` pipeline (which will recreate the green shard and swap to `"active_shard":"green"`).
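The two examples above follow the same pattern, which the sketch below reproduces so the edited value can be generated and reviewed before it is pasted into Parameter Store. `mark_deployed` is a hypothetical helper, not part of the deployer. It assumes the parameter only needs the shard added to `deployed_shards`, with `active_shard` filled in only when it is missing, and it takes the timestamp as an explicit argument so that you choose the accurate value. It does not write to AWS.

```python
import json


def mark_deployed(raw_value, shard, deployment_ts):
    """Return the edited deployment_status JSON string for a manual fix.

    raw_value: current parameter value, e.g. "{}" on a first deployment.
    shard: the healthy shard that is missing from deployed_shards.
    deployment_ts: accurate timestamp such as "2026-03-10T19:19:53Z".
    """
    status = json.loads(raw_value) if raw_value.strip() else {}

    deployed = list(status.get("deployed_shards", []))
    if shard not in deployed:
        deployed.append(shard)

    updated = dict(status)  # keep any existing fields untouched
    # Only the very first deployment has no active shard recorded yet.
    updated.setdefault("active_shard", shard)
    updated["deployment_ts"] = deployment_ts
    updated["deployed_shards"] = deployed
    return json.dumps(updated, separators=(",", ":"))
```

Calling `mark_deployed("{}", "blue", "2026-03-10T19:19:53Z")` yields the value from the first example, and calling it with the blue-only value and `"green"` yields the value from the second. Post the before and after values in the incident before applying the change.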