Troubleshooting `data-server-rebuild-ansible` Pipeline Failures
The data-server-rebuild-ansible
pipeline periodically rebuilds the Patroni “data analytics” replica VMs in the
gitlab-db-benchmarking GCP project from fresh production snapshots. For each
shard (main, ci, sec, test), the pipeline refreshes the data disk on the
corresponding patroni-*-data-analytics-01-db-db-benchmarking VM in us-east1-c
and then promotes it out of standby so analytics queries hit reasonably recent data.
These are internal data-analytics replicas, not customer-facing. Failures typically map to severity-4 and do not affect production traffic. There is no alerting tied to this pipeline; incidents are filed manually after a failed pipeline is noticed.
Mental Model
The pipeline runs three playbooks in order, against a target VM that is expected to be already running:
| Stage | Playbook | Where it runs | What it does |
|---|---|---|---|
| replacedisk | disableservices.yml | on the VM | Wait for SSH (1200s timeout), stop patroni/pgbouncer |
| replacedisk | gcpdiskreplace.yml | on localhost | Stop VM, detach …-data disk, delete it, create from snapshot, attach, start VM |
| postscript | rebuild.yml | on the VM | Wait for SSH, reset patroni state, start postgres, promote out of standby |
Only gcpdiskreplace.yml ever stops or starts the VM. No playbook is designed
to cold-start a VM that is already TERMINATED when the pipeline begins.
disableservices.yml’s first task is wait_for on port 22 with a 1200s
(20-minute) timeout, so if the VM is off, every retry waits 20 minutes before
the pipeline gives up.
The shard-to-VM and shard-to-source-snapshot mappings live in
vars.yml
under instance_names: and snapshot_source_disks:. Look them up directly
rather than relying on a copy here:
    # From a checkout of data-server-rebuild-ansible:
    yq '.instance_names, .snapshot_source_disks' vars.yml

Symptoms
Common error messages in the failing job log, mapped to a cause section below:
- `Timeout when waiting for search string OpenSSH in <hostname>:22`: the 20-minute SSH wait failed (cold-start; see Target VM is TERMINATED).
- `Target GCE VM '…' is in state 'TERMINATED' (expected: RUNNING)`: preflight refusing to wait for SSH (same cause; the fast-fail variant after MR !100).
- `Invalid value for field 'resource.sizeGb': 'XXX'. Requested disk size cannot be smaller than the snapshot size (YYY GB)`: `vars.yml` size mismatch (see Disk size smaller than snapshot).
- `Could not translate host name "master.patroni.service.consul" to address`: patroni stuck trying to reach a deleted primary (see Patroni stuck on standby’s missing primary DNS).
- WAL-replay timeout from `wait for "<host>" startup scripts to finish`: see Other failure modes.
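As a rough aid, the mapping above can be sketched as a small shell helper. The function name `classify_failure` and the short labels it prints are hypothetical; only the matched error fragments come from the list above.

```shell
# Hypothetical helper: map one line of a failing job log to a cause section.
# The labels printed here are illustrative and not part of the pipeline.
classify_failure() {
  case "$1" in
    *"Timeout when waiting for search string OpenSSH"*)
      echo "vm-terminated" ;;
    *"is in state 'TERMINATED' (expected: RUNNING)"*)
      echo "vm-terminated" ;;
    *"Requested disk size cannot be smaller than the snapshot size"*)
      echo "disk-size" ;;
    *"master.patroni.service.consul"*)
      echo "standby-dns" ;;
    *"startup scripts to finish"*)
      echo "other" ;;
    *)
      echo "unknown" ;;
  esac
}
```

A possible use is `classify_failure "$(tail -n 1 job.log)"`, where `job.log` stands for a locally saved copy of the job log.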
Target VM is TERMINATED at pipeline start
This is the most common case. The VM was previously stopped (a manual
action, a failed prior pipeline mid-flight, GCP maintenance, or zone
resource exhaustion at the moment of start) and nothing has restarted it
since. The playbook then sits at the `wait for host to be up` task for 20
minutes per attempt.
For a recent example see the production incident gitlab-com/gl-infra/production#21938.
Identify
    gcloud --project=gitlab-db-benchmarking compute instances describe \
      <instance-name> --zone=us-east1-c \
      --format='value(status,lastStartTimestamp,lastStopTimestamp)'

If status is TERMINATED (or STOPPING/STAGING), this is the cause.
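The status check can be scripted along these lines. This is a minimal sketch: the function name `is_cold_start_cause` and the `INSTANCE` variable are hypothetical, and the gcloud call is left commented out because it needs project access.

```shell
# Hypothetical helper: succeed when a GCE instance status matches this cause.
# Pass the status field printed by the describe command.
is_cold_start_cause() {
  case "$1" in
    TERMINATED|STOPPING|STAGING) return 0 ;;  # VM is not up, so the SSH wait cannot succeed yet
    *) return 1 ;;
  esac
}

# Example wiring (requires gcloud and access to the project; not run here):
# status="$(gcloud --project=gitlab-db-benchmarking compute instances describe \
#   "$INSTANCE" --zone=us-east1-c --format='value(status)')"
# is_cold_start_cause "$status" && echo "start the VM before retrying"
```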
Resolution
Start the VM and re-run the failing job:

    gcloud --project=gitlab-db-benchmarking compute instances start \
      <instance-name> --zone=us-east1-c

The boot disk and …-log disk persist across stops; only …-data is
recreated by the pipeline. Starting the VM is non-destructive: its old data
disk (if still attached) will be wiped and replaced as part of
gcpdiskreplace.yml.
After the VM reaches RUNNING, retry the failing pipeline job. The
replacedisk stage will pick up where it left off; gcpdiskreplace.yml’s
detach/delete steps are idempotent (failed_when guards on "is not attached to instance" and "was not found").
NOTE: Retrying the failing job alone will not start the VM. The
pipeline has no logic to start a TERMINATED VM, so each retry hits the
same 1200s SSH timeout until the VM is up.
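To avoid retrying the job before the VM is up, one option is to poll the instance status first. The sketch below is an assumption-laden illustration: `wait_for_running`, its arguments, and the `get_status` wrapper are hypothetical names, and the status command is passed in so the loop can be exercised without gcloud.

```shell
# Hypothetical helper: poll until a status command prints RUNNING.
# $1: name of a command or function that prints the instance status on stdout
# $2: maximum number of polls (default 30)
# $3: seconds to sleep between polls (default 10)
wait_for_running() {
  local status_cmd="$1"
  local max_polls="${2:-30}"
  local delay="${3:-10}"
  local status=""
  local i=1
  while [ "$i" -le "$max_polls" ]; do
    status="$($status_cmd)"   # intentionally unquoted: runs the named command
    if [ "$status" = "RUNNING" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still '$status' after $max_polls polls" >&2
  return 1
}

# Example wiring (requires gcloud; not run here):
# get_status() { gcloud --project=gitlab-db-benchmarking compute instances describe \
#   "$INSTANCE" --zone=us-east1-c --format='value(status)'; }
# wait_for_running get_status 30 10 && echo "retry the failing job now"
```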
Disk size smaller than snapshot
GCE refuses to create a disk smaller than its source snapshot. If
production disks are resized but vars.yml is not updated for every shard
that sources from them, the next rebuild fails at the
create "<instance>-data" disk task with:
    ERROR: (gcloud.compute.disks.create) Could not fetch resource:
     - Invalid value for field 'resource.sizeGb': 'XXX'. Requested disk size cannot be smaller than the snapshot size (YYY GB)

Identify
Compare snapshot_source_disk_size[shard] in vars.yml to the actual
source snapshot size:
    # Find the source snapshot the pipeline uses
    SOURCE_DISK="$(yq '.snapshot_source_disks.<shard>' vars.yml)"  # e.g. patroni-sec-v17-02-db-gprd

    # Get the latest READY snapshot for that source disk
    gcloud compute snapshots list --project=gitlab-production \
      --filter="status~READY AND sourceDisk~${SOURCE_DISK}" \
      --sort-by=~creationTimestamp --limit=1 \
      --format='value(name,diskSizeGb)'

If the snapshot’s diskSizeGb exceeds the configured shard size in
vars.yml, this is the cause.
NOTE: test shard sources from sec’s gprd disk, so resizing sec
requires bumping both sec and test entries in vars.yml.
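The comparison itself is simple enough to script. A minimal sketch follows, assuming both sizes are whole numbers of GB; the function name `check_disk_size` and its output strings are hypothetical.

```shell
# Hypothetical helper: compare the configured size with the snapshot size (both in GB).
# Prints "ok" when the configured size is large enough, otherwise the minimum needed.
check_disk_size() {
  local configured_gb="$1"
  local snapshot_gb="$2"
  if [ "$configured_gb" -ge "$snapshot_gb" ]; then
    echo "ok"
  else
    echo "too small: bump to at least ${snapshot_gb} GB"
  fi
}
```

For example, `check_disk_size 100 200` reports that the configured value needs to rise to at least 200 GB, while equal sizes pass because GCE only rejects a disk that is smaller than the snapshot.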
Resolution
- Open an MR bumping the affected shard’s snapshot_source_disk_size[<shard>] to match (or exceed) the snapshot size. See MR !101 for an example.
- Merge and re-run the pipeline.
IMPORTANT: If create failed, the pipeline already ran stop,
detach, and delete. The VM is now TERMINATED and its …-data disk is
gone. Re-running the failing replacedisk job specifically will succeed
past those steps (idempotent guards) and create the new disk at the
corrected size. Do not re-run the whole pipeline from
disableservices.yml: the VM will still be TERMINATED until
gcpdiskreplace.yml’s final start step, so a re-run from the top would
immediately hit the TERMINATED VM cause above.
Patroni stuck on standby’s missing primary DNS
A snapshot taken from a standby cluster may reference a primary
(master.patroni.service.consul) that does not exist in db-benchmarking
DNS. Postgres fails to start and patroni restarts it in a tight loop until
the `monitoring until ready` task times out.
rebuild.yml already handles this by promoting out of standby
(gitlab-patronictl edit-config -s standby_cluster=null --force). If that
promotion task itself fails, run it manually on the target VM:
    ssh data-analytics-refresh@<instance>.c.gitlab-db-benchmarking.internal \
      "sudo gitlab-patronictl edit-config -s standby_cluster=null --force"

Then re-run the ansible-deploy job to continue from the `wait for "<host>" state to be running` task.
Other failure modes
Less common, but historically observed:

- ZONE_RESOURCE_POOL_EXHAUSTED on start: wait for capacity to free, or retry. To list recent occurrences in the audit log:

      gcloud logging read \
        'protoPayload.status.message="ZONE_RESOURCE_POOL_EXHAUSTED" AND protoPayload.methodName="v1.compute.instances.start"' \
        --project=gitlab-db-benchmarking --freshness=30d --limit=20 \
        --format='value(timestamp,protoPayload.resourceName)'

- WAL-replay timeout during `monitoring until ready`: Postgres is replaying a large amount of WAL after restoring from snapshot. Bump patroni_ready_timeout[<shard>] in vars.yml if persistent, or wait it out and retry.
- Snapshot not READY: the `wait for snapshot to be ready` task retries 30× at 60s. If a snapshot is still being taken in production this can happen; usually self-resolves on retry.
Useful commands
List all rebuild VMs and their current state:

    gcloud compute instances list --project=gitlab-db-benchmarking \
      --filter="name~analytics-01-db-db-benchmarking" \
      --format="table(name,zone,status,lastStartTimestamp)"
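To flag only the VMs that would trip the SSH wait, the listing can be filtered. This is a sketch under two assumptions: the helper name `list_not_running` is hypothetical, and the input is tab-separated `name` and `status` fields, as gcloud's `value(name,status)` format is expected to produce.

```shell
# Hypothetical helper: read "name<TAB>status" lines on stdin and print the
# names whose status is anything other than RUNNING.
list_not_running() {
  awk -F '\t' '$2 != "RUNNING" { print $1 }'
}

# Example wiring (requires gcloud; not run here):
# gcloud compute instances list --project=gitlab-db-benchmarking \
#   --filter="name~analytics-01-db-db-benchmarking" \
#   --format="value(name,status)" | list_not_running
```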