
Measuring Recovery Activities

While testing our recovery processes for zonal and regional outages, we record timing information. There are currently three timing categories:

  1. Fleet-specific VM recreation time
  2. Component-specific DR restore process time
  3. Total DR restore process time

VM provision time is the time from when an apply is performed from an MR to create new VMs until we record a successful bootstrap script completion. In the bootstrap logs (or console output), look for "Bootstrap finished in X minutes and Y seconds". When many VMs are provisioned, use the last VM to complete as the measurement.

Bootstrap time covers the bootstrap script that each new VM executes during provisioning. The script may restart the VM, so this measurement can span multiple boots. This script can help measure the bootstrap time. It can be collected for all VMs during a gameday, or for a random VM if we are creating many VMs.
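As a rough illustration of the log lookup described above, the following Python sketch extracts the bootstrap duration from each VM's log text and picks the slowest VM. It is not the script referenced in this document. The exact log wording is assumed from the message quoted above, and the function names and VM names are hypothetical. Because a bootstrap may span multiple boots, the sketch assumes the last matching line in a log is the one to report.

```python
import re
from datetime import timedelta

# Assumed log format, based on the message quoted in this document.
BOOTSTRAP_RE = re.compile(
    r"Bootstrap finished in (\d+) minutes? and (\d+) seconds?"
)


def bootstrap_duration(log_text):
    """Return the last reported bootstrap duration in log_text, or None."""
    matches = BOOTSTRAP_RE.findall(log_text)
    if not matches:
        return None
    minutes, seconds = matches[-1]
    return timedelta(minutes=int(minutes), seconds=int(seconds))


def slowest_vm(logs_by_vm):
    """Return (vm_name, duration) for the VM with the longest bootstrap.

    logs_by_vm maps a VM name to its bootstrap log or console output.
    VMs without a completion message are skipped.
    """
    durations = {}
    for name, text in logs_by_vm.items():
        duration = bootstrap_duration(text)
        if duration is not None:
            durations[name] = duration
    if not durations:
        return None
    name = max(durations, key=durations.get)
    return name, durations[name]


# Hypothetical VM names and log excerpts, for illustration only.
example_logs = {
    "vm-a": "starting...\nBootstrap finished in 7 minutes and 1 seconds\n",
    "vm-b": "starting...\nBootstrap finished in 6 minutes and 30 seconds\n",
    "vm-c": "starting...\n",  # bootstrap not yet complete
}
```

Calling `slowest_vm(example_logs)` returns the name and duration for `vm-a`, which is the value that would be recorded for that gameday.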

DR process time is the time it takes to execute a DR process. It should include creating MRs, communications, execution, and verification. This is currently a rough measurement, since the current process has MRs created in advance of the gameday. Its purpose is to inform the overall flow and duration of recovery work for planning purposes.
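Several entries in this document derive the process duration from the time between the change::in-progress and change::complete labels being set. A minimal Python sketch of that arithmetic follows, assuming the two timestamps have already been copied from the change issue as ISO 8601 strings. The function name and the example timestamps are hypothetical.

```python
from datetime import datetime


def format_duration(start_iso, end_iso):
    """Return the elapsed time between two ISO 8601 timestamps as HH:MM:SS."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    total_seconds = int((end - start).total_seconds())
    if total_seconds < 0:
        raise ValueError("end timestamp precedes start timestamp")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Hypothetical label timestamps, for illustration only.
in_progress_at = "2024-01-01T09:00:00+00:00"
complete_at = "2024-01-01T10:38:00+00:00"
duration = format_duration(in_progress_at, complete_at)
```

The result uses the same HH:MM:SS layout as the Duration column, so it can be recorded directly.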

| Date | Environment | VM Provision Time | Bootstrap Time | Notes |
| --- | --- | --- | --- | --- |
| 2024-10-21 | GPRD | 00:39:00 | 00:10:41 | Gameday change issue. The VM provision time is for 45 production Gitaly VMs. |
| 2024-10-15 | GSTG | 00:14:10 | 00:07:01 | Gameday change issue. This time is calculated from the slowest Gitaly node in the recreation process. |
| 2024-08-22 | GSTG | 00:14:49 | 00:07:07 | Gameday change issue. This time is calculated from the slowest Gitaly node in the recreation process. |
| 2024-07-10 | GSTG | 00:18:21 | 00:08:48 | Change issue |
| 2024-06-20 | GPRD | 00:24:13 | 00:07:11 | Initial test of using OS disk snapshots for restore in GPRD. Change issue |
| 2024-06-10 | GSTG | 00:14:21 | 00:08:01 | Game Day change issue |

| Date | Environment | Duration | Notes |
| --- | --- | --- | --- |
| 2024-10-21 | GPRD | 02:05:00 | Change issue. This was a limited Gameday that only measured creating and removing VMs. |
| 2024-10-15 | GSTG | 01:38:00 | Change issue. Time difference is between the change::in-progress and change::complete labels being set. It includes the time required to create MRs and to establish an SSH connection to Staging. |
| 2024-08-22 | GSTG | 02:07:00 | Change issue. Time difference is between the change::in-progress and change::complete labels being set. It includes the time required to create MRs, create a PAT, and establish an SSH connection to Staging. |
| 2024-07-10 | GSTG | 01:15:00 | Change issue |
| 2024-06-10 | GSTG | 01:20:00 | Time difference is between the change::in-progress and change::complete labels being set. Doesn't include time to create MRs. |

| Date | Environment | VM Provision Time | Bootstrap Time | Notes |
| --- | --- | --- | --- | --- |
| 2024-08-28 | GSTG | 00:19:25 | 00:12:58 | GSTG Patroni Gameday |
| 2024-08-08 | GSTG | 00:20:49 | 00:10:57 | GSTG Patroni Gameday. This is calculated from the slowest Patroni node among all the clusters. |
| 2024-08-06 | GPRD | 00:17:41 | 00:11:03 | GPRD Patroni provisioning test with the registry cluster. |
| 2024-04-25 | GSTG | HH:MM:SS | 00:06:00 | Collection of a Patroni bootstrap duration baseline while using OS disk snapshots. Terraform apply duration was not recorded. |
| 2024-04-25 | GSTG | HH:MM:SS | 00:35:00 | Collection of a Patroni bootstrap duration baseline while using a clean Ubuntu image. Terraform apply duration was not recorded. |

| Date | Environment | Duration | Notes |
| --- | --- | --- | --- |
| 2024-08-28 | GSTG | 00:39:00 | For this Gameday exercise on GSTG |
| 2024-08-08 | GSTG | 01:12:SS | For this Gameday exercise on GSTG. We attempted to create new Patroni nodes in recovery zones; it took longer than expected because we hit the snapshot quota. |

HAProxy/Traffic Routing Zonal Outage DR Process Time

| Date | Environment | VM Provision Time | Bootstrap Time | Notes |
| --- | --- | --- | --- | --- |
| 2025-03-17 | GSTG | 00:09:00 | 00:23:50 | Game Day change issue on GSTG |
| 2024-08-14 | GSTG | 00:14:40 | 00:13:15 | Game Day change issue on GSTG |

| Date | Environment | Duration | Notes |
| --- | --- | --- | --- |
| 2025-03-17 | GSTG | 01:22:00 | Game Day change issue on GSTG |
| 2024-10-10 | GSTG | 01:30:00 | Game Day change issue on GSTG. First time run by a non-Ops team member |
| 2024-08-14 | GSTG | 00:53:00 | Game Day change issue on GSTG |

| Date | Environment | VM Provision Time | Bootstrap Time | Notes |
| --- | --- | --- | --- | --- |
| 2024-08-29 | GPRD | 00:08:20 | 00:03:59 | Game Day change issue on GPRD |

| Date | Environment | Duration | Notes |
| --- | --- | --- | --- |
| 2024-08-29 | GPRD | 01:01:00 | Game Day change issue on GPRD |
| 2025-04-21 | GPRD | 00:49:00 | Game Day change issue on GPRD |