While testing our recovery processes for zonal and regional outages, we record timing information.
There are currently three timing categories:
- Fleet-specific VM recreation time
- Component-specific DR restore process time
- Total DR restore process time
The fleet-specific VM recreation time is the time from when an apply is performed from an MR to create new VMs until a successful bootstrap script completion is recorded.
In the bootstrap logs (or console output), look for `Bootstrap finished in X minutes and Y seconds`.
When many VMs are provisioned, use the last VM to complete as the measurement.
During the provisioning process, when a new VM is created, it executes a bootstrap script that may restart the VM.
This measurement might take place over multiple boots.
This script can help measure the bootstrap time.
This can be collected for all VMs during a gameday, or a random VM if we are creating many VMs.
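
As a rough illustration of the measurement described above, the following sketch extracts the `Bootstrap finished in X minutes and Y seconds` line from each VM's log and reports the slowest VM. It is not the script referenced above. The function names, the VM names in the usage note, and the choice to take the last matching line when a log contains several (for example after a restart) are all assumptions made for this example.

```python
import re

# Matches the completion line described above, e.g.
# "Bootstrap finished in 7 minutes and 1 seconds".
# Singular "minute"/"second" is tolerated in case the script emits it.
BOOTSTRAP_RE = re.compile(r"Bootstrap finished in (\d+) minutes? and (\d+) seconds?")


def bootstrap_seconds(log_text):
    """Return the bootstrap duration in seconds from one VM's log, or None."""
    matches = BOOTSTRAP_RE.findall(log_text)
    if not matches:
        return None
    # Assumption: if the message appears more than once (for example after
    # a restart), the last occurrence is the one that counts.
    minutes, seconds = matches[-1]
    return int(minutes) * 60 + int(seconds)


def slowest_vm(logs_by_vm):
    """Return (vm_name, seconds) for the VM with the longest bootstrap."""
    durations = {}
    for vm, text in logs_by_vm.items():
        secs = bootstrap_seconds(text)
        if secs is not None:
            durations[vm] = secs
    if not durations:
        raise ValueError("no bootstrap completion line found in any log")
    vm = max(durations, key=durations.get)
    return vm, durations[vm]


def format_hms(total_seconds):
    """Format a number of seconds as HH:MM:SS, as used in the results tables."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

Given a mapping such as `{"vm-a": log_a, "vm-b": log_b}` (hypothetical names), `slowest_vm` returns the name and duration of the VM to record, and `format_hms` converts that duration to the `HH:MM:SS` form.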
The DR restore process time is the time it takes to execute a DR process. It should include creating MRs, communications, execution, and verification.
This is currently a rough measurement because the current process has MRs created in advance of the gameday.
The measurement is intended to inform the overall flow and duration of recovery work for planning purposes.
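
Several of the durations recorded in the tables below were taken as the time between the `change::in-progress` and `change::complete` labels being set on the change issue. A minimal sketch of that arithmetic follows. It assumes the two label timestamps have already been obtained as ISO 8601 strings with an explicit UTC offset; how they are retrieved is out of scope here, and the function name and the example timestamps are hypothetical.

```python
from datetime import datetime


def label_duration(in_progress_at, complete_at):
    """Return the elapsed time between two ISO 8601 timestamps as HH:MM:SS.

    Both arguments are assumed to carry an explicit offset such as "+00:00".
    """
    start = datetime.fromisoformat(in_progress_at)
    end = datetime.fromisoformat(complete_at)
    if end < start:
        raise ValueError("complete timestamp is earlier than in-progress timestamp")
    total = int((end - start).total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Illustrative timestamps only, not taken from a real change issue.
print(label_duration("2024-01-01T09:00:00+00:00", "2024-01-01T10:38:00+00:00"))
```

A duration computed this way covers only the window between the two labels, so any work done beforehand, such as preparing MRs, has to be added separately or called out in the notes.
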
Date | Environment | VM Provision Time | Bootstrap Time | Notes |
--- | --- | --- | --- | --- |
2024-10-21 | GPRD | 00:39:00 | 00:10:41 | Gameday change issue. The VM provision time is for 45 production Gitaly VMs. |
2024-10-15 | GSTG | 00:14:10 | 00:07:01 | Gameday change issue, this time is calculated from the slowest Gitaly node in the recreation process. |
2024-08-22 | GSTG | 00:14:49 | 00:07:07 | Gameday change issue, this time is calculated from the slowest Gitaly node in the recreation process. |
2024-07-10 | GSTG | 00:18:21 | 00:08:48 | Change issue |
2024-06-20 | GPRD | 00:24:13 | 00:07:11 | Initial test of using OS disk snapshots for restore in GPRD. Change issue |
2024-06-10 | GSTG | 00:14:21 | 00:08:01 | Game Day change issue |
Date | Environment | Duration | Notes |
--- | --- | --- | --- |
2024-10-21 | GPRD | 02:05:00 | Change issue. This was a limited gameday that only measured creating and removing VMs. |
2024-10-15 | GSTG | 01:38:00 | Change issue. The duration is the time between the `change::in-progress` and `change::complete` labels being set; it includes the time required to create MRs and to establish an SSH connection to Staging. |
2024-08-22 | GSTG | 02:07:00 | Change issue. The duration is the time between the `change::in-progress` and `change::complete` labels being set; it includes the time required to create MRs, create a PAT, and establish an SSH connection to Staging. |
2024-07-10 | GSTG | 01:15:00 | Change issue |
2024-06-10 | GSTG | 01:20:00 | The duration is the time between the `change::in-progress` and `change::complete` labels being set. It does not include the time to create MRs. |
Date | Environment | Duration | Notes |
--- | --- | --- | --- |
2024-08-28 | GSTG | 00:39:00 | For this Gameday exercise on GSTG |
2024-08-08 | GSTG | 01:12:SS | For this Gameday exercise on GSTG, we attempted to create new Patroni nodes in recovery zones; it took longer than expected because we hit the snapshot quota |