gameday

The Ops team simulates mock disaster recovery (DR) events almost every week for a service or a combination of services, to test our DR processes and improve them ahead of an actual incident.

The goal of Gamedays is to improve our DR processes and document them so that people without context on the services can execute them the same way during an actual outage.

Join the Slack channel #f_gamedays to follow updates.

Additionally, information is announced in #production-engineering before every Gameday.

Role: Change Technician
Responsibility: Execute the gameday scenario and technical steps
Key Actions:

  • Plan and execute all technical procedures
  • Monitor system metrics during execution
  • Document timing and measurements
  • Coordinate with other team members
  • Handle rollback procedures if needed

Role: Change Reviewer
Responsibility: Technical validation and oversight
Key Actions:

  • Review and approve the gameday plan
  • Verify technical accuracy of procedures
  • Assess risk and rollback strategies
  • Provide technical guidance during execution

View our current Regional/Zonal RTO here

We have clear confidence levels set up for each service that represent how efficient our current DR process is.

  • No confidence

    1. We have not tested recovery
    2. We do not have a good understanding of the impact of the component going down
    3. We do not have an emergency plan for when the component goes down
  • Low confidence

    1. We have not tested recovery
    2. We have a good understanding of the impact of the component going down
    3. We may or may not have an emergency plan when the component goes down, but it has not been validated
  • Medium confidence

    1. We have tested recovery in a production-like environment but not in production
    2. We have a good understanding of the impact of the component going down
    3. We have an emergency plan for when the component goes down, and it has been validated in some environment
  • High confidence

    1. We have tested recovery in production
    2. We have a good understanding of the impact of the component going down
    3. We have an emergency plan when the component goes down, and it has been validated
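The four levels above can be read as a small decision rule over three criteria: where recovery was tested, whether the impact is understood, and whether the emergency plan was validated. The sketch below is only an illustration; the function name, the string labels for `tested_in`, and the fallback for combinations the list does not describe are all assumptions, not part of the documented process.

```python
from typing import Optional


def confidence_level(tested_in: Optional[str],
                     understands_impact: bool,
                     plan_validated: bool) -> str:
    """Map the three criteria to a confidence level.

    tested_in is None (untested), "production-like", or "production".
    These labels are hypothetical and chosen only for this sketch.
    """
    if understands_impact and plan_validated:
        if tested_in == "production":
            return "High"
        if tested_in == "production-like":
            return "Medium"
    # Assumption: any combination the list does not describe
    # falls back to the lower level.
    if understands_impact:
        return "Low"
    return "No"


print(confidence_level("production-like", True, True))
```

Falling back to the lower level for undocumented combinations keeps the rating conservative, which seems the safer reading when a service only partly meets a level.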

View our Zonal Confidence levels here

  • No Confidence

    1. We do not have an emergency plan in place
    2. We do not have confidence that a service can be recreated in a new region
  • Low Confidence

    1. We have ensured data is replicated and accessible in another region.
    2. We do not have an emergency plan* in place
  • Medium Confidence

    1. We have ensured data is replicated
    2. We have plans* to build infrastructure in place
  • High Confidence

    1. We have automated testing for data that is replicated
    2. We have infrastructure ready to receive traffic

Note: The plan mentioned in the Regional Confidence levels includes expanding Terraform to support resources in other regions.

View our Regional RTO here

Phases are groupings that show what can be done in parallel. Items in the same phase can be done at the same time.
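As a rough illustration of why phases matter for planning: if the items in one phase run in parallel, that phase takes about as long as its slowest item. The sketch below additionally assumes that phases run one after another, which the text above does not state. The function name, item names, and durations are hypothetical.

```python
def estimated_recovery_minutes(phases):
    """Estimate total recovery time from a list of phases.

    Each phase is a dict mapping an item name to its duration in minutes.
    Assumptions: items within a phase run fully in parallel, and phases
    run sequentially.
    """
    total = 0
    for items in phases:
        if items:
            # A phase finishes when its slowest item finishes.
            total += max(items.values())
    return total


# Hypothetical example data, not real measurements.
example_phases = [
    {"database restore": 30, "cache rebuild": 10},
    {"web fleet": 15, "api fleet": 20},
]
print(estimated_recovery_minutes(example_phases))  # 50
```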

During the process of testing our recovery processes for Zonal and Regional outages, we want to record timing information. There are three different timing categories right now:

  1. Fleet-specific VM recreation time
  2. Component-specific DR restore process time
  3. Total DR restore process time

VM Provision Time: The time from when an apply is performed from an MR to create new VMs until a successful bootstrap script completion is recorded. In the bootstrap logs (or console output), look for "Bootstrap finished in X minutes and Y seconds". When many VMs are provisioned, use the last VM to complete as the measurement.
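When many VMs are involved, the log lookup can be scripted. The following is a minimal sketch that extracts the duration from the "Bootstrap finished in X minutes and Y seconds" line and picks the slowest VM. It assumes the log text for each VM has already been collected into a dictionary, and that all VMs started at roughly the same time, so the longest bootstrap approximates the last VM to complete. The function names and sample log lines are hypothetical.

```python
import re

# Matches the completion line quoted above; the surrounding log format is an assumption.
BOOTSTRAP_RE = re.compile(r"Bootstrap finished in (\d+) minutes and (\d+) seconds")


def bootstrap_seconds(log_text):
    """Return the bootstrap duration in seconds from one VM's log, or None if absent."""
    matches = BOOTSTRAP_RE.findall(log_text)
    if not matches:
        return None
    # Use the last occurrence in case the line appears more than once.
    minutes, seconds = matches[-1]
    return int(minutes) * 60 + int(seconds)


def slowest_vm(logs_by_vm):
    """Return (vm_name, seconds) for the VM with the longest bootstrap, or None."""
    finished = {}
    for vm, text in logs_by_vm.items():
        duration = bootstrap_seconds(text)
        if duration is not None:
            finished[vm] = duration
    if not finished:
        return None
    vm = max(finished, key=finished.get)
    return vm, finished[vm]


# Hypothetical sample logs, not real output.
sample_logs = {
    "vm-a": "boot ok\nBootstrap finished in 4 minutes and 30 seconds\n",
    "vm-b": "boot ok\nBootstrap finished in 6 minutes and 5 seconds\n",
    "vm-c": "boot ok\nstill running\n",
}
print(slowest_vm(sample_logs))  # ('vm-b', 365)
```

VMs whose logs lack the completion line are skipped here; in a real gameday they would more likely need to be flagged as unfinished.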

Bootstrap Time: During provisioning, a newly created VM executes a bootstrap script that may restart the VM, so this measurement might span multiple boots. This script can help measure the bootstrap time. It can be collected for all VMs during a gameday, or for a random VM if we are creating many VMs.

Gameday DR Process Time: The time it takes to execute a DR process, including creating MRs, communications, execution, and verification. This is currently a rough measurement, since the current process has MRs created in advance of the gameday. Ideally, this measurement informs the overall flow and duration of recovery work for planning purposes.

Note: View time measurements here