Gamedays
Overview
The Ops team runs mock DR events almost every week for a service or a combination of services to test our DR processes and improve them before an actual incident occurs.
The goal of Gamedays is to improve our DR processes and document them so that folks without prior context about the services can execute them the same way during an actual outage.
Attend a Gameday
Join the Slack channel #f_gamedays to follow updates
Additionally, details about each upcoming Gameday are announced in #production-engineering
Roles and Responsibilities
Role | Responsibility | Key Actions
---|---|---
Change Technician | Execute the gameday scenario and technical steps | Plan and execute all technical procedures; monitor system metrics during execution; document timing and measurements; coordinate with other team members; handle rollback procedures if needed
Change Reviewer | Technical validation and oversight | Review and approve the gameday plan; verify technical accuracy of procedures; assess risk and rollback strategies; provide technical guidance during execution
RTO/RPO
View our current Regional/Zonal RTO here
Definitions
Confidence Levels
We have clear confidence levels set up for each service that represent how effective our current DR process is.
Zonal Confidence Level
- No confidence
  - We have not tested recovery
  - We do not have a good understanding of the impact of the component going down
  - We do not have an emergency plan for when the component goes down
- Low confidence
  - We have not tested recovery
  - We have a good understanding of the impact of the component going down
  - We may or may not have an emergency plan for when the component goes down, but it has not been validated
- Medium confidence
  - We have tested recovery in a production-like environment, but not in production
  - We have a good understanding of the impact of the component going down
  - We have an emergency plan for when the component goes down, and it has been validated in some environment
- High confidence
  - We have tested recovery in production
  - We have a good understanding of the impact of the component going down
  - We have an emergency plan for when the component goes down, and it has been validated
View our Zonal Confidence levels here
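As an illustration only, these four levels could be tracked per component in code; the enum, component names, and values below are hypothetical and not part of our tooling.

```python
from enum import IntEnum

class ZonalConfidence(IntEnum):
    """Zonal DR confidence levels, ordered from least to most tested."""
    NO_CONFIDENCE = 0      # recovery untested, impact not understood, no emergency plan
    LOW_CONFIDENCE = 1     # recovery untested, impact understood, plan (if any) unvalidated
    MEDIUM_CONFIDENCE = 2  # recovery tested in a production-like environment, plan validated somewhere
    HIGH_CONFIDENCE = 3    # recovery tested in production, plan validated

# Hypothetical example of recording the level per component.
confidence = {
    "postgres": ZonalConfidence.MEDIUM_CONFIDENCE,
    "web-fleet": ZonalConfidence.HIGH_CONFIDENCE,
}

# Components that still need a production-tested, validated recovery process.
needs_work = [c for c, level in confidence.items() if level < ZonalConfidence.HIGH_CONFIDENCE]
print(needs_work)
```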
Regional Confidence Level
- No Confidence
  - We do not have an emergency plan in place
  - We do not have confidence that a service can be recreated in a new region
- Low Confidence
  - We have ensured data is replicated and accessible in another region
  - We do not have an emergency plan* in place
- Medium Confidence
  - We have ensured data is replicated
  - We have plans* in place to build infrastructure
- High Confidence
  - We have automated testing for data that is replicated
  - We have infrastructure ready to receive traffic
Note: The plan* mentioned in the Regional Confidence levels includes expanding Terraform to provision resources in other regions.
View our Regional RTO here
Phases
Phases are groupings that show what can be done in parallel. Items in the same phase can be done at the same time.
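As a sketch only (the phase contents, step names, and runner below are invented for illustration, not part of our DR tooling), grouping steps by phase and running each phase's items concurrently could look like this:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical DR steps grouped by phase; items in the same phase have no
# dependencies on each other and can run in parallel.
phases = [
    ["restore_database", "recreate_object_storage"],  # phase 1
    ["provision_vm_fleet"],                           # phase 2
    ["verify_service_health", "update_dns"],          # phase 3
]

def run_step(name: str) -> None:
    print(f"running {name}")  # placeholder for the real procedure

for number, steps in enumerate(phases, start=1):
    print(f"--- phase {number} ---")
    with ThreadPoolExecutor() as pool:
        # A phase is complete only when every item in it has finished.
        list(pool.map(run_step, steps))
```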
Time Measurements
While testing our recovery processes for Zonal and Regional outages, we want to record timing information. There are currently three timing categories:
- Fleet specific VM recreation time
- Component specific DR restore process time
- Total DR restore process time
Common measurements
VM Provision Time: The time from when an apply is performed from an MR to create new VMs until we record a successful bootstrap script completion. In the bootstrap logs (or console output), look for "Bootstrap finished in X minutes and Y seconds". When many VMs are provisioned, the last VM to complete is the measurement.
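As a rough sketch (the timestamps below are made up), the fleet-level provision time is simply the gap between the apply and the last VM's completion line:

```python
from datetime import datetime

# Hypothetical timestamps: when the MR's apply started, and when each VM's
# console output showed "Bootstrap finished in X minutes and Y seconds".
apply_started = datetime(2024, 5, 1, 10, 0, 0)
bootstrap_finished = {
    "vm-01": datetime(2024, 5, 1, 10, 18, 40),
    "vm-02": datetime(2024, 5, 1, 10, 22, 5),
    "vm-03": datetime(2024, 5, 1, 10, 21, 12),
}

# The measurement is taken against the last VM to complete.
provision_time = max(bootstrap_finished.values()) - apply_started
print(f"VM provision time: {provision_time}")  # 0:22:05
```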
Bootstrap Time: During provisioning, a newly created VM executes a bootstrap script that may restart the VM, so this measurement might span multiple boots. This script can help measure the bootstrap time. It can be collected for all VMs during a gameday, or from a random VM if we are creating many VMs.
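This is not the script referenced above; purely as a hypothetical sketch, and assuming the bootstrap script reports its own duration in the "Bootstrap finished" line quoted under VM Provision Time, the value could be pulled from a sampled VM's log like so (the log path is invented):

```python
import re
from pathlib import Path

# Assumes a log line like "Bootstrap finished in 12 minutes and 34 seconds".
PATTERN = re.compile(r"Bootstrap finished in (\d+) minutes and (\d+) seconds")

def bootstrap_seconds(log_file: Path) -> int | None:
    """Return the bootstrap duration reported in one VM's log, if present."""
    match = PATTERN.search(log_file.read_text())
    if match is None:
        return None
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds

log = Path("logs/vm-01-bootstrap.log")  # hypothetical sampled VM log
if log.exists():
    print(bootstrap_seconds(log))
```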
Gameday DR Process Time: The time it takes to execute a DR process, including creating MRs, communications, execution, and verification. This is a rough measurement right now since the current process has MRs created in advance of the gameday. Ideally, this measurement informs the overall flow and duration of recovery work for planning purposes.
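For planning purposes, the overall number might be assembled from per-stage timestamps recorded during the gameday; the stage names and times below are invented for illustration only.

```python
from datetime import datetime

# Hypothetical timeline recorded during a gameday, covering MR creation,
# communications, execution, and verification.
timeline = {
    "MRs created and approved": datetime(2024, 5, 1, 9, 30),
    "announcement posted":      datetime(2024, 5, 1, 9, 45),
    "execution started":        datetime(2024, 5, 1, 10, 0),
    "execution finished":       datetime(2024, 5, 1, 10, 40),
    "verification complete":    datetime(2024, 5, 1, 10, 55),
}

# Total process time is the span from the first to the last recorded event.
events = sorted(timeline.items(), key=lambda item: item[1])
total = events[-1][1] - events[0][1]
print(f"Gameday DR process time: {total}")  # 1:25:00
```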
Note: View time measurements here