# GCPScheduledSnapshots
## Overview

### Covered Alerts

- GCPScheduledSnapshotsDelayed
- GCPScheduledSnapshotsFailed
Scheduled snapshots in GCP are not running at their regular interval, or are failing. GCP snapshots are necessary to meet our RPO/RTO targets for the Gitaly service and our RTO for Patroni, since using them speeds up recovery. If snapshots are not being taken consistently, we are at risk of excessive data loss in the event of a catastrophic failure, and of missing RTO targets if we need to restore instances.
- For all Gitaly storage nodes our default policy is to take a disk snapshot every 1 hour.
- For all other nodes that take scheduled snapshots we default to every 4 hours.
Contributing factors to scheduled snapshot failures or delays include:

- GCP Quota Limits: GCP enforces quotas on the number of snapshots that can exist at a given time. If we hit this limit, snapshot execution will halt. (A quick check for this, and for the guest agent below, is sketched after this list.)
- Transient API errors on GCP's side: While uncommon, occasionally Google will have a failure that is unrelated to us and results in a snapshot not being taken, or failing.
- Misconfigured guest OS settings when application-consistent snapshots are enabled: When the application-consistent setting is enabled for a snapshot, the Google Cloud guest agent must be installed on the guest OS and configured to allow the feature. If this is not done, the snapshot will fail with an error.
- A problem preventing data from being collected from the Stackdriver Prometheus exporter, which can make the alert fire without an actual snapshot failure.
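Two of these factors can be checked quickly from the command line. A minimal sketch, assuming `gcloud` access to the gitlab-production project and SSH access to the affected instance; the guest agent service name may differ by OS image:

```sh
# Compare current snapshot quota usage against the project limit.
gcloud compute project-info describe --project gitlab-production \
  --flatten="quotas[]" \
  --format="table(quotas.metric,quotas.usage,quotas.limit)" | grep -i snapshot

# On the affected instance: confirm the guest agent required for
# application-consistent snapshots is installed and running.
systemctl status google-guest-agent
```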
If snapshot failures or delays are observed, you should check Stackdriver logs to determine the cause, and decide if any further action is necessary.
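For example, recent snapshot-related errors can be pulled with `gcloud logging read`. This is a sketch only: the `protoPayload.response.targetLink` field is the one referenced later in this runbook, but the exact filter may need widening or narrowing:

```sh
# List ERROR-severity audit log entries from the last 24 hours that
# reference a disk in their response, newest first.
gcloud logging read \
  'severity>=ERROR AND protoPayload.response.targetLink:"disks"' \
  --project gitlab-production --freshness=24h --limit=20
```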
## Services

- GCP snapshots runbook
- Because snapshots are taken at the cloud infrastructure level, this alert may apply to a number of different services. Refer to the logs in Stackdriver to find the affected disk, which should provide a hint as to which service the failure applies to. In some cases it may be necessary to refer to the service-catalog to locate the appropriate service owner and determine the impact of missing snapshots.
## Metrics

- gcp-snapshots.yml defines two alerts outside of the metrics-catalog. Both alerts use metrics scraped by the Stackdriver exporter.
- GCPScheduledSnapshotsDelayed looks for a timeseries that indicates that snapshots have stopped appearing for a disk that was previously taking scheduled snapshots in the past week.
- GCPScheduledSnapshotsFailed looks for any timeseries that represents a snapshot failure in the environment.
- Under normal circumstances, we do not expect any snapshot failures, and the alert thresholds are set to reflect that.
- Metrics in Grafana (a direct Prometheus query sketch follows this list):
  - GCPScheduledSnapshotsDelayed
    - It is expected that this will return “No data” normally.
  - GCPScheduledSnapshotsFailed
    - It is expected that this will return “No data” normally.
- Stackdriver logs for successful snapshots
- Stackdriver logs for snapshot errors
- GCP Quotas
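If Grafana is unavailable, the underlying timeseries can be queried directly against the Prometheus HTTP API. A sketch only: `<prometheus_host>`, `<snapshot_metric_name>`, and the `environment` selector are placeholders; take the real metric name and labels from the alert expressions in gcp-snapshots.yml:

```sh
# An empty result vector here corresponds to "No data" in Grafana.
curl -sG 'http://<prometheus_host>/api/v1/query' \
  --data-urlencode 'query=<snapshot_metric_name>{environment="gprd"}' \
  | jq '.data.result'
```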
## Alert Behavior

- The GCPScheduledSnapshotsFailed alert is scoped to the entire GPRD environment without other distinguishing labels, so it is not recommended to create a silence unless the cause of the alert is understood and a resolution is in progress.
- The GCPScheduledSnapshotsDelayed alert may also fire if a snapshot schedule is paused or removed. If this is done intentionally, and the disk needs to stay around, silence the alert for the impacted disk for 1 week (a sketch follows this list), and it will fall out of the query results.
- False positive alerts can occur when there is an issue ingesting Stackdriver exporter metrics into the monitoring system, resulting in a GCPScheduledSnapshotsDelayed alert.
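If an intentional schedule pause calls for the one-week silence mentioned above, it can be created with `amtool`. A sketch, assuming the disk is carried in a `disk_name` label; copy the real matcher from the firing alert in Alertmanager:

```sh
# Silence GCPScheduledSnapshotsDelayed for one disk for 7 days (168h).
amtool silence add \
  --alertmanager.url=http://<alertmanager_host> \
  --author="$USER" \
  --duration=168h \
  --comment="Snapshot schedule intentionally paused for <disk_name>" \
  alertname=GCPScheduledSnapshotsDelayed disk_name=<disk_name>
```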
## Severities

- The failure to take snapshots of our disks does not cause any immediate customer-facing impact; instead, it only exposes us to increased risk in the event of additional failures.
- Certain internal processes may run into issues if they depend on recent snapshots being available, such as automated database refresh tasks.
- If unsure, a good starting severity for this class of alerts would be S3.
## Recent changes

It is unlikely that recent changes have caused this alert unless someone recently changed the snapshot configuration for that particular system.
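To rule this out, the current snapshot schedules (implemented as GCE resource policies) can be listed and inspected. A minimal sketch; the region is taken from the example disk later in this runbook:

```sh
# List snapshot schedule policies in the project.
gcloud compute resource-policies list --project gitlab-production

# Inspect one schedule's interval and retention settings.
gcloud compute resource-policies describe <policy_name> \
  --project gitlab-production --region us-east1
```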
## Troubleshooting

- Basic troubleshooting steps:
  - Determine if the alert is valid by cross-referencing the Prometheus metrics in Grafana with the logs in Stackdriver.
  - If there are errors returned in the log, view the message and determine which disk it is impacting. The disk will be stored in the `protoPayload.response.targetLink` field in the log entry (a one-liner to extract the disk name is sketched after this list).
  - The message should indicate whether the error is due to quota limits being reached, a misconfigured guest OS, a transient GCP API failure, etc.
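The disk name can be pulled out of the `protoPayload.response.targetLink` field directly on the command line. A sketch, reusing the same filter assumptions as the earlier logging example:

```sh
# Print only the targetLink of recent snapshot errors, then keep the
# final path segment, which is the disk name.
gcloud logging read \
  'severity>=ERROR AND protoPayload.response.targetLink:"disks"' \
  --project gitlab-production --freshness=24h --limit=20 \
  --format="value(protoPayload.response.targetLink)" \
  | awk -F/ '{print $NF}'
```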
## Possible Resolutions

- Refer to this incident relating to a misconfigured OS.
- Manually retry the failed snapshot:
  - If the errors in Stackdriver recommend a retry, e.g. `"Internal error. Please try again or contact Google Support. (Code: '-5418078226953242804')"`, look up the disk name of the failed snapshot by going to `response` -> `error` -> `targetLink` in the Stackdriver log message. For example, `https://www.googleapis.com/compute/v1/projects/gitlab-production/zones/us-east1-c/disks/file-97-stor-gprd-data` has the disk name as the last part of the URI: `file-97-stor-gprd-data`.
  - Then run the following command to create the snapshot (replace `<disk_name>` with the actual name, e.g. `file-97-stor-gprd-data`, and `<zone>` with the disk's zone, which can be found in this list):

    ```sh
    gcloud --project gitlab-production compute disks snapshot <disk_name> \
      --zone=<zone> --description="Retried manual snapshot for <disk_name>"
    ```

  - The manually created snapshots will get cleaned up by a scheduled cron job. (A sketch to verify the retry succeeded follows this list.)
- Request a snapshot quota increase if that is what’s indicated by the failure log.
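After a manual retry, the new snapshot can be verified with `gcloud compute snapshots list`; the `~` operator does a regex match against the snapshot's source disk URL, and a `READY` status confirms the retry succeeded:

```sh
# Show the most recent snapshots for the disk, newest first.
gcloud compute snapshots list --project gitlab-production \
  --filter="sourceDisk~<disk_name>" \
  --sort-by=~creationTimestamp --limit=5 \
  --format="table(name,creationTimestamp,status)"
```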
## Dependencies

- If the cause of the snapshot failure is not clear from the logs and manual retry attempts are not succeeding after a short period (one or two hours), it may be necessary to escalate.
- Slack channels where help is likely to be found: `#g_production-engineering_ops`, `#s_production_engineering`