
PatroniGCSSnapshotDelayed

We take GCS snapshots of the data disk of a Patroni replica periodically (the period is specified by Chef's node['gitlab-patroni']['snapshot']['cron']['hour']). Only one designated replica is used for snapshots; it does not receive any client connections, nor does it participate in leader election when a failover occurs.
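For reference, a minimal sketch of what the rendered crontab entry could look like, assuming the Chef attribute produces a standard cron schedule (the exact schedule and log path here are illustrative, not the real rendered entry):

    # Hypothetical rendered entry; the hour comes from node['gitlab-patroni']['snapshot']['cron']['hour']
    0 */6 * * * /usr/local/bin/gcs-snapshot.sh >> /var/log/gitlab/postgresql/gcs-snapshot-$(date +\%F).log 2>&1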

The replica is assigned a special Chef role, <env>-base-db-patroni-backup-replica, in Terraform; here is an example from the production environment.
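To confirm which node currently carries that role, a hedged example using knife (the gprd prefix is just an example environment):

    # Hypothetical knife search for the snapshot replica in production
    knife search node 'roles:gprd-base-db-patroni-backup-replica' -a name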

A cron job runs a Bash script (by default found in /usr/local/bin/gcs-snapshot.sh). The script runs the snapshot operation (i.e. gcloud compute disks snapshot ...) sandwiched between pg_start_backup and pg_stop_backup PostgreSQL calls to ensure the integrity of the data. After a successful snapshot run, the script hits the local Prometheus Pushgateway with the current timestamp for observability.
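For orientation only, here is a minimal sketch of that flow; it is not the real script, and the disk name, zone, and metric name are assumptions:

    #!/usr/bin/env bash
    # Illustrative sketch -- see /usr/local/bin/gcs-snapshot.sh for the real script.
    # Intended to run as the gitlab-psql user.
    set -euo pipefail

    DISK="patroni-data-disk"    # hypothetical data disk name
    ZONE="us-east1-c"           # hypothetical zone
    SNAP="gcs-snapshot-$(date +%Y%m%d%H%M)"

    # Open a backup window so the on-disk state is consistent for the snapshot.
    psql -c "SELECT pg_start_backup('gcs-snapshot', true);"

    # Snapshot the data disk while the backup window is open.
    gcloud compute disks snapshot "$DISK" --zone "$ZONE" --snapshot-names "$SNAP"

    # Close the backup window.
    psql -c "SELECT pg_stop_backup();"

    # Push the success timestamp to the local Pushgateway (metric name assumed).
    echo "gcs_snapshot_last_success_timestamp $(date +%s)" \
      | curl --silent --data-binary @- http://localhost:9091/metrics/job/gcs_snapshot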

This alert monitors the time elapsed since the last successful Patroni GCS snapshot in the Production (gprd) and Staging (gstg) environments. If no successful snapshot has been taken within the last 6 hours, and this condition persists for 30 minutes, the alert will fire.
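To check the condition by hand, a hypothetical query against the Prometheus HTTP API (the metric name and address are assumptions, not the real alert rule):

    # Returns a series only when the last success is more than 6 hours old
    curl -sG 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=time() - gcs_snapshot_last_success_timestamp > 6 * 3600'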

  • If a failover or a restart of the Patroni servers happens during execution of the backup cron job, the GCS snapshot might get halted (failed).

  • This does not affect the Patroni service itself, as the snapshot is a background job. However, our disaster recovery posture becomes questionable if we do not have a recent successful GCS snapshot.

  • Try to determine the root cause of the failed GCS snapshot. If the snapshot failed due to a failover or a restart of the Patroni server, re-run the Bash script (/usr/local/bin/gcs-snapshot.sh) to create a fresh GCS snapshot.

  • The GCS snapshot Bash script pushes a job-completion metric via the Pushgateway, which is used for alerting.

  • The cron job runs the Bash script (by default found in /usr/local/bin/gcs-snapshot.sh) every 6 hours; setting the alert threshold to >=6h means the on-call is only alerted if the job has failed twice in a row (ref).

  • This is how the dashboard looks when the alert is firing:

Alert

  • This is how the dashboard looks under normal conditions:

Normal

  • We can silence this alert by going here, finding PatroniGCSSnapshotDelayed, and clicking the silence option. Silencing might be required if GCS snapshots were intentionally disabled for certain Patroni node changes; in that case, it is a good idea to check the Production issue board.
  • This alert should be fairly rare, and usually indicates that the snapshot job is not behaving as we expect.
  • Previous incidents of this alert firing
  • If the GCS backup has been intentionally disabled, this can be a severity:4 issue; otherwise it should be severity:3.
  • Though this is not an immediate user-facing issue, it has repercussions for customers as well because of the degraded recovery posture. Besides, we might miss our internal RTO and RPO targets for database recovery.
  • Prometheus link to query that triggered the alert
  • If the snapshot operation fails for any reason, the script won't hit the Prometheus Pushgateway, which will eventually trigger the alert.

    Check the logs for any clues; log file names follow the pattern /var/log/gitlab/postgresql/gcs-snapshot-*. Check the most recent ones and see if an error is logged (see the snippet after this item).

    Try running the script manually like this and see if it exits successfully:

    sudo su - gitlab-psql
    /usr/local/bin/gcs-snapshot.sh
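    A quick way to inspect the most recent snapshot log, assuming the naming pattern above:

    # Show the tail of the newest gcs-snapshot log
    ls -t /var/log/gitlab/postgresql/gcs-snapshot-* | head -n 1 | xargs tail -n 50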
  • Patroni Service Overview

  • In case the GCS backup was halted, a new backup job can be started on the Patroni server by running the following in a screen session:

sudo su - gitlab-psql
/opt/wal-g/bin/backup.sh >> /var/log/wal-g/wal-g_backup_push.log 2>&1
# To check progress, we can tail the log on the Patroni server
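One hedged variant that wraps this in a detached screen session (the session name is arbitrary; run as the gitlab-psql user):

screen -dmS walg-backup bash -c '/opt/wal-g/bin/backup.sh >> /var/log/wal-g/wal-g_backup_push.log 2>&1'
# Follow progress from outside the session
tail -f /var/log/wal-g/wal-g_backup_push.log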
  • GCS snapshots might be deliberately disabled to make changes on a Patroni node, or when a Patroni node is scheduled to be destroyed.

  • If the recipient of this alert cannot determine the cause of the delayed GCS snapshots and correct it using the troubleshooting steps above, it may be necessary to escalate.

  • Slack channels where help is likely to be found: #g_infra_database_reliability

  • Link to tune the alert

  • The threshold time should ideally be more than 6 hours, because the cron job that takes the Patroni snapshots runs every six hours.

  • Link to edit this playbook

  • Update the template used to format this playbook

  • Related alerts

  • Related documentation