walgBaseBackupDelayed, WALGBaseBackupFailed
Overview
- The walgBaseBackupDelayed alert indicates that the base_backup for WAL-G has not finished in a certain amount of time.
- WALGBaseBackupFailed means the most recent base_backup has failed.
- This can be due to load on the database servers, network conditions, or problems with GCS.
- This is not a user-impacting alert.
- When this alert fires, the recipient of the alert is expected to check on the base_backup and try to determine what has interrupted or failed the backup.
Services
- Service Overview
- Team that owns the service: Production Engineering : Database Reliability
Metrics
walgBaseBackupDelayed
- walgBaseBackupDelayed fires if the most recent base_backup is older than 30 hours.
- This is recorded via the gitlab_com:last_walg_successful_basebackup_age_in_hours recording rule.
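To see the current value of that recording rule (and which clusters are approaching the 30-hour threshold), it can be queried from the monitoring stack. This is a minimal sketch, assuming access to a Prometheus/Thanos query API and that jq is available; the URL below is a placeholder, not the real endpoint:

```sh
# Placeholder endpoint: substitute the actual Thanos query frontend for the environment.
curl -sG 'https://thanos.example.internal/api/v1/query' \
  --data-urlencode 'query=gitlab_com:last_walg_successful_basebackup_age_in_hours' \
  | jq '.data.result[] | {labels: .metric, age_hours: .value[1]}'
```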
WALGBaseBackupFailed
- WALGBaseBackupFailed fires if the most recent base_backup has failed.
- This is determined by the metric gitlab_job_failed{resource="walg-basebackup", type!~".+logical.+", env="gprd"} == 1
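To see which hosts are reporting the failed job, the same expression can be queried directly. This is a sketch using the same placeholder endpoint as above:

```sh
# Placeholder endpoint: substitute the real Thanos query frontend.
# Prints the label set of each series currently reporting a failed basebackup job.
curl -sG 'https://thanos.example.internal/api/v1/query' \
  --data-urlencode 'query=gitlab_job_failed{resource="walg-basebackup", type!~".+logical.+", env="gprd"} == 1' \
  | jq '.data.result[].metric'
```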
Alert Behavior
- This alert can be silenced if the process is determined to be running and actually working (an example bounded silence is sketched below). It shouldn’t be silenced for longer than the expected time to finish the backup.
- This alert is expected to be rare.
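One way to add a bounded silence is via amtool. This is a sketch, assuming amtool is installed and configured to point at the right Alertmanager; adjust the duration to the expected remaining backup time and link the relevant issue in the comment:

```sh
# Sketch only: silence the delayed-backup alert for a bounded window.
amtool silence add \
  --comment="base_backup confirmed running, see tracking issue" \
  --duration=6h \
  alertname="walgBaseBackupDelayed"
```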
Severities
- The severity of this alert is generally going to be a ~severity::4.
- There is no direct user impact; the impact is in the additional time it would take to recover, should we need to do so.
Verification
Recent changes
- Recent Patroni Production Change/Incident Issues
- Recent chef-repo Changes
- Recent k8s-workloads Changes
Troubleshooting
- Check /var/log/wal-g/wal-g_backup_push.log and/or /var/log/wal-g/wal-g_backup_push.log.1 on patroni nodes. Unfortunately, WAL-G logs are not sent to Kibana at this time.
- This will give you information on what is happening. A finished backup will look something like this:
  <13>Sep 2 00:00:01 backup.sh: INFO: 2021/09/02 08:20:54.669692 Finished writing part 14245.
  <13>Sep 2 00:00:01 backup.sh: INFO: 2021/09/02 08:20:56.337685 Wrote backup with name base_000000050005E38E000000A1
  <13>Sep 2 00:00:01 backup.sh: end backup pgbackup_pg12-patroni-cluster_20210902.
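A minimal sketch for inspecting those logs on the patroni node running the backup; the grep patterns are illustrative, based on the sample output above:

```sh
# Tail the current backup log to see what the backup is doing right now.
sudo tail -n 50 /var/log/wal-g/wal-g_backup_push.log

# Search the current and rotated logs for completion markers or errors.
sudo grep -hE 'Wrote backup with name|end backup|ERROR' \
  /var/log/wal-g/wal-g_backup_push.log /var/log/wal-g/wal-g_backup_push.log.1
```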
Possible Resolutions
- Another process using CPU causing the backup to slow down (see the sketch after this list for a quick check)
- base_backup started on a node that later became unavailable
- The generating cluster isn’t in production, but still alerting
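For the first two items, a quick way to check whether the backup process is still alive on the node and what is competing with it for CPU. This is a sketch; the process names may differ depending on how the backup is launched:

```sh
# Is a wal-g backup push (or the wrapping backup script) still running?
ps -eo pid,etime,pcpu,args | grep -E '[w]al-g|[b]ackup'

# Snapshot of what is currently using the most CPU on the node.
top -b -n 1 | head -n 20
```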
Dependencies
- The base_backup requires Google Cloud Storage to be available (a quick reachability check is sketched below).
- Slack channels where help is likely to be found: #g_infra_database_reliability
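A hypothetical check that GCS is reachable; the bucket path below is a placeholder, and this assumes gsutil is installed and authenticated (otherwise check the GCP status page and look for GCS errors in the WAL-G log):

```sh
# Placeholder bucket: substitute the bucket configured for WAL-G basebackups.
gsutil ls gs://example-walg-basebackup-bucket/ | head -n 5
```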