walgBaseBackupDelayed, WALGBaseBackupFailed
Overview
- The walgBaseBackupDelayed alert indicates that the base_backup for WAL-G has not finished in a certain amount of time.
- WALGBaseBackupFailed means the most recent base_backup has failed.
- This can be due to load on the database servers, network conditions, or problems with GCS.
- This is not a user-impacting alert.
- When this alert fires, the recipient of the alert is expected to check on the base_backup and try to determine what has interrupted or failed the backup.
Services
- Service Overview
- Team that owns the service: Production Engineering : Database Reliability
Metrics
walgBaseBackupDelayed
- walgBaseBackupDelayed fires if the most recent base_backup is older than 30 hours.
- This is recorded via the gitlab_com:last_walg_successful_basebackup_age_in_hours recording rule.
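To see the current value of that recording rule (and which clusters are approaching the 30-hour threshold), it can be queried from the monitoring stack. This is a minimal sketch, assuming access to a Prometheus/Thanos query API and that jq is available; the URL below is a placeholder, not the real endpoint:

```sh
# Placeholder endpoint: substitute the actual Thanos query frontend for the environment.
curl -sG 'https://thanos.example.internal/api/v1/query' \
  --data-urlencode 'query=gitlab_com:last_walg_successful_basebackup_age_in_hours' \
  | jq '.data.result[] | {labels: .metric, age_hours: .value[1]}'
```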
WALGBaseBackupFailed
- WALGBaseBackupFailed fires if the most recent base_backup has failed.
- This is determined by the metric gitlab_job_failed{resource="walg-basebackup", type!~".+logical.+", env="gprd"} == 1
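To see which hosts are reporting the failed job, the same expression can be queried directly. This is a sketch using the same placeholder endpoint as above:

```sh
# Placeholder endpoint: substitute the real Thanos query frontend.
# Prints the label set of each series currently reporting a failed basebackup job.
curl -sG 'https://thanos.example.internal/api/v1/query' \
  --data-urlencode 'query=gitlab_job_failed{resource="walg-basebackup", type!~".+logical.+", env="gprd"} == 1' \
  | jq '.data.result[].metric'
```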
Alert Behavior
- This alert can be silenced if the process is determined to be running and actually working (an example bounded silence is sketched below). It shouldn’t be silenced for longer than the expected time to finish the backup.
- This alert is expected to be rare.
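One way to add a bounded silence is via amtool. This is a sketch, assuming amtool is installed and configured to point at the right Alertmanager; adjust the duration to the expected remaining backup time and link the relevant issue in the comment:

```sh
# Sketch only: silence the delayed-backup alert for a bounded window.
amtool silence add \
  --comment="base_backup confirmed running, see tracking issue" \
  --duration=6h \
  alertname="walgBaseBackupDelayed"
```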
Severities
- The severity of this alert is generally going to be a ~severity::4.
- There is no direct user impact; the impact is in the additional time it would take to recover, should we need to do so.
Verification
Recent changes
- Recent Patroni Production Change/Incident Issues
- Recent chef-repo Changes
- Recent k8s-workloads Changes
Troubleshooting
- Check /var/log/wal-g/wal-g_backup_push.log and/or /var/log/wal-g/wal-g_backup_push.log.1 on patroni nodes. Unfortunately, WAL-G logs are not sent to Kibana at this time.
- This will give you information on what is happening. A finished backup will look something like this:
  <13>Sep 2 00:00:01 backup.sh: INFO: 2021/09/02 08:20:54.669692 Finished writing part 14245.
  <13>Sep 2 00:00:01 backup.sh: INFO: 2021/09/02 08:20:56.337685 Wrote backup with name base_000000050005E38E000000A1
  <13>Sep 2 00:00:01 backup.sh: end backup pgbackup_pg12-patroni-cluster_20210902.
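A minimal sketch for inspecting those logs on the patroni node running the backup; the grep patterns are illustrative, based on the sample output above:

```sh
# Tail the current backup log to see what the backup is doing right now.
sudo tail -n 50 /var/log/wal-g/wal-g_backup_push.log

# Search the current and rotated logs for completion markers or errors.
sudo grep -hE 'Wrote backup with name|end backup|ERROR' \
  /var/log/wal-g/wal-g_backup_push.log /var/log/wal-g/wal-g_backup_push.log.1
```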
Possible Resolutions
- Another process using CPU causing the backup to slow down (see the sketch after this list for a quick check)
- base_backup started on a node that later became unavailable
- The generating cluster isn’t in production, but still alerting
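For the first two items, a quick way to check whether the backup process is still alive on the node and what is competing with it for CPU. This is a sketch; the process names may differ depending on how the backup is launched:

```sh
# Is a wal-g backup push (or the wrapping backup script) still running?
ps -eo pid,etime,pcpu,args | grep -E '[w]al-g|[b]ackup'

# Snapshot of what is currently using the most CPU on the node.
top -b -n 1 | head -n 20
```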
Dependencies
- The base_backup requires Google Cloud Storage to be available (a quick reachability check is sketched below).
- Slack channels where help is likely to be found: #g_infra_database_reliability
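A hypothetical check that GCS is reachable; the bucket path below is a placeholder, and this assumes gsutil is installed and authenticated (otherwise check the GCP status page and look for GCS errors in the WAL-G log):

```sh
# Placeholder bucket: substitute the bucket configured for WAL-G basebackups.
gsutil ls gs://example-walg-basebackup-bucket/ | head -n 5
```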