walgBaseBackupDelayed, WALGBaseBackupFailed

  • walgBaseBackupDelayed indicates that no WAL-G base_backup has finished successfully within the expected window (30 hours).
  • WALGBaseBackupFailed means the most recent base_backup has failed.
  • This can be due to load on the database servers, network conditions, or problems with GCS.
  • This is not a user-impacting alert.
  • When this alert fires, the recipient is expected to check on the base_backup and determine what interrupted it or caused it to fail.
  • walgBaseBackupDelayed fires if the most recent base_backup is older than 30 hours.
  • This is recorded via the gitlab_com:last_walg_successful_basebackup_age_in_hours recording rule.
  • WALGBaseBackupFailed fires if the most recent base_backup has failed.
  • This is determined by the metric gitlab_job_failed{resource="walg-basebackup", type!~".+logical.+", env="gprd"} == 1
  • This alert can be silenced if the process is determined to be running and actually working. It shouldn’t be silenced for longer than the expected time to finish the backup.
  • This alert is expected to be rare.
  • This alert is generally a ~severity::4.
  • There is no direct user impact. The impact is on recovery time: the older the most recent base_backup, the longer a recovery would take should we need one.
  • Check /var/log/wal-g/wal-g_backup_push.log and/or /var/log/wal-g/wal-g_backup_push.log.1 on the Patroni nodes. Unfortunately, WAL-G logs are not sent to Kibana at this time.

  • These logs show what the backup is doing. A finished backup looks something like this:

    <13>Sep 2 00:00:01 backup.sh: INFO: 2021/09/02 08:20:54.669692 Finished writing part 14245.
    <13>Sep 2 00:00:01 backup.sh: INFO: 2021/09/02 08:20:56.337685 Wrote backup with name base_000000050005E38E000000A1
    <13>Sep 2 00:00:01 backup.sh: end backup pgbackup_pg12-patroni-cluster_20210902.
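
Because the logs are not in Kibana, checking them means reading the file on the node. The following is a minimal sketch of how that check could be scripted. It assumes the completion line always has the shape shown in the sample above and that the inner timestamp is in the same timezone as the `now` value you pass in. The names `last_finished_backup`, `is_delayed`, and `SAMPLE_LOG` are hypothetical, and the real alert is driven by the recording rule, not by this log check.

```python
import re
from datetime import datetime, timedelta

# Matches the WAL-G completion line from the sample log above.
# Assumption: the inner timestamp (YYYY/MM/DD HH:MM:SS.ffffff) marks
# the moment the backup finished.
WROTE_BACKUP = re.compile(
    r"INFO: (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\.\d+ "
    r"Wrote backup with name (\S+)"
)

# Threshold taken from the walgBaseBackupDelayed description (30 hours).
DELAY_THRESHOLD = timedelta(hours=30)


def last_finished_backup(lines):
    """Return (finished_at, backup_name) for the last completion line, or None."""
    result = None
    for line in lines:
        match = WROTE_BACKUP.search(line)
        if match:
            finished_at = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
            result = (finished_at, match.group(2))
    return result


def is_delayed(finished_at, now):
    """True if the last finished backup is older than the 30-hour threshold."""
    return now - finished_at > DELAY_THRESHOLD


# The three sample lines from the log excerpt above.
SAMPLE_LOG = [
    "<13>Sep 2 00:00:01 backup.sh: INFO: 2021/09/02 08:20:54.669692 Finished writing part 14245.",
    "<13>Sep 2 00:00:01 backup.sh: INFO: 2021/09/02 08:20:56.337685 Wrote backup with name base_000000050005E38E000000A1",
    "<13>Sep 2 00:00:01 backup.sh: end backup pgbackup_pg12-patroni-cluster_20210902.",
]

if __name__ == "__main__":
    found = last_finished_backup(SAMPLE_LOG)
    if found is None:
        print("No finished backup found in the log")
    else:
        finished_at, name = found
        print(f"Last finished backup: {name} at {finished_at}")
```

On a Patroni node, the same function could be pointed at the real file, for example `with open("/var/log/wal-g/wal-g_backup_push.log") as f: last_finished_backup(f)`. If it returns `None`, the backup has either not finished yet or failed before writing the completion line, so read the tail of the log directly.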