NATS Backup
We run NATS on Kubernetes via its Helm chart, which is based on StatefulSets.
We use Velero to back up the complete NATS workload and persist the backups to remote storage. Velero also takes snapshots of the associated volumes. Backed-up resources include:
- NATS StatefulSet configurations
- Persistent volumes containing NATS data
- Associated Kubernetes resources (ConfigMaps, Secrets, Services)

This is rolled out on all customersdot environments. See gitlab-org/analytics-section/platform-insights/core#83.
Backup Schedule and Retention
Staging environments are backed up daily, and the production environment is backed up every 6 hours. Both use a retention period of 3 days.
| Environment | Frequency | Retention Period | Schedule Expression |
|---|---|---|---|
| Staging | Daily | 3 days | `0 2 * * *` |
| Production | Every 6 hours | 3 days | `0 */6 * * *` |
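
As a quick sanity check on the table above, the number of scheduled backups that should exist at any time is roughly the retention period divided by the backup interval. The sketch below works that out in shell. The helper name `expected_backups` is made up for this illustration and is not part of any existing tooling.

```shell
# Expected number of retained scheduled backups = retention / interval.
# `expected_backups` is a hypothetical helper, used only for this illustration.
expected_backups() {
  retention_hours=$1
  interval_hours=$2
  echo $(( retention_hours / interval_hours ))
}

echo "staging: $(expected_backups 72 24)"    # 3 days retention, daily backups
echo "production: $(expected_backups 72 6)"  # 3 days retention, 6-hourly backups
```

With the values from the table this gives 3 backups for staging and 12 for production. The count seen on a cluster can differ slightly, because expired backups are not necessarily removed the instant they expire and manually triggered backups are listed as well.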
Prerequisites
Before troubleshooting or performing restore operations:
- Make sure your `kubectl config` context points to the correct cluster:

      # Verify cluster access
      $ kubectl config current-context
      # Switch to the correct cluster if needed
      $ kubectl config use-context <cluster-name>

- Velero CLI (optional but recommended)
  - Installation guide
  - With the CLI installed, you can omit the `kubectl exec` wrapper, for example: `velero backup get` instead of `kubectl exec -n velero deployment/velero -- /velero backup get`
- Required permissions:
  - Read access to the velero and nats namespaces
  - For restore operations: write access to the nats namespace
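
The permissions listed above can be spot-checked with `kubectl auth can-i` before starting. The following is a minimal sketch. The verb and resource pairs are illustrative assumptions that stand in for "read" and "write" access, and the `check_access` function and `KUBECTL` variable are hypothetical names, not existing tooling.

```shell
# Preflight sketch: ask the API server whether the current user may perform
# a few representative actions. The verb/resource pairs are illustrative.
KUBECTL=${KUBECTL:-kubectl}

check_access() {
  # Read access to both namespaces.
  for ns in velero nats; do
    printf 'list pods in %s: ' "$ns"
    $KUBECTL auth can-i list pods -n "$ns"
  done
  # Write access to the nats namespace, needed only for restores.
  printf 'create statefulsets in nats: '
  $KUBECTL auth can-i create statefulsets -n nats
}
```

Running `check_access` against the target cluster prints `yes` or `no` for each check.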
Troubleshooting
Monitoring and Alerts
Alert Configuration
We have an alert that triggers when there has been no successful backup in the last 12 hours.
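
The alert condition can be approximated by hand by comparing a last-success Unix timestamp (in seconds), such as the value of the `velero_backup_last_successful_timestamp` metric, with the current time. This is a minimal sketch of that comparison and not the actual alert rule. The function name `backup_is_stale` and the way the timestamp is passed in are assumptions.

```shell
# Returns success (exit 0) when the last successful backup is older than
# the threshold. Arguments: last-success epoch seconds, current epoch seconds,
# optional threshold in hours (defaults to the 12 hours used by the alert).
backup_is_stale() {
  last_success=$1
  now=$2
  threshold_hours=${3:-12}
  age_seconds=$(( now - last_success ))
  [ "$age_seconds" -gt $(( threshold_hours * 3600 )) ]
}

# Example with fixed timestamps so the result is reproducible:
# 1763294438 is 2025-11-16 12:00:38 UTC; "now" is 13 hours later.
if backup_is_stale 1763294438 $(( 1763294438 + 13 * 3600 )); then
  echo "no successful backup in the last 12 hours"
fi
```

In practice the second argument would be `$(date +%s)`. The function uses integer arithmetic, so a metric value shown in floating-point notation needs to be converted to whole seconds first.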
Key Metrics
Several metrics are available in Grafana for monitoring the backups:
| Metric | Description | Link |
|---|---|---|
| `velero_backup_last_successful_timestamp` | Timestamp of last successful backup | View |
| `velero_backup_failure_total` | Total number of failed backups | View |
| `velero_backup_attempt_total` | Total backup attempts | View |
Failure rate query:

    sum(rate(velero_backup_failure_total{env="gprd"}[12h]))
      / sum(rate(velero_backup_attempt_total{env="gprd"}[12h])) * 100

The backups can also be investigated directly on the Kubernetes cluster.
Checking active velero schedule

    $ kubectl exec -n velero deployment/velero -- /velero schedule get # or `velero schedule get`
    NAME                 STATUS    CREATED                         SCHEDULE      BACKUP TTL   LAST BACKUP   SELECTOR   PAUSED
    velero-nats-backup   Enabled   2025-10-27 17:02:51 +0000 UTC   0 */6 * * *   72h0m0s      5h ago        <none>     false

Checking failed backups

    $ kubectl exec -n velero deployment/velero -- /velero backup get # or `velero backup get`
    NAME                                STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
    ...
    velero-nats-backup-20251117060039   Completed   0        0          2025-11-17 07:00:39 +0100 CET   19h       default            <none>
    velero-nats-backup-20251117000039   Completed   0        0          2025-11-17 01:00:39 +0100 CET   13h       default            <none>
    velero-nats-backup-20251116180038   Completed   0        0          2025-11-16 19:00:38 +0100 CET   7h        default            <none>
    velero-nats-backup-20251116120038   Completed   0        0          2025-11-16 13:00:38 +0100 CET   1h        default            <none>

Failed backups show the status `Failed`. We can describe a backup for more details.

    $ kubectl exec -n velero deployment/velero -- /velero backup describe velero-nats-backup-20251116120038
    # Output is abridged. The important fields are `Phase`, `Started`/`Completed`
    # and `Items backed up`; any errors are also listed here.
    Name:         velero-nats-backup-20251116120038
    Namespace:    velero
    ...
    Phase:  Completed

    Namespaces:
      Included:  nats
      Excluded:  <none>

    Resources:
      Included:        *
      Excluded:        <none>
      Cluster-scoped:  included
    ...
    TTL:  72h0m0s
    ...
    Started:    2025-11-16 12:00:38 +0000 UTC
    Completed:  2025-11-16 12:01:03 +0000 UTC

    Expiration:  2025-11-19 12:00:38 +0000 UTC

    Total items to be backed up:  709
    Items backed up:              709

To find out why a particular backup failed, the backup logs can be helpful, though they are quite verbose:

    $ kubectl exec -n velero deployment/velero -- /velero backup logs velero-nats-backup-20251116120038

The Velero documentation also includes a general troubleshooting section.
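
When the `velero backup get` listing is long, the failed entries can be picked out with a small filter. The sketch below assumes the tabular output format, with the name in the first column and the status in the second. The function name `failed_backups` is hypothetical, and the sample listing is made up for illustration rather than taken from a cluster.

```shell
# Print the names of backups whose STATUS column is Failed or PartiallyFailed.
# Reads the tabular output of `velero backup get` on standard input.
failed_backups() {
  awk 'NR > 1 && ($2 == "Failed" || $2 == "PartiallyFailed") { print $1 }'
}

# Made-up sample listing, for illustration only (not real cluster output).
failed_backups <<'EOF'
NAME                           STATUS      ERRORS   WARNINGS
velero-nats-backup-example-2   Completed   0        0
velero-nats-backup-example-1   Failed      1        0
EOF
```

Against a live cluster, the function can be fed directly: `kubectl exec -n velero deployment/velero -- /velero backup get | failed_backups`.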
Manually triggering a backup
A backup can also be triggered manually, either to avoid waiting for the next scheduled run or to watch for errors as the backup runs:

    $ kubectl exec -n velero deployment/velero -- /velero backup create nats-{timestamp} --include-namespaces nats --wait

This creates a backup that includes all resources in the `nats` namespace.
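
The `{timestamp}` placeholder in the command above has to be filled in by hand. One way to do that is sketched below. The timestamp format (UTC, `YYYYMMDDHHMMSS`) and the function names `manual_backup_name` and `trigger_manual_backup` are assumptions, since no naming convention is prescribed here.

```shell
# Build a unique backup name from the current UTC time,
# e.g. nats-20251116120038. The exact naming format is an assumption.
manual_backup_name() {
  echo "nats-$(date -u +%Y%m%d%H%M%S)"
}

# Trigger the backup and wait for it to finish. Defined but not invoked here,
# because it needs access to the cluster.
trigger_manual_backup() {
  kubectl exec -n velero deployment/velero -- \
    /velero backup create "$(manual_backup_name)" --include-namespaces nats --wait
}
```

Calling `trigger_manual_backup` runs the same command as above with a generated name.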
Restore procedure
Backups made by Velero can be restored using the Velero Restore API.
An example resource to execute a restore looks like this:

    apiVersion: velero.io/v1
    kind: Restore
    metadata:
      name: nats-restore
      namespace: velero
      labels:
        restore-type: disaster-recovery
    spec:
      scheduleName: velero-nats-backup
      ## or modify this to point to specific backup created by velero
      # backupName: nats-30-10-2025

      includedNamespaces:
        - nats

      # Restore persistent volumes
      restorePVs: true

      preserveNodePorts: true

      includeClusterResources: true

      includedResources:
        - '*'

This resource can be applied to the cluster, and it will restore all NATS Kubernetes resources onto the cluster.
For more details on the workflow and available configuration, see the Velero restore reference documentation.
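
Applying the Restore resource and following its progress could look roughly like the sketch below. It assumes the manifest shown earlier has been saved as `nats-restore.yaml`, which is a hypothetical file name. The `run` and `restore_nats` helpers and the `DRY_RUN` switch are also illustrative and not part of existing tooling.

```shell
# Sketch of applying the Restore resource and checking its progress.
# `nats-restore.yaml` is a hypothetical file holding the Restore manifest.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

restore_nats() {
  manifest=${1:-nats-restore.yaml}
  run kubectl apply -f "$manifest"
  # List restores, then inspect the one created from the manifest.
  run kubectl exec -n velero deployment/velero -- /velero restore get
  run kubectl exec -n velero deployment/velero -- /velero restore describe nats-restore
}
```

Running `DRY_RUN=1 restore_nats` first prints the three commands without touching the cluster, which makes it easy to confirm the target context before a real restore.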