
NATS Backup

We run NATS on Kubernetes via its Helm chart, which is based on StatefulSets.

We use Velero to back up the complete NATS workload and persist the backups to remote storage. It also takes snapshots of the associated volumes. Backups include all resources in the nats namespace.

Staging environments are backed up daily and production every 6 hours. Both retain backups for 3 days.

| Environment | Frequency | Retention period | Schedule expression |
| --- | --- | --- | --- |
| Staging | Daily | 3 days | `0 2 * * *` |
| Production | Every 6 hours | 3 days | `0 */6 * * *` |

Before troubleshooting or performing restore operations:

  1. Make sure your kubectl context points to the correct cluster.

    # Verify cluster access
    $ kubectl config current-context
    # Switch to the correct cluster if needed
    $ kubectl config use-context <cluster-name>
  2. Velero CLI (optional but recommended)

    • Installation guide
    • With the Velero CLI installed, you can drop the kubectl exec wrapper. For example, run velero backup get instead of kubectl exec -n velero deployment/velero -- /velero backup get.
  3. Required permissions:

    • Read access to the velero and nats namespaces
    • For restore operations: write access to the nats namespace
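
The context check from step 1 can be turned into a guard that runs before any backup or restore command. This is a minimal sketch: the function name `require_context` and the expected context name passed to it are hypothetical and not part of the existing tooling.

```shell
# require_context EXPECTED
# Succeeds only when the active kubectl context equals EXPECTED.
# (Hypothetical helper; EXPECTED is whatever your cluster context is called.)
require_context() {
  expected=$1
  current=$(kubectl config current-context) || return 1
  if [ "$current" != "$expected" ]; then
    echo "kubectl context is '$current', expected '$expected'" >&2
    return 1
  fi
}
```

For example, `require_context <cluster-name> && velero backup get` lists backups only when the active context matches.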

An alert fires when there have been no successful backups in the last 12 hours.

There are several metrics available on Grafana that can be used to monitor the backups:

| Metric | Description | Link |
| --- | --- | --- |
| velero_backup_last_successful_timestamp | Timestamp of last successful backup | View |
| velero_backup_failure_total | Total number of failed backups | View |
| velero_backup_attempt_total | Total backup attempts | View |
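
The 12-hour alert condition can be checked by hand against a velero_backup_last_successful_timestamp value read from Grafana. The sketch below assumes the metric is a Unix timestamp in seconds. The function name `backup_is_stale` is hypothetical, and this is an illustration of the condition, not the actual alert rule.

```shell
# Threshold taken from the alert: 12 hours, expressed in seconds.
STALE_AFTER_SECONDS=$((12 * 60 * 60))

# backup_is_stale LAST_SUCCESS_EPOCH [NOW_EPOCH]
# Returns 0 (true) when the last successful backup is older than the threshold.
# NOW_EPOCH defaults to the current time.
backup_is_stale() {
  last_success=$1
  now=${2:-$(date +%s)}
  [ $((now - last_success)) -gt "$STALE_AFTER_SECONDS" ]
}
```

Usage: `backup_is_stale <timestamp> && echo "no successful backup in the last 12 hours"`.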

Failure rate query:

sum(rate(velero_backup_failure_total{env="gprd"}[12h])) /
sum(rate(velero_backup_attempt_total{env="gprd"}[12h])) * 100
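
Because both rates use the same 12-hour window, the query amounts roughly to failed backups divided by attempted backups over that window, times 100. The arithmetic can be sketched as below. The function name `failure_rate` is hypothetical and the counts are supplied by hand, not read from Prometheus.

```shell
# failure_rate FAILURES ATTEMPTS
# Prints the failure percentage with two decimals,
# or "n/a" when there were no attempts (avoids dividing by zero).
failure_rate() {
  awk -v f="$1" -v a="$2" 'BEGIN {
    if (a == 0) { print "n/a"; exit }
    printf "%.2f\n", f / a * 100
  }'
}
```

With illustrative numbers, one failed backup out of four attempts gives `failure_rate 1 4`, which prints `25.00`.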

The backups can also be investigated directly on the Kubernetes cluster.

$ kubectl exec -n velero deployment/velero -- /velero schedule get # or `velero schedule get`
NAME                 STATUS    CREATED                         SCHEDULE      BACKUP TTL   LAST BACKUP   SELECTOR   PAUSED
velero-nats-backup   Enabled   2025-10-27 17:02:51 +0000 UTC   0 */6 * * *   72h0m0s      5h ago        <none>     false

$ kubectl exec -n velero deployment/velero -- /velero backup get # or `velero backup get`
NAME                                STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
...
velero-nats-backup-20251117060039   Completed   0        0          2025-11-17 07:00:39 +0100 CET   19h       default            <none>
velero-nats-backup-20251117000039   Completed   0        0          2025-11-17 01:00:39 +0100 CET   13h       default            <none>
velero-nats-backup-20251116180038   Completed   0        0          2025-11-16 19:00:38 +0100 CET   7h        default            <none>
velero-nats-backup-20251116120038   Completed   0        0          2025-11-16 13:00:38 +0100 CET   1h        default            <none>

Failed backups show the status Failed. Describe a backup to see more details:

$ kubectl exec -n velero deployment/velero -- /velero backup describe velero-nats-backup-20251116120038
# Output elided for brevity. The important fields are `Phase`, `Started`/`Completed` and `Items backed up`; any errors are also listed here.
Name: velero-nats-backup-20251116120038
Namespace: velero
...
Phase: Completed
Namespaces:
  Included: nats
  Excluded: <none>
Resources:
  Included: *
  Excluded: <none>
  Cluster-scoped: included
...
TTL: 72h0m0s
...
Started: 2025-11-16 12:00:38 +0000 UTC
Completed: 2025-11-16 12:01:03 +0000 UTC
Expiration: 2025-11-19 12:00:38 +0000 UTC
Total items to be backed up: 709
Items backed up: 709
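
To decide which backups are worth describing, the backup listing can be narrowed to rows whose status is not Completed. This is a minimal sketch that parses the table printed by `velero backup get`. It assumes the column layout shown earlier (name first, status second), and the function name `non_completed_backups` is hypothetical.

```shell
# non_completed_backups
# Reads `velero backup get` table output on stdin and prints NAME and STATUS
# for every row whose STATUS column is not "Completed". The header row is skipped.
non_completed_backups() {
  awk 'NR > 1 && NF >= 2 && $2 != "Completed" { print $1, $2 }'
}

# Usage (requires cluster access):
#   velero backup get | non_completed_backups
```

Empty output means every listed backup completed.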

To investigate why a particular backup failed, the backup logs can help, though they are quite verbose:

$ kubectl exec -n velero deployment/velero -- /velero backup logs velero-nats-backup-20251116120038
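
Given how verbose the logs are, filtering them down to warnings and errors is often a useful first pass. The sketch below assumes the log lines carry a `level=` field, as in Velero's usual key=value log format. The function name `backup_log_problems` is hypothetical.

```shell
# backup_log_problems
# Reads `velero backup logs <name>` output on stdin and keeps only lines
# logged at warning or error level. Returns success even when nothing matches.
backup_log_problems() {
  grep -E 'level=(warning|error)' || true
}

# Usage (requires cluster access):
#   velero backup logs velero-nats-backup-20251116120038 | backup_log_problems
```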

The Velero documentation also includes a general troubleshooting section.

You can also trigger a manual backup instead of waiting for the schedule, for example to watch for errors as they occur:

$ kubectl exec -n velero deployment/velero -- /velero backup create nats-{timestamp} --include-namespaces nats --wait

This creates a backup that includes all resources in the nats namespace.
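
The `{timestamp}` placeholder in the command above can be filled in with a generated value. This sketch uses a UTC `YYYYMMDDhhmmss` stamp, the same shape as the suffix on the scheduled backup names. That format choice and the function name `nats_backup_name` are assumptions, not an established convention.

```shell
# nats_backup_name
# Prints a backup name such as nats-20251116120038 (UTC, YYYYMMDDhhmmss).
nats_backup_name() {
  printf 'nats-%s\n' "$(date -u +%Y%m%d%H%M%S)"
}

# Usage (requires cluster access):
#   velero backup create "$(nats_backup_name)" --include-namespaces nats --wait
```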

Backups made by Velero can be restored using its Restore API.

An example resource to execute a restore looks like this:

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: nats-restore
  namespace: velero
  labels:
    restore-type: disaster-recovery
spec:
  scheduleName: velero-nats-backup
  ## or modify this to point to specific backup created by velero
  # backupName: nats-30-10-2025
  includedNamespaces:
    - nats
  # Restore persistent volumes
  restorePVs: true
  preserveNodePorts: true
  includeClusterResources: true
  includedResources:
    - '*'

Applying this resource to the cluster restores all NATS Kubernetes resources onto it.
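
Applying the resource and checking on the restore can be sketched as follows. This assumes the manifest has been saved to a local file. The file name `nats-restore.yaml` and the function name `run_nats_restore` are hypothetical, while `nats-restore` matches the `metadata.name` in the example resource.

```shell
# run_nats_restore MANIFEST
# Applies the Restore manifest, then prints the restore's current status.
run_nats_restore() {
  manifest=$1
  kubectl apply -f "$manifest" &&
    kubectl exec -n velero deployment/velero -- /velero restore describe nats-restore
}

# Usage (requires write access to the cluster):
#   run_nats_restore nats-restore.yaml
```

The restore may still be in progress right after it is applied, so the describe command can be repeated until the phase settles.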

For more details on the workflow and available configuration, see the Velero restore reference documentation.