Teleport Disaster Recovery

The backup practice is based on the official Teleport guide. For more details on how we made decisions and implemented the backup and restore process, please see this epic.

  • Teleport Agents and Proxy Service are stateless.
  • We use the Google Cloud Key Management Service (KMS) to store and handle Teleport certificate authorities.
  • We use Firestore as the storage backend for Teleport and it is shared among all Auth Service instances.
  • We also store the session recordings in an object storage bucket.
  • The configurations, including the teleport.yaml files, are version controlled in our repositories and deployed through CI.

As a result, we only need to back up the Firestore database used by Teleport, which persists both the cluster state and the audit logs.

We do not manage any certificate authorities or private keys inside the cluster. They are all stored in and managed by KMS.

To help guard against data corruption and to verify that data can be decrypted successfully, Cloud KMS periodically scans and backs up all key material and metadata. At regular intervals, the independent backup system backs up the entire datastore to both online and archival storage. This backup allows Cloud KMS to achieve its durability goals.

Please refer to this deep dive document on Google Cloud KMS and automatic backups.

The (default) Firestore database used by the Teleport cluster is backed up daily. These daily backups have a retention period of 30 days for the Teleport staging cluster and 90 days for the Teleport production cluster.

To list the current backup schedules, run the following commands:

$ gcloud firestore backups schedules list --project="gitlab-teleport-production" --database="(default)"
$ gcloud firestore backups schedules list --project="gitlab-teleport-staging" --database="(default)"
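
For reference, a schedule like the ones listed above could be created with a command along the following lines. This is only a sketch: the --retention format and its limits should be checked against the current gcloud documentation, and the value shown simply mirrors the production retention period described above.

$ gcloud firestore backups schedules create \
    --project="gitlab-teleport-production" \
    --database="(default)" \
    --recurrence="daily" \
    --retention="90d"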

To list the current backups, run the following commands:

$ gcloud firestore backups list --project="gitlab-teleport-production"
$ gcloud firestore backups list --project="gitlab-teleport-staging"

The gl-teleport-staging-teleport-sessions and gl-teleport-production-teleport-sessions buckets are used for storing the session recordings.

These buckets use the Multi-Regional location and have soft deletion and versioning enabled.

Objects that have been in the bucket for 30 days will be automatically transitioned to the Nearline storage class (see this). Noncurrent objects (previous versions of objects) that have been noncurrent for 30 days will be automatically deleted (see this).

The combination of multi-region storage, versioning, and soft deletion provides high redundancy and protects against loss of objects (files).
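
As an illustration of those lifecycle rules, an equivalent policy can be expressed as JSON and applied with gcloud. This is only a sketch of the configuration described above, using the staging bucket name as an example:

$ cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": { "type": "SetStorageClass", "storageClass": "NEARLINE" },
      "condition": { "age": 30 }
    },
    {
      "action": { "type": "Delete" },
      "condition": { "daysSinceNoncurrentTime": 30 }
    }
  ]
}
EOF
$ gcloud storage buckets update gs://gl-teleport-staging-teleport-sessions --lifecycle-file=lifecycle.json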

To restore a backup, run one of the following commands (production or staging):

$ gcloud firestore databases restore \
    --project="gitlab-teleport-production" \
    --destination-database="(default)" \
    --source-backup="projects/PROJECT_ID/locations/LOCATION/backups/BACKUP_ID"
$ gcloud firestore databases restore \
    --project="gitlab-teleport-staging" \
    --destination-database="(default)" \
    --source-backup="projects/PROJECT_ID/locations/LOCATION/backups/BACKUP_ID"

This is an asynchronous operation; it returns the long-running operation created for the restore. You can list the operations for a given Firestore database or describe a specific operation as follows.

$ gcloud firestore operations list --project="gitlab-teleport-production" --database="DATABASE"
$ gcloud firestore operations list --project="gitlab-teleport-staging" --database="DATABASE"
$ gcloud firestore operations describe --project="gitlab-teleport-production" "projects/PROJECT_ID/databases/DATABASE/operations/OPERATION_ID"
$ gcloud firestore operations describe --project="gitlab-teleport-staging" "projects/PROJECT_ID/databases/DATABASE/operations/OPERATION_ID"

For more details on how to back up and restore a Firestore database, please see the official documentation.

We automatically verify our daily Firestore backups by running a daily job that restores the latest backup to a new database and checks if the restoration was successful. This process is implemented using Google Cloud Functions and managed as infrastructure-as-code in the teleport-backup Terraform module.

The Firestore backup restoration function, implemented in JavaScript, is available here. Its logic is straightforward: it retrieves the list of backups for the (default) database, selects the latest one based on a timestamp field, and creates an operation to restore it into a new database. The new database’s name starts with restore-test-, followed by the execution ID and backup UUID.
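
A rough manual equivalent of that logic with gcloud looks like the following sketch. The snapshotTime sort field and the restore-test-manual- destination name are illustrative, not what the function literally uses:

# Pick the most recent backup in the project (only the (default) database is backed up).
$ LATEST_BACKUP=$(gcloud firestore backups list \
    --project="gitlab-teleport-staging" \
    --sort-by="~snapshotTime" --limit=1 --format="value(name)")
# Restore it into a fresh test database.
$ gcloud firestore databases restore \
    --project="gitlab-teleport-staging" \
    --destination-database="restore-test-manual-$(date +%s)" \
    --source-backup="${LATEST_BACKUP}"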

The restore function is scheduled to run daily at 6:00 AM Eastern Time (UTC-05:00). The schedule is defined using a Cron pattern here.

You can access this function in the Google Cloud Console at the following locations.

You can view the execution logs in the LOGS section of the pages mentioned above. Each run has a unique execution ID that is prepended to each log message.

The Firestore backup verification function, implemented in JavaScript, can be found here. Its logic is straightforward: it retrieves the list of all Firestore databases in the Teleport project, selects those with names starting with the restore-test- prefix, and iterates through the list of operations for each of them. If any operation has failed, it logs and notifies the failure. Finally, it cleans up the test databases created for restoration.
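
The same checks can be approximated by hand with gcloud, for example (restore-test-DATABASE_ID is a placeholder for one of the test databases):

# List all databases in the project and note the ones with the restore-test- prefix.
$ gcloud firestore databases list --project="gitlab-teleport-staging" --format="value(name)"
# Inspect the restore operations of one of the test databases.
$ gcloud firestore operations list --project="gitlab-teleport-staging" --database="restore-test-DATABASE_ID"
# Clean up the test database afterwards.
$ gcloud firestore databases delete --project="gitlab-teleport-staging" --database="restore-test-DATABASE_ID"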

The verify function is scheduled to run daily at 10:00 AM Eastern Time (UTC-05:00), which is four hours after the restore function. The restoration operations are expected to be completed within this timeframe. The schedule is defined using a Cron pattern here.

You can access this function in the Google Cloud Console at the following locations.

You can view the execution logs in the LOGS section of the pages mentioned above. Each run has a unique execution ID that is prepended to each log message.

If a restoration fails, you’ll see a log message with ERROR severity, starting with a ❌ symbol and providing more details about the failure.

The verification function reports a metric named gitlab_teleport_backup_test_results. This metric is a Gauge that indicates 1 when a restoration operation is successful and 0 when it fails.

Since these functions are short-lived jobs that run and exit quickly, they cannot be scraped by Prometheus directly. Instead, we need to explicitly push these metrics to a Pushgateway. Pushgateway is a long-lived service that collects metrics from short-lived jobs via a simple HTTP API and exposes them at the /metrics endpoint, allowing Prometheus to scrape them as usual.

In each of gprd, gstg, and ops environments, we run a compute instance named blackbox.

This instance runs a Prometheus Pushgateway as a Systemd service via the gitlab-prometheus::pushgateway recipe. The Pushgateway is listening on port 9091.

We use the Pushgateway on the blackbox instance in the ops environment, reachable at http://blackbox.int.ops.gitlab.net:9091. This is an internal address, so the verify Cloud Functions running in the Teleport Google projects (gitlab-teleport-production and gitlab-teleport-staging) need network connectivity to reach it.
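
For illustration, a short-lived job can push the gauge described above to the Pushgateway with a plain HTTP request. The job and project label names below are examples rather than the exact grouping keys used by the Cloud Function:

$ cat <<'EOF' | curl --data-binary @- "http://blackbox.int.ops.gitlab.net:9091/metrics/job/firestore_backup_verify/project/gitlab-teleport-staging"
# TYPE gitlab_teleport_backup_test_results gauge
gitlab_teleport_backup_test_results 1
EOF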

If a restoration operation fails, an email notification will be sent to the [email protected] group using our Mailgun account. We have generated an API key for each Teleport instance (gitlab-teleport-production and gitlab-teleport-staging). The API keys are stored in our 1Password DevOps Vault under a Login named “Teleport”.

The Mailgun API key is accessible to the verify Cloud Function in each environment via a secret named mailgun_api_key in Google Secret Manager. This setup is done manually for ease of access, though you can also store these keys in Vault and replicate them in Google Secret Manager using Terraform. If you regenerate any Mailgun API keys, you must manually update the mailgun_api_key secret in Google Secret Manager and then update the secret version for the Cloud Function here.
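
As a sketch, rotating the key for the staging environment then looks roughly like this (the new key value is piped straight into Secret Manager as a new secret version):

$ printf '%s' "NEW_MAILGUN_API_KEY" | gcloud secrets versions add mailgun_api_key \
    --project="gitlab-teleport-staging" --data-file=-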

Functions are Not Updated After Terraform is Applied

If you make changes to the function source code and apply them in your merge request (using Atlantis or Terraform in pipelines), you may find that your changes are not reflected in the Google Cloud Console.

To resolve this issue, go to Cloud Storage and delete the gitlab-teleport-staging-cloud-functions bucket and another bucket starting with gcf-v2-sources-. Then, delete the function from the Google Cloud Functions web console. Finally, run terraform init and terraform apply from the main branch of config-mgmt locally to redeploy all the resources.
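
The cleanup described above amounts to roughly the following commands. The gcf-v2-sources- bucket name, function name, and region are placeholders to look up in the affected project:

# Delete the buckets holding the function source archives.
$ gcloud storage rm --recursive gs://gitlab-teleport-staging-cloud-functions
$ gcloud storage rm --recursive gs://GCF_V2_SOURCES_BUCKET
# Delete the function itself, then redeploy from the main branch of config-mgmt.
$ gcloud functions delete FUNCTION_NAME --region=REGION --project="gitlab-teleport-staging"
$ terraform init && terraform apply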