
Teleport Administration

This run book covers administration of the Teleport service from an infrastructure perspective.

We run two Teleport clusters. The production cluster is used for managing all GitLab’s infrastructure resources (VMs, databases, etc.). The staging cluster is used for testing new changes, major upgrades, disaster recovery process, etc. Users only need to know about and use the production Teleport cluster available at https://production.teleport.gitlab.net.

The production Teleport cluster currently runs on the ops-central GKE cluster in us-central1, while the staging one runs on the ops-gitlab-gke GKE cluster in us-east1. We run the production Teleport cluster in a different region from the one running our production infrastructure, so we can still access our infrastructure in case there is a region failure in our production region (us-east1).

The infra-as-code for these Teleport clusters can be found in the following locations:

At a very high level, Teleport has two major components: the Teleport cluster and Teleport agents. Teleport agents are processes that run on VMs or Kubernetes clusters and register resources with a Teleport cluster.

At GitLab, we run the Teleport agent on our VMs using the gitlab-teleport cookbook and on our GKE clusters using the teleport-agent Helm chart and release. We also run a Teleport cluster using the teleport-cluster Helm chart and release.

The agents running on VMs register the VMs they run on as servers. The agents running on Kubernetes clusters register the Kubernetes clusters they run on. Kubernetes management with Teleport is currently disabled, since our license does not include this feature. The Teleport agents running on Kubernetes clusters also act as a proxy for registering our PostgreSQL databases.

The Teleport cluster is composed of multiple components:

  • Teleport Auth Service
    • The auth service acts as a certificate authority (CA) for the cluster.
    • It issues certificates for clients and nodes, collects the audit information, and stores it in the audit log.
    • The auth service is run in high-availability mode in our GKE clusters.
    • The auth service can be configured via tctl command line tool.
  • Teleport Proxy Service
    • The proxy is the only service in a cluster visible to the outside world.
    • All user connections for all supported protocols go through the proxy.
    • The proxy also serves the Web UI and allows remote nodes to establish reverse tunnels.
    • The proxy service is also run in high-availability mode in our GKE clusters.
  • Teleport Kubernetes Operator
  • Teleport Slack Plugin
    • Teleport’s Slack integration is used for notifying individuals and #teleport-requests channel.
  • Teleport Event-Handler Plugin
    • Teleport’s event-handler plugin allows securely sending audit events to a Fluentd instance for further processing by SIEM systems.

Certificate-based authentication is the most secure form of authentication. Teleport supports all the necessary certificate management operations to enable certificate-based authentication at scale. You can read more about authentication in Teleport here.

Teleport uses a Role-Based Access Control (RBAC) model. We define roles, and each role specifies what is allowed and what is not allowed. You can read more about authorization in Teleport here.
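For orientation, a Teleport role is a YAML resource whose allow and deny rules define what it grants. The following is a minimal, purely hypothetical example (not one of our actual roles; our real definitions live in Terraform):

```yaml
# Hypothetical role for illustration only.
kind: role
version: v7
metadata:
  name: example-readonly
spec:
  allow:
    logins: ["readonly"]
    node_labels:
      env: ["staging"]
  deny:
    logins: ["root"]
```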

We use the Google Cloud Key Management Service (KMS) to store and handle Teleport certificate authorities.

Teleport generates private key material for its internal Certificate Authorities (CAs) during the first Auth Server’s initial startup. These CAs are used to sign all certificates issued to clients and hosts in the Teleport cluster. When configured to use Google Cloud KMS, all private key material for these CAs will be generated, stored, and used for signing inside of Google Cloud KMS. Instead of the actual private key, Teleport will only store the ID of the KMS key. In short, private key material will never leave Google Cloud KMS.
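As an illustration of how the Auth Service is pointed at KMS (the key ring path below is a placeholder; our real key is defined in Terraform and wired in through the teleport-cluster Helm chart values):

```yaml
# Illustrative sketch only - not our actual configuration values.
auth_service:
  ca_key_params:
    gcp_kms:
      keyring: "projects/example-project/locations/us-central1/keyRings/teleport-ca"
      protection_level: "HSM"
```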

Please refer to this guide for more information on storing Teleport private keys in Google Cloud KMS.

We create a Key Ring and a CryptoKey for the Teleport CA in this file. We then reference this key when installing the teleport-cluster Helm chart here.

We do not manage any certificate authority and private keys inside the cluster. They are all stored in and managed by KMS.

To help guard against data corruption and to verify that data can be decrypted successfully, Cloud KMS periodically scans and backs up all key material and metadata.

Please refer to this deep dive document on Google Cloud KMS and automatic backups.

The PostgreSQL databases are registered with the Teleport instance by Teleport agents running on our regional Kubernetes clusters.

The certificates (gprd and gstg) used by teleport-agent running on Kubernetes clusters should match the PostgreSQL server certificate located at /var/opt/gitlab/postgresql/server.crt.

Furthermore, the CA file found at /var/opt/gitlab/postgresql/cacert on each Patroni or PostgreSQL node should correspond to the certificate authority used by the Teleport instance with which those databases are registered. This file is written by the gitlab-patroni::postgres recipe, which is imported by the gitlab-patroni::default recipe. This recipe retrieves the content of the CA file from gs://gitlab-<env>-secrets/gitlab-patroni/gstg.enc.

If you move databases to a different Teleport instance and update this CA file, please remember to run the select pg_reload_conf(); command from the gitlab-psql shell on each node to reload the updated CA.

Here is an example CR for updating the CA file.

Teleport roles and permissions are defined in roles_*.tf files. If you add a new role, you also need to add it to the roles.tf file.

The associations between Okta groups and Teleport roles are configured in the groups.tf file.

All the services that comprise a Teleport cluster are stateless.

You can check the status of the production Teleport cluster by running the following command, after running the glsh kube use-cluster ops-central command in a separate shell:

$ kubectl exec --namespace=teleport-cluster-production --stdin --tty <teleport-production-auth-pod> -- tctl status
Example output
Cluster production.teleport.gitlab.net
Version 16.1.7
host CA never updated
user CA never updated
db CA never updated
db_client CA never updated
openssh CA never updated
jwt CA never updated
saml_idp CA never updated
oidc_idp CA never updated
spiffe CA never updated
CA pin sha256:d4ac1c9af5d25e6cf3c60c8078efe443c1186c071c99641dcd9b11eb0831f46d

Generally, if the pods are healthy, then the service is healthy.

When our license is about to expire, we need to obtain a new license file and update our Teleport instances.

  • Access to Teleport’s billing dashboard at gitlab-tp.teleport.sh, or support from an admin to download and provide the license for you (see step 1 below).
  • Access to Vault
  • Read/write access to the following Vault paths: k8s/ops-gitlab-gke/teleport-cluster-staging/, k8s/ops-central/teleport-cluster-production/
  • Access to the Kubernetes API, or support from a site reliability or infrastructure engineer to access Kubernetes on your behalf.

Log in to Teleport’s billing dashboard at gitlab-tp.teleport.sh as an admin and download the new license file (license.pem).

Notes:

  • We use the same license for both the staging and production Teleport instances.
  • If you don’t have an account, ask an admin user to download the license and share the license file with you through a secure channel (e.g. 1Password). Admin users include the business owners listed in the tech stack - filter for “Teleport”.

Step 2. Update the license stored in Vault


Approach 1: via Vault GUI

  1. Log in to vault and navigate to the Teleport cluster

Staging: ops-gitlab-gke/teleport-cluster-staging/
Production: ops-central/teleport-cluster-production/

  2. Click on license
  3. Click on the Secret tab
  4. Click Create new version +
  5. Delete the contents of the text box to the right of license.pem, and paste the full contents of the license file obtained in Step 1
  6. Click the Show diff toggle and confirm the contents have changed
  7. Click Save
  8. Take note of the Current version

Approach 2: via Vault CLI

  1. Open SSH proxy
Terminal window
# Run this command from the runbooks repo
$ glsh vault proxy
  2. Update the license

Staging:

Terminal window
# Write the new license to Vault
$ vault login -method oidc
$ vault kv put k8s/ops-gitlab-gke/teleport-cluster-staging/license [email protected]

Production:

Terminal window
# Write the new license to Vault
$ vault login -method oidc
$ vault kv put k8s/ops-central/teleport-cluster-production/license [email protected]

Update the version for secretKey:license.pem in the argocd/apps repository:

Staging: services/teleport-cluster/env/ops/clusters/ops-gitlab-gke/values-vault-secrets.yaml
Production: services/teleport-cluster/env/ops/clusters/ops-central/values-vault-secrets.yaml
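The exact schema lives in the argocd/apps repository; as a purely hypothetical illustration, the change is a one-line version bump for the license secret key:

```yaml
# Hypothetical shape - consult the actual values-vault-secrets.yaml for the real keys.
secrets:
  license:
    secretKey: license.pem
    version: "7"   # bump to the Current version noted in Vault
```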

The auth service should automatically restart after merging the Helm chart changes from the previous step.

If this does not happen, it can be manually restarted by following these steps:

Staging:

Terminal window
# Run this command from the runbooks repo
$ glsh kube use-cluster ops
# Restart the teleport auth pods
$ kubectl rollout restart deployment/teleport-staging-auth --namespace=teleport-cluster-staging

Production:

Terminal window
# Run this command from the runbooks repo
$ glsh kube use-cluster ops-central
# Restart the teleport auth pods
$ kubectl rollout restart deployment/teleport-production-auth --namespace=teleport-cluster-production

Read more about the Enterprise License file here and managing it here.

Past license rotation issues:

If the license hasn’t updated as expected after bumping the secret version, first check that the license secret contains what you expect:

Terminal window
kubectl get secrets license -n teleport-cluster-production -o json | jq -r '.data."license.pem"' | base64 -d

If it doesn’t, check what the refreshInterval of the ExternalSecret is:

Terminal window
kubectl get es license -n teleport-cluster-production -o json | jq .spec.refreshInterval

If refreshInterval is set to 0, external-secrets will never update the secret from Vault, so you will need to change the refreshInterval to something non-zero (e.g. 1h).
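A minimal sketch of that fix (assuming the ExternalSecret is named license, as above):

```shell
# Set a non-zero refresh interval so external-secrets re-syncs the
# license secret from Vault every hour.
kubectl patch externalsecret license \
  --namespace=teleport-cluster-production \
  --type=merge \
  --patch '{"spec":{"refreshInterval":"1h"}}'
```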

Terraform is used to manage resources in the Teleport cluster. See this guide for more information on the Terraform provider.

Terraform authenticates to Teleport using Machine & Workload Identity via two bots, each using a different join method depending on where Terraform runs. Credentials are issued automatically via the respective join method - no manual rotation is required.

This bot is used by Terraform when running in GitLab CI. See the token and bot definition in the teleport-cluster-config chart.

  • Join method: gitlab
  • Token name: terraform-gitlab-bot-token
  • Restricted to: project gitlab-com/gl-infra/config-mgmt on ops.gitlab.net, main branch, protected refs only
  • Role: terraform-cluster-manager

This bot is used by Terraform when running via Atlantis on GCP. See the token and bot definition in the teleport-cluster-config chart.

  • Join method: gcp
  • Token name: terraform-gcp-bot-token
  • GCP service account: [email protected] in project gitlab-ops
  • Role: terraform-cluster-manager

Both bots share the terraform-cluster-manager role, which has broad CRUD permissions to manage Teleport resources (roles, bots, tokens, users, etc.).

The Teleport resources supporting this are defined in the teleport-cluster-config chart and must be applied before Terraform runs.

  • The following resources are created via the teleport-cluster-config chart:
    • Role terraform-cluster-manager
    • TeleportProvisionToken terraform-gitlab-bot-token (GitLab join method)
    • TeleportProvisionToken terraform-gcp-bot-token (GCP join method)
    • TeleportBotV1 terraform-gitlab-bot
    • TeleportBotV1 terraform-gcp-bot

The teleport-plugin-slack is used for communicating with Slack. See this guide for running this plugin.

This plugin authenticates to Teleport using Machine & Workload Identity via tbot. Credentials are automatically renewed by tbot - no manual rotation is required.

tbot is deployed as a separate deployment within the teleport-plugin-slack Helm release. On startup, it joins the Teleport cluster using the slack-bot-token provision token via the Kubernetes JWKS join method, authenticated by the pod’s Kubernetes ServiceAccount (using the static_jwks sub-type). Once joined, tbot continuously renews short-lived credentials and writes them to the teleport-{env}-slack-tbot-out Kubernetes secret, which the Slack plugin reads to authenticate its requests to Teleport.
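For orientation, a tbot configuration for this flow looks roughly like the following. This is illustrative only; the real configuration is rendered by the Helm release, and the names here are placeholders based on the description above:

```yaml
# Illustrative sketch - names and values are placeholders.
version: v2
proxy_server: production.teleport.gitlab.net:443
onboarding:
  join_method: kubernetes
  token: slack-bot-token
outputs:
  - type: identity
    destination:
      type: kubernetes_secret
      name: teleport-production-slack-tbot-out
```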

The Teleport resources supporting this are defined in the teleport-cluster-config chart and must be applied before the plugin deploys.

  • This plugin is installed using the teleport-plugin-slack Helm chart (see the ArgoCD app.yaml).
  • The following resources are created via the teleport-cluster-config chart:
    • Role slack-access-requests-viewer
    • Role slack-access-requests-manager
    • TeleportProvisionToken slack-bot-token (Kubernetes JWKS join method)
    • TeleportBotV1 slack-bot

The teleport-plugin-event-handler is used for handling Teleport audit events and sending them to a Fluentd instance, so such events can be further shipped to other systems (e.g. SIEM) for security and auditing purposes. Read more about exporting Teleport audit events here.

See this guide and this one for further instructions.

This plugin authenticates to Teleport using Machine & Workload Identity via tbot. Credentials are automatically renewed by tbot - no manual rotation is required.

tbot is deployed as a separate deployment within the teleport-plugin-event-handler Helm release. On startup, it joins the Teleport cluster using the event-handler-bot-token provision token via the Kubernetes JWKS join method, authenticated by the pod’s Kubernetes ServiceAccount (using the static_jwks sub-type). Once joined, tbot continuously renews short-lived credentials and writes them to the teleport-{env}-event-handler-tbot-out Kubernetes secret, which the event-handler plugin reads to authenticate its requests to Teleport.

The Teleport resources supporting this are defined in the teleport-cluster-config chart and must be applied before the plugin deploys.

  • This plugin is installed using the teleport-plugin-event-handler Helm chart (see the ArgoCD app.yaml).
  • The following resources are created via the teleport-cluster-config chart:
    • Role event-handler-events-sessions-viewer
    • TeleportProvisionToken event-handler-bot-token (Kubernetes JWKS join method)
    • TeleportBotV1 event-handler-bot

The teleport-plugin-event-handler requires mutual TLS to be enabled on the Fluentd instance for security purposes. The certificate authority, server key and certificate, and client key and certificate are stored in Vault at k8s/ops-gitlab-gke/teleport-cluster-staging/fluentd-certs and k8s/ops-central/teleport-cluster-production/fluentd-certs.

For renewing the certificates, create the following files in a directory.

openssl.conf
[req]
default_bits = 4096
distinguished_name = req_distinguished_name
string_mask = utf8only
default_md = sha256
x509_extensions = v3_ca
[req_distinguished_name]
countryName = Country Name (2 letter code)
stateOrProvinceName = State or Province Name
localityName = Locality Name
0.organizationName = Organization Name
organizationalUnitName = Organizational Unit Name
commonName = Common Name
emailAddress = Email Address
countryName_default =
stateOrProvinceName_default =
localityName_default =
0.organizationName_default = GitLab Inc.
organizationalUnitName_default = Teleport
commonName_default = localhost
emailAddress_default =
[v3_ca]
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid:always,issuer
basicConstraints = critical, CA:true, pathlen: 0
keyUsage = critical, cRLSign, keyCertSign
[client_cert]
basicConstraints = CA:FALSE
nsCertType = client, email
nsComment = "OpenSSL Generated Client Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, emailProtection
[crl_ext]
authorityKeyIdentifier = keyid:always
[ocsp]
basicConstraints = CA:FALSE
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, digitalSignature
extendedKeyUsage = critical, OCSPSigning
[staging_server_cert]
basicConstraints = CA:FALSE
nsCertType = server
nsComment = "OpenSSL Generated Server Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer:always
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @staging_alt_names
[staging_alt_names]
DNS.0 = teleport-staging-fluentd-headless.teleport-cluster-staging.svc.cluster.local
DNS.1 = *.teleport-staging-fluentd-headless.teleport-cluster-staging.svc.cluster.local
DNS.2 = teleport-staging-fluentd-aggregator.teleport-cluster-staging.svc.cluster.local
DNS.3 = *.teleport-staging-fluentd-aggregator.teleport-cluster-staging.svc.cluster.local
[production_server_cert]
basicConstraints = CA:FALSE
nsCertType = server
nsComment = "OpenSSL Generated Server Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer:always
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @production_alt_names
[production_alt_names]
DNS.0 = teleport-production-fluentd-headless.teleport-cluster-production.svc.cluster.local
DNS.1 = *.teleport-production-fluentd-headless.teleport-cluster-production.svc.cluster.local
DNS.2 = teleport-production-fluentd-aggregator.teleport-cluster-production.svc.cluster.local
DNS.3 = *.teleport-production-fluentd-aggregator.teleport-cluster-production.svc.cluster.local
Makefile
key_len := 4096
staging_dir := staging
production_dir := production

.PHONY: gen-staging
gen-staging:
	mkdir -p $(staging_dir)
	rm -f $(staging_dir)/*
	openssl genrsa -out $(staging_dir)/ca.key $(key_len)
	chmod 444 $(staging_dir)/ca.key
	openssl req -config openssl.conf -key $(staging_dir)/ca.key -new -x509 -days 3650 -sha256 -extensions v3_ca -subj "/CN=ca" -out $(staging_dir)/ca.crt
	openssl genrsa -out $(staging_dir)/client.key $(key_len)
	chmod 444 $(staging_dir)/client.key
	openssl req -config openssl.conf -subj "/CN=teleport-event-handler" -key $(staging_dir)/client.key -new -out $(staging_dir)/client.csr
	openssl x509 -req -in $(staging_dir)/client.csr -CA $(staging_dir)/ca.crt -CAkey $(staging_dir)/ca.key -CAcreateserial -days 365 -out $(staging_dir)/client.crt -extfile openssl.conf -extensions client_cert
	openssl genrsa -out $(staging_dir)/server.key $(key_len)
	chmod 444 $(staging_dir)/server.key
	openssl req -config openssl.conf -subj "/CN=fluentd-aggregator" -key $(staging_dir)/server.key -new -out $(staging_dir)/server.csr
	openssl x509 -req -in $(staging_dir)/server.csr -CA $(staging_dir)/ca.crt -CAkey $(staging_dir)/ca.key -CAcreateserial -days 365 -out $(staging_dir)/server.crt -extfile openssl.conf -extensions staging_server_cert

.PHONY: gen-production
gen-production:
	mkdir -p $(production_dir)
	rm -f $(production_dir)/*
	openssl genrsa -out $(production_dir)/ca.key $(key_len)
	chmod 444 $(production_dir)/ca.key
	openssl req -config openssl.conf -key $(production_dir)/ca.key -new -x509 -days 3650 -sha256 -extensions v3_ca -subj "/CN=ca" -out $(production_dir)/ca.crt
	openssl genrsa -out $(production_dir)/client.key $(key_len)
	chmod 444 $(production_dir)/client.key
	openssl req -config openssl.conf -subj "/CN=teleport-event-handler" -key $(production_dir)/client.key -new -out $(production_dir)/client.csr
	openssl x509 -req -in $(production_dir)/client.csr -CA $(production_dir)/ca.crt -CAkey $(production_dir)/ca.key -CAcreateserial -days 365 -out $(production_dir)/client.crt -extfile openssl.conf -extensions client_cert
	openssl genrsa -out $(production_dir)/server.key $(key_len)
	chmod 444 $(production_dir)/server.key
	openssl req -config openssl.conf -subj "/CN=fluentd-aggregator" -key $(production_dir)/server.key -new -out $(production_dir)/server.csr
	openssl x509 -req -in $(production_dir)/server.csr -CA $(production_dir)/ca.crt -CAkey $(production_dir)/ca.key -CAcreateserial -days 365 -out $(production_dir)/server.crt -extfile openssl.conf -extensions production_server_cert

Run the make gen-staging command to generate certificates for the teleport-staging cluster, and the make gen-production command for the teleport-production cluster. Next, update the Vault secret as follows.

Terminal window
$ vault login -method oidc
$ vault kv put k8s/ops-central/teleport-cluster-production/fluentd-certs \
ca.crt="$(cat production/ca.crt)" \
server.crt="$(cat production/server.crt)" \
server.key="$(cat production/server.key)" \
client.crt="$(cat production/client.crt)" \
client.key="$(cat production/client.key)"
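It is worth sanity-checking the certificates around the upload. A sketch, assuming make gen-production wrote its output into the production/ directory as in the Makefile above:

```shell
# Verify the freshly generated certs chain back to the new CA.
openssl verify -CAfile production/ca.crt production/server.crt
openssl verify -CAfile production/ca.crt production/client.crt

# Read the CA back out of Vault and inspect its subject and expiry.
vault kv get -field=ca.crt k8s/ops-central/teleport-cluster-production/fluentd-certs \
  | openssl x509 -noout -subject -enddate
```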

Finally, get the latest secret version from the Vault and update the teleport-cluster release here.

Fluentd uses the fluent-plugin-gcloud-pubsub-custom gem to send the audit events to the following Google Cloud Pub/Sub topics:

  • projects/gitlab-teleport-staging/topics/teleport-staging-events
  • projects/gitlab-teleport-production/topics/teleport-production-events

We use a custom-built OCI (Docker) image with the fluent-plugin-gcloud-pubsub-custom gem baked into the image. You can update/modify this image here.

We use Workload Identity for authenticating to Google Cloud. The Workload Identity is configured here and the required Roles are configured here.
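As a rough sketch of what the Fluentd output side looks like with that gem (illustrative only; the match tag and parameter set are assumptions, and the real configuration lives in the teleport-cluster release):

```
# Illustrative Fluentd output block - tag pattern is a placeholder.
<match teleport.**>
  @type gcloud_pubsub
  project gitlab-teleport-production
  topic teleport-production-events
</match>
```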

When you add a new resource, there are manual steps required to initialize the new resource with the Teleport cluster.

  1. Your new resource should have the gitlab-teleport Chef cookbook as part of its run list.
  2. Your new resource should also have network access to talk to the Teleport proxy in the cluster.
  3. After applying a new token to register a node, the next chef-client run should remove the secret from the config file.
  1. On your local machine, log into tsh.
  2. Use tctl tokens add --ttl=5m --type=node to generate a new token.
  3. On the resource node, stop teleport with systemctl stop teleport.
  4. On the resource node, edit the /etc/teleport/config.yaml file, and add the generated token to the token_name field.
  5. Start up the teleport service again on the resource node: systemctl start teleport.
  1. You will need to add the new cluster resources to the K8s Teleport Agent.
  2. You can reference sections earlier in this document about how the certificates are used to verify access to the database.
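The VM registration steps above can be condensed into the following sketch. The sed one-liner is just an illustrative shortcut for the manual edit of the token_name field; paste the real token printed by tctl in place of the placeholder:

```shell
# On your workstation: authenticate, then mint a short-lived node join token.
tsh login --proxy=production.teleport.gitlab.net
tctl tokens add --ttl=5m --type=node

# On the resource node: stop teleport, set token_name in the config, restart.
sudo systemctl stop teleport
sudo sed -i 's/^\( *token_name:\).*/\1 "<token-from-tctl>"/' /etc/teleport/config.yaml
sudo systemctl start teleport
```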

If you have added a new db-type (for example), you may need to add it to existing roles defined in Terraform. Here is an example merge request to add the new sec db-type to existing roles.

Join Services with a Secure Token