
Teleport Administration

This runbook covers administration of the Teleport service from an infrastructure perspective.

We run two Teleport clusters. The production cluster is used for managing all GitLab’s infrastructure resources (VMs, databases, etc.). The staging cluster is used for testing new changes, major upgrades, disaster recovery process, etc. Users only need to know about and use the production Teleport cluster available at https://production.teleport.gitlab.net.

The production Teleport cluster currently runs on the ops-central GKE cluster in us-central1, while the staging one runs on the ops-gitlab-gke GKE cluster in us-east1. We run the production Teleport cluster in a different region than the one hosting our production infrastructure, so we can still access our infrastructure in case of a region failure in our production region (us-east1).

The infra-as-code for these Teleport clusters can be found in the following locations:

At a very high level, Teleport has two major components: the Teleport cluster and Teleport agents. Teleport agents are processes that run on VMs or Kubernetes clusters and register resources with a Teleport cluster.

At GitLab, we run the Teleport agent on our VMs using the gitlab-teleport cookbook and on our GKE clusters using the teleport-agent Helm chart and release. We also run a Teleport cluster using the teleport-cluster Helm chart and release.

The agents running on VMs register the VMs they are running on as servers. The agents running on Kubernetes clusters register the Kubernetes clusters they are running on. Kubernetes management with Teleport is disabled at the moment, since our license does not include this feature. The Teleport agents running on Kubernetes clusters also act as a proxy for registering our PostgreSQL databases.

The Teleport cluster is comprised of multiple components.

  • Teleport Auth Service
    • The auth service acts as a certificate authority (CA) for the cluster.
    • It issues certificates for clients and nodes, collects the audit information, and stores it in the audit log.
    • The auth service is run in high-availability mode in our GKE clusters.
    • The auth service can be configured via the tctl command-line tool.
  • Teleport Proxy Service
    • The proxy is the only service in a cluster visible to the outside world.
    • All user connections for all supported protocols go through the proxy.
    • The proxy also serves the Web UI and allows remote nodes to establish reverse tunnels.
    • The proxy service is also run in high-availability mode in our GKE clusters.
  • Teleport Kubernetes Operator
  • Teleport Slack Plugin
    • Teleport’s Slack integration is used for notifying individuals and the #teleport-requests channel.
  • Teleport Event-Handler Plugin
    • Teleport’s event-handler plugin allows securely sending audit events to a Fluentd instance for further processing by SIEM systems.

Certificate-based authentication is the most secure form of authentication. Teleport supports all the necessary certificate management operations to enable certificate-based authentication at scale. You can read more about authentication in Teleport here.

Teleport uses the Role-Based Access Control (RBAC) model. We define roles, and each role specifies what is and is not allowed. You can read more about authorization in Teleport here.
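
For example, a quick way to inspect the roles defined in the cluster (assuming you are logged in with tsh as a user allowed to read role resources; the role name below is a placeholder):

Terminal window
# List every role defined in the cluster.
$ tctl get roles --format=text
# Dump the full definition of a single role.
$ tctl get roles/<role-name>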

We use the Google Cloud Key Management Service (KMS) to store and handle Teleport certificate authorities.

Teleport generates private key material for its internal Certificate Authorities (CAs) during the first Auth Server’s initial startup. These CAs are used to sign all certificates issued to clients and hosts in the Teleport cluster. When configured to use Google Cloud KMS, all private key material for these CAs will be generated, stored, and used for signing inside of Google Cloud KMS. Instead of the actual private key, Teleport will only store the ID of the KMS key. In short, private key material will never leave Google Cloud KMS.

Please refer to this guide for more information on storing Teleport private keys in Google Cloud KMS.

We create a Key Ring and a CryptoKey for the Teleport CA in this file. We then reference this key when installing the teleport-cluster Helm chart here.
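
As a sanity check, you can confirm the key exists and inspect its protection level with gcloud. The key ring, key, location, and project names below are placeholders, not the actual values from that file:

Terminal window
# List the keys in the Teleport CA key ring (names are placeholders).
$ gcloud kms keys list --keyring=teleport-ca --location=us-central1 --project=<ops-project>
# Show metadata (purpose, protection level) for one key.
$ gcloud kms keys describe teleport-ca-key --keyring=teleport-ca --location=us-central1 --project=<ops-project>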

We do not manage any certificate authority and private keys inside the cluster. They are all stored in and managed by KMS.

To help guard against data corruption and to verify that data can be decrypted successfully, Cloud KMS periodically scans and backs up all key material and metadata.

Please refer to this deep dive document on Google Cloud KMS and automatic backups.

The PostgreSQL databases are registered with the Teleport instance by Teleport agents running on our regional Kubernetes clusters.

The certificates (gprd and gstg) used by the teleport-agent running on Kubernetes clusters should match the PostgreSQL server certificate located at /var/opt/gitlab/postgresql/server.crt.
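
One way to verify they match is to compare SHA-256 fingerprints on both sides (a sketch; the local file name is illustrative and should be whichever certificate the agent is configured with):

Terminal window
# On a Patroni/PostgreSQL node: fingerprint of the server certificate.
$ sudo openssl x509 -noout -fingerprint -sha256 -in /var/opt/gitlab/postgresql/server.crt
# Locally: fingerprint of the certificate used by the teleport-agent.
$ openssl x509 -noout -fingerprint -sha256 -in gprd.crt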

Furthermore, the CA file found at /var/opt/gitlab/postgresql/cacert on each Patroni or PostgreSQL node should correspond to the certificate authority used by the Teleport instance with which those databases are registered. This file is written by the gitlab-patroni::postgres recipe, which is imported by the gitlab-patroni::default recipe. This recipe retrieves the content of the CA file from gs://gitlab-<env>-secrets/gitlab-patroni/gstg.enc.

If you move databases to a different Teleport instance and update this CA file, please remember to run the select pg_reload_conf(); command from the gitlab-psql shell on each node to reload the updated CA.
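
For example, assuming the gitlab-psql wrapper is available on the node:

Terminal window
# Reload the PostgreSQL configuration (including the updated CA) without a restart.
$ sudo gitlab-psql -c "select pg_reload_conf();"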

Here is an example CR for updating the CA file.

Teleport roles and permissions are defined in roles_*.tf files. If you add a new role, you also need to add it to the roles.tf file.

The association between Okta groups and Teleport roles is configured in the groups.tf file.

All the services that comprise a Teleport cluster are stateless.

You can check the status of the production Teleport cluster by running the following command, after running the glsh kube use-cluster ops-central command in a separate shell:

$ kubectl exec --namespace=teleport-cluster-production --stdin --tty <teleport-production-auth-pod> -- tctl status
Example output
Cluster production.teleport.gitlab.net
Version 16.1.7
host CA never updated
user CA never updated
db CA never updated
db_client CA never updated
openssh CA never updated
jwt CA never updated
saml_idp CA never updated
oidc_idp CA never updated
spiffe CA never updated
CA pin sha256:d4ac1c9af5d25e6cf3c60c8078efe443c1186c071c99641dcd9b11eb0831f46d

Generally, if the pods are healthy, then the service is healthy.

We use the same license for both the staging and production Teleport instances.

When our license is about to expire, we need to obtain a new license file and update our Teleport instances with it. Read more about the Enterprise license file here and managing it here. In short, you need to log in to gitlab-tp.teleport.sh as an admin and download the new license file (license.pem). You can also ask an admin user to do so and share the license file with you through a secure channel (1Password). Admin users include the business owners listed in the tech stack - search for “title: teleport”.

Add the new license to Vault.

Terminal window
$ glsh vault proxy
$ vault login -method oidc
$ vault kv put k8s/ops-central/teleport-cluster-production/license license.pem=@license.pem

Grab the latest version from the output of the last command and update it here.
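
If the version scrolled past, you can read it back from Vault directly:

Terminal window
$ vault kv get -format=json k8s/ops-central/teleport-cluster-production/license | jq '.data.metadata.version'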

Finally, restart the Teleport Auth component.

Terminal window
$ glsh kube use-cluster ops-central
$ kubectl rollout restart deployment/teleport-production-auth --namespace=teleport-cluster-production
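
Optionally, wait for the rollout to finish before verifying the new license:

Terminal window
$ kubectl rollout status deployment/teleport-production-auth --namespace=teleport-cluster-production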

If the license hasn’t updated as expected after bumping the secret version, first check that the license secret contains what you expect:

Terminal window
kubectl get secrets license -n teleport-cluster-production -o json | jq -r '.data."license.pem"' | base64 -d

If it doesn’t, check what the refreshInterval of the ExternalSecret is:

Terminal window
kubectl get es license -n teleport-cluster-production -o json | jq .spec.refreshInterval

If refreshInterval is set to 0, external-secrets will never update the secret from Vault, so you will need to change the refreshInterval to something non-zero (e.g. 1h).
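
One way to do that is a merge patch on the ExternalSecret; if the value is managed by a Helm release, make the same change there as well so it is not reverted on the next sync:

Terminal window
$ kubectl patch es license -n teleport-cluster-production --type=merge -p '{"spec":{"refreshInterval":"1h"}}'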

Terraform needs a user with the required permissions to manage the Teleport cluster. See this guide for more information.

Currently, we use a terraform user and generate an auth identity for this user using impersonation. We generate this identity to be valid for one year, so it needs to be regenerated once a year. We should switch to Machine ID, which does not require manual renewal. Please refer to this issue.

  • The following roles and user are created via the teleport-bootstrap chart.
    • Role terraform-cluster-manager
    • Role terraform-impersonator
    • User terraform

Use the following Makefile if you need to update or rotate the auth identity for the terraform user.

Makefile
okta_group := GitLab - SRE
impersonator_role := terraform-impersonator
target_user := terraform
authid_ttl := 8760h
define update_auth_id
tsh login --proxy=$(1).teleport.gitlab.net
tctl get saml/okta --with-secrets > user-saml-$(1).yaml
# Add the impersonator role to the Okta group
yq eval '.spec.attributes_to_roles[] |= select(.value == "$(okta_group)") .roles += ["$(impersonator_role)"]' -i user-saml-$(1).yaml
tctl create -f user-saml-$(1).yaml
# Login again to assume the new role
tsh logout
tsh login --proxy=$(1).teleport.gitlab.net
# Request and sign an identity certificate for the user
tctl auth sign --user=$(target_user) --ttl=$(authid_ttl) --out=auth-id-$(1)
# Write the new auth identity to Vault
vault login -method oidc
vault kv patch -mount=ci ops-gitlab-net/gitlab-com/gl-infra/config-mgmt/teleport-$(1)/teleport auth_id=@auth-id-$(1)
# Remove created files
rm user-saml-$(1).yaml auth-id-$(1)
# Display the latest version of Vault secret
vault kv get -format=json -mount=ci ops-gitlab-net/gitlab-com/gl-infra/config-mgmt/teleport-$(1)/teleport | jq '.data.metadata.version' | xargs -I {} echo "The latest secret version: {}"
endef
.PHONY: update-production
update-production:
	$(call update_auth_id,production)
.PHONY: update-staging
update-staging:
	$(call update_auth_id,staging)
  1. Run the glsh vault proxy command in a separate terminal.
  2. Run make update-staging or make update-production.
  3. Terraform will pick up the latest version of the Vault secret.

The teleport-plugin-slack is used for communicating with Slack. See this guide for running this plugin.

Currently, we use a slack user and generate an auth identity for this user using impersonation. We generate this identity to be valid for one year, so it needs to be regenerated once a year. We should switch to Machine ID for running this plugin, which does not require manual renewal. Please refer to this issue.

  • This plugin is installed using teleport-plugin-slack Helm chart (see this).
  • The following roles and user are created via the teleport-bootstrap chart.
    • Role slack-access-requests-viewer
    • Role slack-access-requests-manager
    • Role slack-impersonator
    • User slack

Use the following Makefile if you need to update or rotate the auth identity for the slack user.

Makefile
okta_group := GitLab - SRE
impersonator_role := slack-impersonator
target_user := slack
authid_ttl := 8760h
define update_auth_id
$(eval cluster=$(if $(filter production,$(1)),ops-central,$(if $(filter staging,$(1)),ops-gitlab-gke,invalid)))
tsh login --proxy=$(1).teleport.gitlab.net
tctl get saml/okta --with-secrets > user-saml-$(1).yaml
# Add the impersonator role to the Okta group
yq eval '.spec.attributes_to_roles[] |= select(.value == "$(okta_group)") .roles += ["$(impersonator_role)"]' -i user-saml-$(1).yaml
tctl create -f user-saml-$(1).yaml
# Login again to assume the new role
tsh logout
tsh login --proxy=$(1).teleport.gitlab.net
# Request and sign an identity certificate for the user
tctl auth sign --user=$(target_user) --ttl=$(authid_ttl) --out=auth-id-$(1)
# Write the new auth identity to Vault
vault login -method oidc
vault kv patch -mount=k8s $(cluster)/teleport-cluster-$(1)/slack auth_id=@auth-id-$(1)
# Remove created files
rm user-saml-$(1).yaml auth-id-$(1)
# Display the latest version of Vault secret
vault kv get -format=json -mount=k8s $(cluster)/teleport-cluster-$(1)/slack | jq '.data.metadata.version' | xargs -I {} echo "The latest secret version: {}"
endef
.PHONY: update-production
update-production:
	$(call update_auth_id,production)
.PHONY: update-staging
update-staging:
	$(call update_auth_id,staging)
  1. Run the glsh vault proxy command in a separate terminal.

  2. Run make update-staging or make update-production.

  3. The last line of the output shows the latest version of the Vault secret. Grab it and update the teleport-cluster release with it (staging and production).

  4. Restart the slack deployment (teleport-staging-slack or teleport-production-slack) to pick up the new auth identity.

    Terminal window
    # Staging
    $ glsh kube use-cluster ops
    $ kubectl rollout restart deployment/teleport-staging-slack --namespace=teleport-cluster-staging
    # Production
    $ glsh kube use-cluster ops-central
    $ kubectl rollout restart deployment/teleport-production-slack --namespace=teleport-cluster-production

The teleport-plugin-event-handler is used for handling Teleport audit events and sending them to a Fluentd instance, so such events can be further shipped to other systems (e.g. a SIEM) for security and auditing purposes. Read more about exporting Teleport audit events here.

See this guide and this one for further instructions.

Currently, we use an event-handler user and generate an auth identity for this user using impersonation. We generate this identity to be valid for one year, so it needs to be regenerated once a year. We should switch to Machine ID for running this plugin, which does not require manual renewal. Please refer to this issue.

Use the following Makefile if you need to update or rotate the auth identity for the event-handler user.

Makefile
okta_group := GitLab - SRE
impersonator_role := event-handler-impersonator
target_user := event-handler
authid_ttl := 8760h
define update_auth_id
$(eval cluster=$(if $(filter production,$(1)),ops-central,$(if $(filter staging,$(1)),ops-gitlab-gke,invalid)))
tsh login --proxy=$(1).teleport.gitlab.net
tctl get saml/okta --with-secrets > user-saml-$(1).yaml
# Add the impersonator role to the Okta group
yq eval '.spec.attributes_to_roles[] |= select(.value == "$(okta_group)") .roles += ["$(impersonator_role)"]' -i user-saml-$(1).yaml
tctl create -f user-saml-$(1).yaml
# Login again to assume the new role
tsh logout
tsh login --proxy=$(1).teleport.gitlab.net
# Request and sign an identity certificate for the user
tctl auth sign --user=$(target_user) --ttl=$(authid_ttl) --out=auth-id-$(1)
# Write the new auth identity to Vault
vault login -method oidc
vault kv patch -mount=k8s $(cluster)/teleport-cluster-$(1)/event-handler auth_id=@auth-id-$(1)
# Remove created files
rm user-saml-$(1).yaml auth-id-$(1)
# Display the latest version of Vault secret
vault kv get -format=json -mount=k8s $(cluster)/teleport-cluster-$(1)/event-handler | jq '.data.metadata.version' | xargs -I {} echo "The latest secret version: {}"
endef
.PHONY: update-production
update-production:
	$(call update_auth_id,production)
.PHONY: update-staging
update-staging:
	$(call update_auth_id,staging)
  1. Run the glsh vault proxy command in a separate terminal.

  2. Run make update-staging or make update-production.

  3. The last line of the output shows the latest version of the Vault secret. Grab it and update the teleport-cluster release with it (staging and production).

  4. Restart the event-handler deployment (teleport-staging-event-handler or teleport-production-event-handler) to pick up the new auth identity.

    Terminal window
    # Staging
    $ glsh kube use-cluster ops
    $ kubectl rollout restart deployment/teleport-staging-event-handler --namespace=teleport-cluster-staging
    # Production
    $ glsh kube use-cluster ops-central
    $ kubectl rollout restart deployment/teleport-production-event-handler --namespace=teleport-cluster-production

The teleport-plugin-event-handler requires mutual TLS to be enabled on the Fluentd instance for security purposes. The certificate authority, server key and certificate, and client key and certificate are stored in Vault at k8s/ops-gitlab-gke/teleport-cluster-staging/fluentd-certs and k8s/ops-central/teleport-cluster-production/fluentd-certs.
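
When deciding whether renewal is due, you can check the expiry of the currently deployed server certificate. The Kubernetes secret name below is an assumption; check the teleport-cluster release for the actual name:

Terminal window
# Secret name is assumed, not verified against the release.
$ kubectl get secret teleport-production-fluentd-certs --namespace=teleport-cluster-production -o json \
    | jq -r '.data."server.crt"' | base64 -d | openssl x509 -noout -enddate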

For renewing the certificates, create the following files in a directory.

openssl.conf
[req]
default_bits = 4096
distinguished_name = req_distinguished_name
string_mask = utf8only
default_md = sha256
x509_extensions = v3_ca
[req_distinguished_name]
countryName = Country Name (2 letter code)
stateOrProvinceName = State or Province Name
localityName = Locality Name
0.organizationName = Organization Name
organizationalUnitName = Organizational Unit Name
commonName = Common Name
emailAddress = Email Address
countryName_default =
stateOrProvinceName_default =
localityName_default =
0.organizationName_default = GitLab Inc.
organizationalUnitName_default = Teleport
commonName_default = localhost
emailAddress_default =
[v3_ca]
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid:always,issuer
basicConstraints = critical, CA:true, pathlen: 0
keyUsage = critical, cRLSign, keyCertSign
[client_cert]
basicConstraints = CA:FALSE
nsCertType = client, email
nsComment = "OpenSSL Generated Client Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, emailProtection
[crl_ext]
authorityKeyIdentifier = keyid:always
[ocsp]
basicConstraints = CA:FALSE
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, digitalSignature
extendedKeyUsage = critical, OCSPSigning
[staging_server_cert]
basicConstraints = CA:FALSE
nsCertType = server
nsComment = "OpenSSL Generated Server Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer:always
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @staging_alt_names
[staging_alt_names]
DNS.0 = teleport-staging-fluentd-headless.teleport-cluster-staging.svc.cluster.local
DNS.1 = *.teleport-staging-fluentd-headless.teleport-cluster-staging.svc.cluster.local
DNS.2 = teleport-staging-fluentd-aggregator.teleport-cluster-staging.svc.cluster.local
DNS.3 = *.teleport-staging-fluentd-aggregator.teleport-cluster-staging.svc.cluster.local
[production_server_cert]
basicConstraints = CA:FALSE
nsCertType = server
nsComment = "OpenSSL Generated Server Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer:always
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @production_alt_names
[production_alt_names]
DNS.0 = teleport-production-fluentd-headless.teleport-cluster-production.svc.cluster.local
DNS.1 = *.teleport-production-fluentd-headless.teleport-cluster-production.svc.cluster.local
DNS.2 = teleport-production-fluentd-aggregator.teleport-cluster-production.svc.cluster.local
DNS.3 = *.teleport-production-fluentd-aggregator.teleport-cluster-production.svc.cluster.local
Makefile
key_len := 4096
staging_dir := staging
production_dir := production
.PHONY: gen-staging
gen-staging:
	mkdir -p $(staging_dir)
	rm -f $(staging_dir)/*
	openssl genrsa -out $(staging_dir)/ca.key $(key_len)
	chmod 444 $(staging_dir)/ca.key
	openssl req -config openssl.conf -key $(staging_dir)/ca.key -new -x509 -days 3650 -sha256 -extensions v3_ca -subj "/CN=ca" -out $(staging_dir)/ca.crt
	openssl genrsa -out $(staging_dir)/client.key $(key_len)
	chmod 444 $(staging_dir)/client.key
	openssl req -config openssl.conf -subj "/CN=teleport-event-handler" -key $(staging_dir)/client.key -new -out $(staging_dir)/client.csr
	openssl x509 -req -in $(staging_dir)/client.csr -CA $(staging_dir)/ca.crt -CAkey $(staging_dir)/ca.key -CAcreateserial -days 365 -out $(staging_dir)/client.crt -extfile openssl.conf -extensions client_cert
	openssl genrsa -out $(staging_dir)/server.key $(key_len)
	chmod 444 $(staging_dir)/server.key
	openssl req -config openssl.conf -subj "/CN=fluentd-aggregator" -key $(staging_dir)/server.key -new -out $(staging_dir)/server.csr
	openssl x509 -req -in $(staging_dir)/server.csr -CA $(staging_dir)/ca.crt -CAkey $(staging_dir)/ca.key -CAcreateserial -days 365 -out $(staging_dir)/server.crt -extfile openssl.conf -extensions staging_server_cert
.PHONY: gen-production
gen-production:
	mkdir -p $(production_dir)
	rm -f $(production_dir)/*
	openssl genrsa -out $(production_dir)/ca.key $(key_len)
	chmod 444 $(production_dir)/ca.key
	openssl req -config openssl.conf -key $(production_dir)/ca.key -new -x509 -days 3650 -sha256 -extensions v3_ca -subj "/CN=ca" -out $(production_dir)/ca.crt
	openssl genrsa -out $(production_dir)/client.key $(key_len)
	chmod 444 $(production_dir)/client.key
	openssl req -config openssl.conf -subj "/CN=teleport-event-handler" -key $(production_dir)/client.key -new -out $(production_dir)/client.csr
	openssl x509 -req -in $(production_dir)/client.csr -CA $(production_dir)/ca.crt -CAkey $(production_dir)/ca.key -CAcreateserial -days 365 -out $(production_dir)/client.crt -extfile openssl.conf -extensions client_cert
	openssl genrsa -out $(production_dir)/server.key $(key_len)
	chmod 444 $(production_dir)/server.key
	openssl req -config openssl.conf -subj "/CN=fluentd-aggregator" -key $(production_dir)/server.key -new -out $(production_dir)/server.csr
	openssl x509 -req -in $(production_dir)/server.csr -CA $(production_dir)/ca.crt -CAkey $(production_dir)/ca.key -CAcreateserial -days 365 -out $(production_dir)/server.crt -extfile openssl.conf -extensions production_server_cert

Run the make gen-staging command to generate certificates for the teleport-staging cluster, and the make gen-production command for the teleport-production cluster. Next, update the Vault secret as follows, running the commands from inside the generated staging or production directory so the file paths resolve. The production secret path is shown; for staging, use k8s/ops-gitlab-gke/teleport-cluster-staging/fluentd-certs.

Terminal window
$ vault login -method oidc
$ vault kv put k8s/ops-central/teleport-cluster-production/fluentd-certs \
ca.crt="$(cat ca.crt)" \
server.crt="$(cat server.crt)" \
server.key="$(cat server.key)" \
client.crt="$(cat client.crt)" \
client.key="$(cat client.key)"
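
It is also worth sanity-checking that the new server and client certificates chain to the new CA (run from inside the generated directory; both should print OK):

Terminal window
$ openssl verify -CAfile ca.crt server.crt client.crt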

Finally, get the latest secret version from Vault and update the teleport-cluster release here.

Fluentd uses the fluent-plugin-gcloud-pubsub-custom gem for sending the audit events to the following Google Cloud Pub/Sub topics:

  • projects/gitlab-teleport-staging/topics/teleport-staging-events
  • projects/gitlab-teleport-production/topics/teleport-production-events

We use a custom-built OCI (Docker) image with the fluent-plugin-gcloud-pubsub-custom gem baked into the image. You can update/modify this image here.

We use Workload Identity for authenticating to Google Cloud. The Workload Identity is configured here and the required Roles are configured here.

When you add a new resource, there are manual steps required to initialize the new resource with the Teleport cluster.

  1. Your new resource should have the gitlab-teleport Chef cookbook as part of its run-list.
  2. Your new resource should also have network access to talk to the Teleport proxy in the cluster.
  3. After applying a new token to register a node, the next chef-client run should remove the secret from the config file.
  1. On your local machine, log in with tsh.
  2. Use tctl tokens add --ttl=5m --type=node to generate a new token.
  3. On the resource node, stop teleport with systemctl stop teleport.
  4. On the resource node, edit the /etc/teleport/config.yaml file, and add the generated token to the token_name field.
  5. Start up the teleport service again on the resource node: systemctl start teleport (see the consolidated example after these steps).
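
Putting steps 3-5 together on the resource node (a consolidated sketch of the same commands):

Terminal window
$ sudo systemctl stop teleport
$ sudo vim /etc/teleport/config.yaml    # set the token_name field to the generated token
$ sudo systemctl start teleport
$ sudo systemctl status teleport        # confirm the service came back up cleanly
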
  1. You will need to add the new cluster resources to the K8s Teleport Agent.
  2. You can reference sections earlier in this document about how the certificates are used to verify access to the database.

If you have added a new db-type (for example), you may need to add it to existing roles defined in Terraform. Here is an example merge request to add the new sec db-type to existing roles.
