Teleport Administration
This run book covers administration of the Teleport service from an infrastructure perspective.
- See the Teleport Rails Console runbook if you’d like to log in to a machine using Teleport.
- See the Teleport Database Console runbook if you’d like to connect to a database using Teleport.
- See the Teleport Approval Workflow runbook if you’d like to review and approve access requests.
- See the Teleport Disaster Recovery runbook if you’d like to know about the DR operations for Teleport.
Infrastructure Setup
We run two Teleport clusters. The production cluster is used for managing all of GitLab’s infrastructure resources (VMs, databases, etc.). The staging cluster is used for testing new changes, major upgrades, the disaster recovery process, etc. Users only need to know about and use the production Teleport cluster, available at https://production.teleport.gitlab.net.
The production Teleport cluster currently runs on the ops-central GKE cluster in us-central1, while the staging one runs on the ops-gitlab-gke GKE cluster in us-east1. We run the production Teleport cluster in a different region from the one where we run our production infrastructure, so we can still access our infrastructure in case there is a region failure in our production region (us-east1).
The infra-as-code for these Teleport clusters can be found in the following locations:
Architecture
At a very high level, Teleport has two major components: the Teleport cluster and Teleport agents. Teleport agents are processes that run on VMs or Kubernetes clusters and register resources with a Teleport cluster.
At GitLab, we run the Teleport agent on our VMs using the gitlab-teleport cookbook, and on our GKE clusters using the teleport-agent Helm chart and release. We also run a Teleport cluster using the teleport-cluster Helm chart and release.
The agents running on VMs register the VMs they are running on as servers. The agents running on Kubernetes clusters register the Kubernetes clusters they are running on. (Kubernetes management with Teleport is disabled at the moment, since our license does not include this feature.) The Teleport agents running on Kubernetes clusters also act as a proxy for registering our Postgres databases.
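For a quick sanity check of what the agents have registered, you can list the resources from your workstation (assuming you are already logged in with tsh; see the Authentication section below):

```shell
# List SSH servers registered by the VM agents.
$ tsh ls

# List the Postgres databases registered via the Kubernetes agents.
$ tsh db ls
```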
The Teleport cluster is comprised of multiple components.
- Teleport Auth Service
- The auth service acts as a certificate authority (CA) for the cluster.
- It issues certificates for clients and nodes, collects the audit information, and stores it in the audit log.
- The auth service is run in high-availability mode in our GKE clusters.
- The auth service can be configured via the tctl command-line tool.
- Teleport Proxy Service
- The proxy is the only service in a cluster visible to the outside world.
- All user connections for all supported protocols go through the proxy.
- The proxy also serves the Web UI and allows remote nodes to establish reverse tunnels.
- The proxy service is also run in high-availability mode in our GKE clusters.
- Teleport Kubernetes Operator
- The Teleport Kubernetes Operator provides a way for Kubernetes users to manage some Teleport resources through Kubernetes, following the Operator Pattern.
- Teleport Slack Plugin
- Teleport’s Slack integration is used for notifying individuals and the #teleport-requests channel.
- Teleport Event-Handler Plugin
- Teleport’s event-handler plugin allows securely sending audit events to a Fluentd instance for further processing by SIEM systems.
Authentication
Certificate-based authentication is the most secure form of authentication. Teleport supports all the necessary certificate management operations to enable certificate-based authentication at scale. You can read more about authentication in Teleport here.
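For illustration, a typical login flow against the production cluster looks like the following; tsh obtains a short-lived certificate via SSO, and tsh status shows who you are logged in as and when the certificate expires:

```shell
# Authenticate via the proxy; this typically opens a browser window for SSO.
$ tsh login --proxy=production.teleport.gitlab.net

# Inspect the issued certificate: logged-in user, roles, and validity period.
$ tsh status
```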
Authorization
Teleport uses a Role-Based Access Control (RBAC) model. We define roles, and each role specifies what is allowed and what is not. You can read more about authorization in Teleport here.
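For a feel of what a role looks like, here is a minimal, purely illustrative sketch; the role name, labels, and database user below are made up, and our real roles are managed in Terraform (see Role and Access Management below), so do not create ad-hoc roles outside of experiments on staging:

```shell
# Illustrative only: a role granting read-only access to gstg-labeled databases.
# The name, labels, and db user are hypothetical examples.
$ cat > example-role.yaml <<'EOF'
kind: role
version: v7
metadata:
  name: example-db-viewer
spec:
  allow:
    db_labels:
      env: gstg
    db_names: ['*']
    db_users: ['console-ro']
  deny: {}
EOF
$ tctl create -f example-role.yaml
```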
Secrets
We use the Google Cloud Key Management Service (KMS) to store and handle Teleport certificate authorities.
Teleport generates private key material for its internal Certificate Authorities (CAs) during the first Auth Server’s initial startup. These CAs are used to sign all certificates issued to clients and hosts in the Teleport cluster. When configured to use Google Cloud KMS, all private key material for these CAs will be generated, stored, and used for signing inside of Google Cloud KMS. Instead of the actual private key, Teleport will only store the ID of the KMS key. In short, private key material will never leave Google Cloud KMS.
Please refer to this guide for more information on storing Teleport private keys in Google Cloud KMS.
We create a Key Ring and a CryptoKey for the Teleport CA in this file. We then reference this key when installing the teleport-cluster Helm chart here.
We do not manage any certificate authorities or private keys inside the cluster; they are all stored in and managed by KMS.
To help guard against data corruption and to verify that data can be decrypted successfully, Cloud KMS periodically scans and backs up all key material and metadata.
Please refer to this deep dive document on Google Cloud KMS and automatic backups.
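If you need to sanity-check the key material (for example, during an audit), you can list the CA key directly in Cloud KMS. The project and key ring names below are placeholders; take the real values from the Terraform file linked above:

```shell
# Placeholder names: substitute the project and key ring defined in Terraform.
$ gcloud kms keys list \
    --project=<teleport-project> \
    --location=us-central1 \
    --keyring=<teleport-ca-keyring>
```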
Database Management
The PostgreSQL databases are registered with the Teleport instance by Teleport agents running on our regional Kubernetes clusters.
The certificates (gprd and gstg) used by the teleport-agent running on Kubernetes clusters should match the PostgreSQL server certificate located at /var/opt/gitlab/postgresql/server.crt. Furthermore, the CA file found at /var/opt/gitlab/postgresql/cacert on each Patroni or PostgreSQL node should correspond to the certificate authority used by the Teleport instance with which those databases are registered.
This file is written by the gitlab-patroni::postgres recipe, which is imported by the gitlab-patroni::default recipe. This recipe retrieves the content of the CA file from gs://gitlab-<env>-secrets/gitlab-patroni/gstg.enc.
If you move databases to a different Teleport instance and update this CA file, please remember to run the select pg_reload_conf(); command from the gitlab-psql shell on each node to reload the updated CA.
Here is an example CR for updating the CA file.
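When debugging database registration failures, it can help to first confirm the certificate chain on a node. A minimal sketch (the exact gitlab-psql invocation may differ):

```shell
# On a Patroni/PostgreSQL node: inspect the server certificate...
$ sudo openssl x509 -in /var/opt/gitlab/postgresql/server.crt -noout -subject -enddate

# ...verify it chains to the CA file Teleport is configured with...
$ sudo openssl verify -CAfile /var/opt/gitlab/postgresql/cacert /var/opt/gitlab/postgresql/server.crt

# ...and reload PostgreSQL after any CA update.
$ gitlab-psql -c 'select pg_reload_conf();'
```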
Role and Access Management
Teleport roles and permissions are defined in roles_*.tf files. If you add a new role, you also need to add it to the roles.tf file. The association between Okta groups and Teleport roles is configured in the groups.tf file.
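After Terraform applies a role change, you can confirm it landed in the cluster (replace my-new-role with the actual role name):

```shell
# Dump a single role as YAML to verify the applied permissions.
$ tctl get roles/my-new-role
```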
Checking Status of Teleport
All the services comprising a Teleport cluster are stateless.
- All certificates and keys are stored in KMS.
- The cluster internal state and audit events are stored in Firestore.
- The session recordings are stored in a Cloud Storage bucket.
You can check the status of the production Teleport cluster by running the following command, after running the glsh kube use-cluster ops-central command in a separate shell:
$ kubectl exec --namespace=teleport-cluster-production --stdin --tty <teleport-production-auth-pod> -- tctl status
Example output
```
Cluster      production.teleport.gitlab.net
Version      16.1.7
host CA      never updated
user CA      never updated
db CA        never updated
db_client CA never updated
openssh CA   never updated
jwt CA       never updated
saml_idp CA  never updated
oidc_idp CA  never updated
spiffe CA    never updated
CA pin       sha256:d4ac1c9af5d25e6cf3c60c8078efe443c1186c071c99641dcd9b11eb0831f46d
```
Generally, if the pods are healthy, then the service is healthy.
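A quick way to check pod health:

```shell
# All pods in the namespace should be Running and Ready.
$ kubectl get pods --namespace=teleport-cluster-production
```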
Updating Enterprise License
We use the same license for both the staging and production Teleport instances.
When our license is about to expire, we need to obtain a new license file and update our Teleport instances with it.
Read more about the Enterprise License file here and managing it here.
In short, you need to log in to gitlab-tp.teleport.sh as an admin and download the new license file (license.pem).
You can also ask an admin user to do so and share the license file with you through a secure channel (1Password). Admin users include the business owners listed in the tech stack - search for “title: teleport”.
Add the new license to Vault.
$ glsh vault proxy
$ vault login -method oidc
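The exact Vault path for the license secret is not reproduced here; the write and version lookup would look roughly like the following, with <mount> and <path> as placeholders for the real location:

```shell
# <mount>/<path> are placeholders; confirm the real location of the license secret first.
$ vault kv put -mount=<mount> <path>/license license.pem=@license.pem

# The "version" metadata field is the number to reference in the chart update.
$ vault kv get -format=json -mount=<mount> <path>/license | jq '.data.metadata.version'
```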
Grab the latest version from the output of the last command and update it here.
Finally, restart the Teleport Auth component.
$ glsh kube use-cluster ops-central
$ kubectl rollout restart deployment/teleport-production-auth --namespace=teleport-cluster-production
Troubleshooting
If the license hasn’t updated as expected after bumping the secret version, first check that the license secret contains what you expect:
kubectl get secrets license -n teleport-cluster-production -o json | jq -r '.data."license.pem"' | base64 -d
If it doesn’t, check what the refreshInterval of the ExternalSecret is:
kubectl get es license -n teleport-cluster-production -o json | jq .spec.refreshInterval
If refreshInterval is set to 0, external-secrets will never update the secret from Vault, so you will need to change the refreshInterval to something non-zero (e.g. 1h).
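A one-off fix can be applied with kubectl patch, though if the ExternalSecret is managed by the Helm release, make the same change in the chart values or it will be reverted on the next deploy:

```shell
# Set a non-zero refresh interval so external-secrets re-syncs from Vault.
$ kubectl patch externalsecret license \
    --namespace=teleport-cluster-production \
    --type=merge \
    --patch '{"spec":{"refreshInterval":"1h"}}'
```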
Terraform Integration
Terraform needs a user with the required permissions to manage the Teleport cluster. See this guide for more information.
Currently, we use a terraform user and generate an auth identity for this user using impersonation. We generate this identity to be valid for one year, so it needs to be regenerated once a year. We should switch to Machine ID, which does not require manual renewal. Please refer to this issue.
- The following roles and user are created via the teleport-bootstrap chart:
  - Role terraform-cluster-manager
  - Role terraform-impersonator
  - User terraform
Update Auth Identity
Use the following Makefile if you need to update or rotate the auth identity for the terraform user.
Makefile
```makefile
okta_group := GitLab - SRE
impersonator_role := terraform-impersonator
target_user := terraform
authid_ttl := 8760h

define update_auth_id
	tsh login --proxy=$(1).teleport.gitlab.net
	tctl get saml/okta --with-secrets > user-saml-$(1).yaml
	# Add the impersonator role to the Okta group
	yq eval '.spec.attributes_to_roles[] |= select(.value == "$(okta_group)") .roles += ["$(impersonator_role)"]' -i user-saml-$(1).yaml
	tctl create -f user-saml-$(1).yaml
	# Login again to assume the new role
	tsh logout
	tsh login --proxy=$(1).teleport.gitlab.net
	# Request and sign an identity certificate for the user
	tctl auth sign --user=$(target_user) --ttl=$(authid_ttl) --out=auth-id-$(1)
	# Write the new auth identity to Vault
	vault login -method oidc
	vault kv patch -mount=ci ops-gitlab-net/gitlab-com/gl-infra/config-mgmt/teleport-$(1)/teleport auth_id=@auth-id-$(1)
	# Remove created files
	rm user-saml-$(1).yaml auth-id-$(1)
	# Display the latest version of the Vault secret
	vault kv get -format=json -mount=ci ops-gitlab-net/gitlab-com/gl-infra/config-mgmt/teleport-$(1)/teleport | jq '.data.metadata.version' | xargs -I {} echo "The latest secret version: {}"
endef

.PHONY: update-production
update-production:
	$(call update_auth_id,production)

.PHONY: update-staging
update-staging:
	$(call update_auth_id,staging)
```
- Run the glsh vault proxy command in a separate terminal.
- Run make update-staging or make update-production.
- Terraform will pick up the latest version of the Vault secret.
Slack Integration
The teleport-plugin-slack is used for communicating with Slack.
See this guide for running this plugin.
Currently, we use a slack user and generate an auth identity for this user using impersonation. We generate this identity to be valid for one year, so it needs to be regenerated once a year. We should switch to Machine ID for running this plugin, which does not require manual renewal. Please refer to this issue.
- This plugin is installed using the teleport-plugin-slack Helm chart (see this).
- The following roles and user are created via the teleport-bootstrap chart:
  - Role slack-access-requests-viewer
  - Role slack-access-requests-manager
  - Role slack-impersonator
  - User slack
Update Auth Identity
Use the following Makefile if you need to update or rotate the auth identity for the slack user.
Makefile
```makefile
okta_group := GitLab - SRE
impersonator_role := slack-impersonator
target_user := slack
authid_ttl := 8760h

define update_auth_id
	$(eval cluster=$(if $(filter production,$(1)),ops-central,$(if $(filter staging,$(1)),ops-gitlab-gke,invalid)))
	tsh login --proxy=$(1).teleport.gitlab.net
	tctl get saml/okta --with-secrets > user-saml-$(1).yaml
	# Add the impersonator role to the Okta group
	yq eval '.spec.attributes_to_roles[] |= select(.value == "$(okta_group)") .roles += ["$(impersonator_role)"]' -i user-saml-$(1).yaml
	tctl create -f user-saml-$(1).yaml
	# Login again to assume the new role
	tsh logout
	tsh login --proxy=$(1).teleport.gitlab.net
	# Request and sign an identity certificate for the user
	tctl auth sign --user=$(target_user) --ttl=$(authid_ttl) --out=auth-id-$(1)
	# Write the new auth identity to Vault
	vault login -method oidc
	vault kv patch -mount=k8s $(cluster)/teleport-cluster-$(1)/slack auth_id=@auth-id-$(1)
	# Remove created files
	rm user-saml-$(1).yaml auth-id-$(1)
	# Display the latest version of the Vault secret
	vault kv get -format=json -mount=k8s $(cluster)/teleport-cluster-$(1)/slack | jq '.data.metadata.version' | xargs -I {} echo "The latest secret version: {}"
endef

.PHONY: update-production
update-production:
	$(call update_auth_id,production)

.PHONY: update-staging
update-staging:
	$(call update_auth_id,staging)
```
- Run the glsh vault proxy command in a separate terminal.
- Run make update-staging or make update-production.
- The last line of the output shows the latest version of the Vault secret. Grab it and update the teleport-cluster release with it (staging and production).
- Restart the teleport-staging-slack or teleport-production-slack deployment to pick up the new auth identity.

```shell
# Staging
$ glsh kube use-cluster ops
$ kubectl rollout restart deployment/teleport-staging-slack --namespace=teleport-cluster-staging

# Production
$ glsh kube use-cluster ops-central
$ kubectl rollout restart deployment/teleport-production-slack --namespace=teleport-cluster-production
```
SIEM Integration
The teleport-plugin-event-handler is used for handling Teleport audit events and sending them to a Fluentd instance, so such events can be shipped onward to other systems (e.g. a SIEM) for security and auditing purposes. Read more about exporting Teleport audit events here. See this guide and this one for further instructions.
Currently, we use an event-handler user and generate an auth identity for this user using impersonation. We generate this identity to be valid for one year, so it needs to be regenerated once a year. We should switch to Machine ID for running this plugin, which does not require manual renewal. Please refer to this issue.
- This plugin is installed using the teleport-plugin-event-handler Helm chart (see this).
- The following roles and user are created via the teleport-bootstrap chart:
  - Role event-handler-events-sessions-viewer
  - Role event-handler-impersonator
  - User event-handler
Update Auth Identity
Use the following Makefile if you need to update or rotate the auth identity for the event-handler user.
Makefile
```makefile
okta_group := GitLab - SRE
impersonator_role := event-handler-impersonator
target_user := event-handler
authid_ttl := 8760h

define update_auth_id
	$(eval cluster=$(if $(filter production,$(1)),ops-central,$(if $(filter staging,$(1)),ops-gitlab-gke,invalid)))
	tsh login --proxy=$(1).teleport.gitlab.net
	tctl get saml/okta --with-secrets > user-saml-$(1).yaml
	# Add the impersonator role to the Okta group
	yq eval '.spec.attributes_to_roles[] |= select(.value == "$(okta_group)") .roles += ["$(impersonator_role)"]' -i user-saml-$(1).yaml
	tctl create -f user-saml-$(1).yaml
	# Login again to assume the new role
	tsh logout
	tsh login --proxy=$(1).teleport.gitlab.net
	# Request and sign an identity certificate for the user
	tctl auth sign --user=$(target_user) --ttl=$(authid_ttl) --out=auth-id-$(1)
	# Write the new auth identity to Vault
	vault login -method oidc
	vault kv patch -mount=k8s $(cluster)/teleport-cluster-$(1)/event-handler auth_id=@auth-id-$(1)
	# Remove created files
	rm user-saml-$(1).yaml auth-id-$(1)
	# Display the latest version of the Vault secret
	vault kv get -format=json -mount=k8s $(cluster)/teleport-cluster-$(1)/event-handler | jq '.data.metadata.version' | xargs -I {} echo "The latest secret version: {}"
endef

.PHONY: update-production
update-production:
	$(call update_auth_id,production)

.PHONY: update-staging
update-staging:
	$(call update_auth_id,staging)
```
- Run the glsh vault proxy command in a separate terminal.
- Run make update-staging or make update-production.
- The last line of the output shows the latest version of the Vault secret. Grab it and update the teleport-cluster release with it (staging and production).
- Restart the teleport-staging-event-handler or teleport-production-event-handler deployment to pick up the new auth identity.

```shell
# Staging
$ glsh kube use-cluster ops
$ kubectl rollout restart deployment/teleport-staging-event-handler --namespace=teleport-cluster-staging

# Production
$ glsh kube use-cluster ops-central
$ kubectl rollout restart deployment/teleport-production-event-handler --namespace=teleport-cluster-production
```
Configuring mTLS Communication
The teleport-plugin-event-handler requires a mutual TLS connection to be enabled on the fluentd instance for security purposes. The certificate authority, server key and certificate, and client key and certificate are stored in Vault at k8s/ops-gitlab-gke/teleport-cluster-staging/fluentd-certs and k8s/ops-central/teleport-cluster-production/fluentd-certs.
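Before renewing, you can check how much time is left on the current server certificate straight from Vault:

```shell
# Pull the production server certificate from Vault and print its expiry date.
$ vault kv get -mount=k8s -field=server.crt \
    ops-central/teleport-cluster-production/fluentd-certs \
  | openssl x509 -noout -subject -enddate
```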
For renewing the certificates, create the following files in a directory.
openssl.conf
```ini
[req]
default_bits = 4096
distinguished_name = req_distinguished_name
string_mask = utf8only
default_md = sha256
x509_extensions = v3_ca

[req_distinguished_name]
countryName = Country Name (2 letter code)
stateOrProvinceName = State or Province Name
localityName = Locality Name
0.organizationName = Organization Name
organizationalUnitName = Organizational Unit Name
commonName = Common Name
emailAddress = Email Address

countryName_default =
stateOrProvinceName_default =
localityName_default =
0.organizationName_default = GitLab Inc.
organizationalUnitName_default = Teleport
commonName_default = localhost
emailAddress_default =

[v3_ca]
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid:always,issuer
basicConstraints = critical, CA:true, pathlen: 0
keyUsage = critical, cRLSign, keyCertSign

[client_cert]
basicConstraints = CA:FALSE
nsCertType = client, email
nsComment = "OpenSSL Generated Client Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, emailProtection

[crl_ext]
authorityKeyIdentifier = keyid:always

[ocsp]
basicConstraints = CA:FALSE
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, digitalSignature
extendedKeyUsage = critical, OCSPSigning

[staging_server_cert]
basicConstraints = CA:FALSE
nsCertType = server
nsComment = "OpenSSL Generated Server Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer:always
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @staging_alt_names

[staging_alt_names]
DNS.0 = teleport-staging-fluentd-headless.teleport-cluster-staging.svc.cluster.local
DNS.1 = *.teleport-staging-fluentd-headless.teleport-cluster-staging.svc.cluster.local
DNS.2 = teleport-staging-fluentd-aggregator.teleport-cluster-staging.svc.cluster.local
DNS.3 = *.teleport-staging-fluentd-aggregator.teleport-cluster-staging.svc.cluster.local

[production_server_cert]
basicConstraints = CA:FALSE
nsCertType = server
nsComment = "OpenSSL Generated Server Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer:always
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @production_alt_names

[production_alt_names]
DNS.0 = teleport-production-fluentd-headless.teleport-cluster-production.svc.cluster.local
DNS.1 = *.teleport-production-fluentd-headless.teleport-cluster-production.svc.cluster.local
DNS.2 = teleport-production-fluentd-aggregator.teleport-cluster-production.svc.cluster.local
DNS.3 = *.teleport-production-fluentd-aggregator.teleport-cluster-production.svc.cluster.local
```
Makefile
```makefile
key_len := 4096
staging_dir := staging
production_dir := production

.PHONY: gen-staging
gen-staging:
	mkdir -p $(staging_dir)
	rm -f $(staging_dir)/*
	openssl genrsa -out $(staging_dir)/ca.key $(key_len)
	chmod 444 $(staging_dir)/ca.key
	openssl req -config openssl.conf -key $(staging_dir)/ca.key -new -x509 -days 3650 -sha256 -extensions v3_ca -subj "/CN=ca" -out $(staging_dir)/ca.crt
	openssl genrsa -out $(staging_dir)/client.key $(key_len)
	chmod 444 $(staging_dir)/client.key
	openssl req -config openssl.conf -subj "/CN=teleport-event-handler" -key $(staging_dir)/client.key -new -out $(staging_dir)/client.csr
	openssl x509 -req -in $(staging_dir)/client.csr -CA $(staging_dir)/ca.crt -CAkey $(staging_dir)/ca.key -CAcreateserial -days 365 -out $(staging_dir)/client.crt -extfile openssl.conf -extensions client_cert
	openssl genrsa -out $(staging_dir)/server.key $(key_len)
	chmod 444 $(staging_dir)/server.key
	openssl req -config openssl.conf -subj "/CN=fluentd-aggregator" -key $(staging_dir)/server.key -new -out $(staging_dir)/server.csr
	openssl x509 -req -in $(staging_dir)/server.csr -CA $(staging_dir)/ca.crt -CAkey $(staging_dir)/ca.key -CAcreateserial -days 365 -out $(staging_dir)/server.crt -extfile openssl.conf -extensions staging_server_cert

.PHONY: gen-production
gen-production:
	mkdir -p $(production_dir)
	rm -f $(production_dir)/*
	openssl genrsa -out $(production_dir)/ca.key $(key_len)
	chmod 444 $(production_dir)/ca.key
	openssl req -config openssl.conf -key $(production_dir)/ca.key -new -x509 -days 3650 -sha256 -extensions v3_ca -subj "/CN=ca" -out $(production_dir)/ca.crt
	openssl genrsa -out $(production_dir)/client.key $(key_len)
	chmod 444 $(production_dir)/client.key
	openssl req -config openssl.conf -subj "/CN=teleport-event-handler" -key $(production_dir)/client.key -new -out $(production_dir)/client.csr
	openssl x509 -req -in $(production_dir)/client.csr -CA $(production_dir)/ca.crt -CAkey $(production_dir)/ca.key -CAcreateserial -days 365 -out $(production_dir)/client.crt -extfile openssl.conf -extensions client_cert
	openssl genrsa -out $(production_dir)/server.key $(key_len)
	chmod 444 $(production_dir)/server.key
	openssl req -config openssl.conf -subj "/CN=fluentd-aggregator" -key $(production_dir)/server.key -new -out $(production_dir)/server.csr
	openssl x509 -req -in $(production_dir)/server.csr -CA $(production_dir)/ca.crt -CAkey $(production_dir)/ca.key -CAcreateserial -days 365 -out $(production_dir)/server.crt -extfile openssl.conf -extensions production_server_cert
```
Run the make gen-staging command to generate certificates for the teleport-staging cluster, and the make gen-production command for the teleport-production cluster.
Next, update the Vault secret as follows.
```shell
$ vault login -method oidc
$ vault kv put k8s/ops-central/teleport-cluster-production/fluentd-certs \
    ca.crt="$(cat ca.crt)" \
    server.crt="$(cat server.crt)" \
    server.key="$(cat server.key)" \
    client.crt="$(cat client.crt)" \
    client.key="$(cat client.key)"
```
Finally, get the latest secret version from Vault and update the teleport-cluster release here.
Google Cloud Pub/Sub Configurations
Fluentd uses the fluent-plugin-gcloud-pubsub-custom gem for sending the audit events to the following Google Cloud Pub/Sub topics:
projects/gitlab-teleport-staging/topics/teleport-staging-events
projects/gitlab-teleport-production/topics/teleport-production-events
We use a custom-built OCI (Docker) image with the fluent-plugin-gcloud-pubsub-custom gem baked into the image. You can update/modify this image here.
We use Workload Identity for authenticating to Google Cloud. The Workload Identity is configured here and the required Roles are configured here.
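If Pub/Sub publishing fails with permission errors, a first check is whether the Fluentd Kubernetes service account carries the Workload Identity annotation. The service account name below is a guess; list the service accounts first if unsure:

```shell
# Service account name is assumed; list them first if unsure.
$ kubectl get serviceaccounts --namespace=teleport-cluster-production

# The annotation should point at the Google service account configured in Terraform.
$ kubectl get serviceaccount teleport-production-fluentd \
    --namespace=teleport-cluster-production \
    --output=jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'
```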
Adding New Resources
When you add a new resource, there are manual steps required to initialize the new resource with the Teleport cluster.
- Your new resource should have the gitlab-teleport Chef cookbook as part of its run list.
- Your new resource should also have network access to talk to the Teleport proxy in the cluster.
- After applying a new token to register a node, the next chef-client run should remove the secret from the config file.
SSH Nodes
- On your local machine, log in with tsh.
- Use tctl tokens add --ttl=5m --type=node to generate a new token.
- On the resource node, stop Teleport with systemctl stop teleport.
- On the resource node, edit the /etc/teleport/config.yaml file, and add the generated token to the token_name field.
- Start up the Teleport service again on the resource node with systemctl start teleport, then verify the registration as sketched below.
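Once the service is back up, verify that the node registered successfully (replace the placeholders with your node’s hostname and your user):

```shell
# The node should appear in the inventory shortly after startup.
$ tctl nodes ls | grep <new-node-hostname>

# Or confirm end-to-end by opening a session through Teleport.
$ tsh ssh <user>@<new-node-hostname>
```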
Databases
- You will need to add the new cluster resources to the K8s Teleport Agent.
- You can reference sections earlier in this document about how the certificates are used to verify access to the database.
Updating Roles
If you have added a new db-type (for example), you may need to add it to existing roles defined in Terraform.
Here is an example merge request to add the new sec db-type to existing roles.