## Summary

### Accessing Kubernetes Clusters
Note: Before starting an on-call shift, make sure you have followed these setup instructions.

The majority of our Kubernetes configuration is managed using these projects:
- https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com
- https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles
- https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/common
  - A dependency of the helmfile repositories above.
- https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/tanka-deployments
:warning: CI jobs are executed on the ops instance. :warning:

:warning: Deployer makes changes to the cluster config outside of git, but using pipelines in these projects. This means that the state of the cluster is often not reflected in the projects linked above; however, it should usually be possible to trace down the CI job that applied a given change. :warning:

These projects include CI jobs that apply the relevant config to the right cluster. Most of what we do does not require interacting with clusters directly, but rather making changes to the code in these projects.
Certain diagnostic steps can only be performed by interacting with Kubernetes directly, so you need to be able to run `kubectl` commands. Remember to avoid making any changes to the cluster's config outside of git!
We use private GKE clusters, with the control plane only accessible from within the cluster’s VPC or the VPN.
- Set up Yubikey SSH keys: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/yubikey.md
- Set up bastion access for the clusters: https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/bastions
- Install `gcloud`: https://cloud.google.com/sdk/docs/install
- Install `gke-gcloud-auth-plugin`: `gcloud components install gke-gcloud-auth-plugin`.
- Install `kubectl`.
- Run `gcloud auth login`: the browser will open so you can choose which Google email address to use for OAuth.

  💡 If you see warnings about permissions issues related to `~/.config/gcloud/*`, check the permissions of this directory. Simply change it to your user if necessary:

  ```
  sudo chown -R $(whoami) ~/.config
  ```
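Before moving on, you can confirm the plugin and client are installed correctly; both checks below are standard (`gke-gcloud-auth-plugin --version` is the verification command from Google's install docs):

```
gke-gcloud-auth-plugin --version   # should print the plugin version
kubectl version --client           # should print your local kubectl version
```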
### Option A: VPN

- Run `glsh kube setup --no-proxy`: this uses `glsh` to set up the `kubectl` configuration to be able to talk to all clusters without a proxy.
- Launch NordLayer and connect to one of the organization gateways (:warning: not to a shared gateway!).
- Run `glsh kube use-cluster`: this will print all the available clusters.
- Run `glsh kube use-cluster gstg --no-proxy`: this will connect you to the `gstg` regional cluster. Alternatively, you can directly run `kubectx` to select a cluster from an interactive menu.
- Run `kubectl get nodes`: this will list all the nodes available in the cluster.
### Option B: SOCKS5 proxy via SSH

- Run `glsh kube setup`: this uses `glsh` to set up the `kubectl` configuration to be able to talk to all clusters via a SOCKS5 proxy.
- Run `glsh kube use-cluster`: this will print all the available clusters.
- Run `glsh kube use-cluster gstg`: this will connect you to the `gstg` regional cluster.
- In a new window, run `kubectl get nodes`: this will list all the nodes available in the cluster.
## GUI consoles and metrics

When troubleshooting issues, it can often be helpful to have a graphical overview of resources within the cluster, and basic metric data. For more detailed and expansive metric data, we have a number of dashboards within Grafana.

For either tunneling mechanism above (used with `glsh kube use-cluster $CLUSTER`), one excellent option for a local graphical view into the clusters that works with both is the Lens IDE. Alternatively, the GKE console provides access to much of the same information via a web browser.
## Accessing a node

- Initiate an SSH connection to one of the production nodes; this requires a fairly recent version of `gcloud`:

  ```
  kubectl get pods -o wide   # the NODE column shows the node where each pod is scheduled
  gcloud compute --project "gitlab-production" ssh <node name> --tunnel-through-iap
  ```

  This will create an SSH key that is propagated to the GCP project to allow access. You may receive a message from SIRTBot afterwards.
- From the node you can list containers and get shell access to a pod as root. At this writing our node pools run a mix of docker and containerd, but eventually we expect them all to be containerd. When using the code snippets below on docker nodes, change `crictl` to `docker`; they are functionally mostly equivalent for common basic tasks. To quickly see if a node is running docker without explicitly looking it up, run `docker ps`; any containers listed in the output mean it is a docker node, while empty output means containerd.

  ```
  crictl ps
  crictl exec -u root -it <container> /bin/bash
  ```
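On a docker node, the same two steps look like this (a direct substitution, per the note above):

```
docker ps
docker exec -u root -it <container> /bin/bash
```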
- You shouldn't install anything on the GKE nodes. Instead, use toolbox to troubleshoot problems, for example to run strace on a process running in one of the GitLab containers. You can install anything you want in the toolbox container.

  ```
  gcloud compute --project "gitlab-production" ssh <node name>
  toolbox
  ```

  For more documentation on toolbox see: https://cloud.google.com/container-optimized-os/docs/how-to/toolbox
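As a concrete sketch of the strace example (toolbox on Container-Optimized OS nodes is Debian-based, so `apt-get` works inside it; the process name is a placeholder, not a specific GitLab process):

```
# inside toolbox on the node
apt-get update && apt-get install -y strace procps
pgrep -f <process name>   # find the PID of the target process
strace -f -p <pid>        # attach and follow child processes/threads
```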
For more troubleshooting tips see also: attach to a running container
## Accessing a pod

- Initiate an interactive shell session in one of the pods. Bear in mind that many containers do not include a shell, which means you won't be able to access them in this way.

  ```
  kubectl exec -it <pod_name> -- sh
  ```
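If the pod has multiple containers, or lives in a namespace other than your current one, the standard kubectl flags apply (the namespace and container names below are placeholders):

```
kubectl exec -n <namespace> -it <pod_name> -c <container_name> -- sh
kubectl get pod <pod_name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'   # list container names
```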
## Running the Kubernetes config locally

There are certain scenarios in which you might want to evaluate our Kubernetes config locally. One such scenario is during an incident, when the CI jobs are unable to run. Another is during development, when you want to test the config against a local cluster such as minikube or k3d.

In order to run the config locally, you need to install tools from the projects with Kubernetes config linked above.
## Install tools

- Check out the repos from all of the projects listed above.
- Install tools from them. They contain `.tool-versions` files which should be used with `asdf`, for example:

  ```
  cd gitlab-helmfiles
  asdf install
  ```

- Install the helm plugins by running the script https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/common/-/blob/master/bin/install-helm-plugins.sh
  - You'll want to run this with the version of helm used by gitlab-com / gitlab-helmfiles "active". If you're using asdf, you can achieve this by running the script from inside one of the helmfile repos, as sketched below.
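A minimal sketch, assuming the `common` repo is checked out next to `gitlab-helmfiles` (adjust the relative path to match your checkout layout):

```
cd gitlab-helmfiles                     # asdf activates the helm version pinned in .tool-versions
../common/bin/install-helm-plugins.sh   # assumed sibling checkout of the common repo
```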
## Workstation setup for k-ctl

- Get the credentials for the pre-prod cluster:

  ```
  gcloud container clusters get-credentials pre-gitlab-gke --region us-east1 --project gitlab-pre
  ```

- Set up the local environment for `k-ctl`.

These steps walk through running `k-ctl` against the preprod cluster, but they can also be used to connect to any of the staging or production clusters using the tunneling mechanisms described above.
It is very unlikely that you will need to make a configuration change to the clusters outside of CI; follow these instructions for the rare case where this is necessary.

`k-ctl` is a shell wrapper over `helmfile` used by the k8s-workloads/gitlab-com project.

```
cd gitlab-com
export CLUSTER=pre-gitlab-gke
export REGION=us-east1
./bin/k-ctl -e pre list
```

You should see successful output listing the helm objects as well as the custom Kubernetes objects managed by the `gitlab-com` repository.

Note that if you've renamed your kube contexts to something less unwieldy, you can make the wrapper use your current context:

```
kubectl config use-context pre
FORCE_KUBE_CONTEXT=1 ./bin/k-ctl -e pre list
```
- Make a change to the preprod configuration and execute a dry-run:

  ```
  vi releases/gitlab/values/pre.yaml.gotmpl
  ./bin/k-ctl -e pre -D upgrade
  ```
## Draining a zonal cluster

It's possible to drain and stop connections to an entire zonal cluster. This should only be done in extreme circumstances where you want to stop traffic to an entire availability zone.

- Get the server state for the production `us-east1-b` zone, using the `bin/get-server-state` script in `chef-repo`:

  ```
  ./bin/get-server-state gprd gke-us-east1-b
  ```

- `./bin/set-server-state` is used to set the state, just like for any other server in an HAProxy backend; see the sketch below.
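A hedged sketch of draining and then restoring the zone, assuming `set-server-state` takes the same environment and server-pattern arguments as `get-server-state` plus the desired HAProxy state (verify the script's usage text before touching production):

```
./bin/set-server-state gprd drain gke-us-east1-b   # assumed argument order: env, state, server pattern
./bin/set-server-state gprd ready gke-us-east1-b   # restore the zone to service
```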
## Connection to the server refused

If `kubectl get nodes` returns an error like "The connection to the server xx.xx.xxx.xx was refused - did you specify the right host or port?", this probably means SSH access via the bastion was not set up properly. Refer to the bastion setup documentation to set up your bastion SSH access. You can use your Yubikey to set up your SSH keys by following the documentation here.
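To test the SSH leg on its own, you can connect to a bastion directly (the hostname below assumes the naming pattern from the bastion docs; substitute your target environment):

```
ssh -v lb-bastion.gstg.gitlab.com   # verbose output shows where key or agent negotiation fails
```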