Rebuilding a GKE cluster
1. Skip cluster deployments
It's necessary to skip deploying to the cluster while replacing it so that we don't disrupt Auto Deploy.
- Make sure the Auto Deploy pipeline is not active and no deployment is in progress for the targeted environment.
- Identify the name of the cluster we need to skip, using the full name of the GKE cluster, for example `gstg-us-east1-b`.
- Set the `CLUSTER_SKIP` variable to the name of the cluster (`gstg-us-east1-b` for instance) in the `ops` mirror CI variables, from which deployment pipelines are run (a sketch of setting it via the API follows this list).
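Setting the variable through the project's CI/CD settings UI works fine; if you prefer the command line, a minimal sketch using the GitLab CI variables API follows. The project ID and token are placeholders, not real values:

```
# Sketch only: create the CLUSTER_SKIP variable on the ops mirror via the GitLab API.
# <ops-mirror-project-id> and <ops-token> are placeholders.
curl --request POST --header "PRIVATE-TOKEN: <ops-token>" \
  "https://ops.gitlab.net/api/v4/projects/<ops-mirror-project-id>/variables" \
  --form "key=CLUSTER_SKIP" --form "value=gstg-us-east1-b"
```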
2. Pause monitoring
Create silences on alerts.gitlab.net using the following example filters:
```
env="gstg" cluster="gstg-us-east1-b"
env="gstg" region="us-east1-b" alert_class="traffic_cessation"
```
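The silences can be created in the Alertmanager UI, or with `amtool` if you have it pointed at the same Alertmanager. The URL, duration, and comment below are illustrative assumptions:

```
# Sketch: create the two silences from the CLI (values are examples, adjust as needed).
amtool --alertmanager.url=https://alerts.gitlab.net silence add \
  --duration=4h --comment="Rebuilding gstg-us-east1-b" \
  env="gstg" cluster="gstg-us-east1-b"
amtool --alertmanager.url=https://alerts.gitlab.net silence add \
  --duration=4h --comment="Rebuilding gstg-us-east1-b" \
  env="gstg" region="us-east1-b" alert_class="traffic_cessation"
```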
3. Disable traffic to the cluster
All HAProxy nodes in the zone need to be stopped to disable all traffic to the zonal cluster in a timely manner while also not over-saturating Canary, as it doesn't have the capacity to handle the full main stage traffic of a single zone.
This will trigger a graceful stop of all HAProxy nodes in the zone with a forced stop after 5 minutes, 5 at a time with a 1 minute pause after each one:
```
knife ssh -C 5 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
  'sudo systemctl mask haproxy.service; sudo systemctl kill --signal SIGUSR1 haproxy.service; while [ $(systemctl is-active haproxy.service) != "inactive" ] && [ ${i:=1} -lt 150 ]; do sleep 2; i=$((i + 1)); done; sudo systemctl stop haproxy.service; systemctl status haproxy.service; sleep 60'
```
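Once the run finishes, a quick spot check along these lines (a sketch reusing the same Chef search) should report `inactive` for every node in the zone:

```
# Sketch: confirm HAProxy is stopped on all nodes in the zone.
# `|| true` keeps knife from flagging the non-zero exit code of "is-active" on stopped units.
knife ssh 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
  'systemctl is-active haproxy.service || true'
```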
4. Replace the GKE cluster
4.a. Terraform
This is a two-step process: the first step replaces the cluster and node pools, and the second updates the Kubernetes authentication method and secret engine in Vault with the new cluster IP and CA certificate.
- Open a new merge request with the desired changes to the zonal GKE cluster (example MR). If the cluster is to be rebuilt without changes, add a comment in the environment's Terraform configuration.
- Get approval for the merge request, but do not merge it yet.
- Perform a `terraform plan` via Atlantis in the targeted environment to recreate the cluster by commenting in the MR:

  ```
  atlantis plan -p gstg -- -replace module.gke-us-east1-b.google_container_cluster.cluster
  ```

  You should see something like:

  ```
  Terraform will perform the following actions:

  # module.gke-us-east1-b.google_container_cluster.cluster will be replaced, as requested
  ...
  # module.gke-us-east1-b.google_container_node_pool.node_pool["generic-1"] will be replaced due to changes in replace_triggered_by
  ```

- Then `apply` without automerging by commenting in the MR:

  ```
  atlantis apply -p gstg --auto-merge-disabled
  ```

  Once applied, the new cluster and all its node pools should be up (a gcloud spot check is sketched after this list).

- Perform a `terraform plan` via Atlantis in the `vault-production` environment to update the Kubernetes authentication method and secret engine by commenting in the MR:

  ```
  atlantis plan -p vault-production
  ```

- Then `apply` by commenting in the MR:

  ```
  atlantis apply -p vault-production
  ```

  Once applied, Atlantis will merge the MR automatically.
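If you want an out-of-band confirmation that the replacement cluster is healthy before continuing, something along these lines should work; the GCP project is a placeholder for the environment's actual project:

```
# Sketch: the new zonal cluster should report RUNNING (<gcp-project> is a placeholder).
gcloud container clusters describe gstg-us-east1-b \
  --zone us-east1-b --project <gcp-project> \
  --format='value(status,currentMasterVersion)'
```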
4.b. New cluster configuration setup
At this point we have a brand new cluster and we need to point our tooling at it.
- Install `glsh` if it's not installed already.
- Run `glsh kube setup` to set up `kubectl` with the new cluster configuration.
- Validate that we can use the new context and that `kubectl` works with the cluster:

  ```
  glsh kube setup
  glsh kube use-cluster gstg-us-east1-b
  kubectl get pods --all-namespaces
  ```

- Create a new JWT token to re-enable authentication to the cluster via Vault (a verification sketch follows this list):
  - From the `gitlab-helmfiles` repository, pull the latest changes and install the `vault-k8s-secrets` release:

    ```
    git pull
    cd releases/vault-k8s-secrets
    helmfile -e gstg-us-east1-b apply
    ```

  - Get the new JWT token that was just provisioned with the release and save it into Vault:

    ```
    kubectl --namespace vault-k8s-secrets get secret vault-k8s-secrets-token -o jsonpath='{.data.token}' | base64 -d | \
      vault kv put ci/ops-gitlab-net/gitlab-com/gl-infra/config-mgmt/vault-production/kubernetes/clusters/gstg/gstg-us-east1-b service_account_jwt=-
    ```

  - Trigger a new `config-mgmt` pipeline for the `vault-production` environment to update the Kubernetes secrets engine with this new JWT token.
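To double-check that the token actually landed in Vault before triggering the pipeline, a read of the same path should return the JWT (this assumes your Vault token can read that path):

```
# Sketch: verify the freshly stored service account JWT is present in Vault.
vault kv get -field=service_account_jwt \
  ci/ops-gitlab-net/gitlab-com/gl-infra/config-mgmt/vault-production/kubernetes/clusters/gstg/gstg-us-east1-b
```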
4.c. Deploy all workloads
- Add the necessary annotations and labels to the `kube-dns` configmap so that `gitlab-helmfiles` can manage it via Helm:

  ```
  kubectl -n kube-system annotate configmap/kube-dns meta.helm.sh/release-name=kube-dns-extras meta.helm.sh/release-namespace=kube-system
  kubectl -n kube-system label configmap/kube-dns app.kubernetes.io/managed-by=Helm
  ```

- Then deploy all workloads via our existing CI pipelines:
  - From the `gitlab-helmfiles` CI pipelines, find the latest default branch pipeline and re-run the job associated with the rebuilt cluster.
  - From the `tanka-deployments` CI pipelines, trigger a new pipeline for the main branch.
  - After installing the workloads, run `kubectl get pods --all-namespaces` and check that all workloads are working correctly before going to the next step (a quicker filter is sketched after this list).
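As a shortcut for that last check, filtering out healthy pods makes stragglers easier to spot; this is just a convenience sketch:

```
# Sketch: anything listed here is not Running and not Completed, so it needs a closer look.
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
```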
4.d. Deploy gitlab-com
- Remove the `CLUSTER_SKIP` variable from the `ops` mirror CI variables.
- Find the latest pipeline which performed a configuration change to the targeted environment and re-run the job associated with the rebuilt cluster.
- Deploy the correct version of GitLab by running the latest successful `auto-deploy` job:
  - Go to the `#announcements` channel and check the latest successful job for the targeted environment.
  - Re-run the Kubernetes job for the targeted cluster.
- Spot check the cluster to validate that all pods are coming online and remain in a running state:

  ```
  glsh kube use-cluster gstg-us-east1-b
  kubectl get pods --namespace gitlab
  ```

- Verify that we run the same version of GitLab on all clusters (a per-cluster loop is sketched after this list):

  ```
  glsh kube use-cluster gstg-us-east1-b
  kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"
  glsh kube use-cluster gstg-us-east1-c
  kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"
  ```

  The version from both clusters should match.
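When the environment has more clusters than the two shown above, a small loop keeps the comparison readable. This is a sketch only; the cluster names below are illustrative and should be replaced with the environment's actual zonal clusters:

```
# Sketch: print the deployed GitLab version for each cluster (cluster names are examples).
for cluster in gstg-us-east1-b gstg-us-east1-c gstg-us-east1-d; do
  glsh kube use-cluster "$cluster"
  printf '%s: ' "$cluster"
  kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"
  echo
done
```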
5. Resume monitoring
- In this dashboard we should see the pod and container counts for the cluster.
- Remove any silences that were created earlier.
- Validate that no alerts are firing related to this replacement cluster in Alertmanager.
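Silences can be removed in the alerts.gitlab.net UI, or via `amtool` if that is how they were created; the URL below is the same assumption as in step 2:

```
# Sketch: find silences matching the rebuilt cluster, then expire them by ID.
amtool --alertmanager.url=https://alerts.gitlab.net silence query cluster="gstg-us-east1-b"
amtool --alertmanager.url=https://alerts.gitlab.net silence expire <silence-id>
```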
6. Re-enable traffic to the cluster
We now need to restart all HAProxy nodes to re-enable traffic to the cluster.
We want to start them 1 at a time with a 1 minute pause between each one to give the cluster some time to scale up so that we don’t over-saturate it:
```
knife ssh -C 1 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
  'sudo systemctl unmask haproxy.service; sudo systemctl start haproxy.service; sleep 60'
```
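As a final spot check, the same Chef search can confirm HAProxy is serving again on every node in the zone (sketch only):

```
# Sketch: every node should report "active" once traffic is re-enabled.
knife ssh 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
  'systemctl is-active haproxy.service'
```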