Rebuilding a GKE cluster
1. Skip cluster deployments
It's necessary to skip deploying to the cluster while replacing it so that we don't disrupt Auto Deploy.
- Make sure the Auto Deploy pipeline is not active and no deployment is in progress for the targeted environment.
- Identify the name of the cluster we need to skip, using the full name of the GKE cluster, for example `gstg-us-east1-b`.
- Set the `CLUSTER_SKIP` variable to the name of the cluster (`gstg-us-east1-b` for instance) in the `ops` mirror CI variables, from which deployment pipelines are run (a sketch of setting it via the API follows this list).
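Setting the variable through the project's CI/CD settings UI works fine; if you prefer the command line, a minimal sketch using the GitLab CI variables API follows. The project ID and token are placeholders, not real values:

```
# Sketch only: create the CLUSTER_SKIP variable on the ops mirror via the GitLab API.
# <ops-mirror-project-id> and <ops-token> are placeholders.
curl --request POST --header "PRIVATE-TOKEN: <ops-token>" \
  "https://ops.gitlab.net/api/v4/projects/<ops-mirror-project-id>/variables" \
  --form "key=CLUSTER_SKIP" --form "value=gstg-us-east1-b"
```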
2. Pause monitoring
Create silences on alerts.gitlab.net using the following example filters:
```
env="gstg" cluster="gstg-us-east1-b"
env="gstg" region="us-east1-b" alert_class="traffic_cessation"
```
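The silences can be created in the Alertmanager UI, or with `amtool` if you have it pointed at the same Alertmanager. The URL, duration, and comment below are illustrative assumptions:

```
# Sketch: create the two silences from the CLI (values are examples, adjust as needed).
amtool --alertmanager.url=https://alerts.gitlab.net silence add \
  --duration=4h --comment="Rebuilding gstg-us-east1-b" \
  env="gstg" cluster="gstg-us-east1-b"
amtool --alertmanager.url=https://alerts.gitlab.net silence add \
  --duration=4h --comment="Rebuilding gstg-us-east1-b" \
  env="gstg" region="us-east1-b" alert_class="traffic_cessation"
```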
3. Disable traffic to the cluster
All HAProxy nodes in the zone need to be stopped to disable all traffic to the zonal cluster in a timely manner while also not over-saturating Canary, as it doesn't have the capacity to handle the full main stage traffic of a single zone.
This will trigger a graceful stop of all HAProxy nodes in the zone with a forced stop after 5 minutes, 5 at a time with a 1 minute pause after each one:
```
knife ssh -C 5 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
  'sudo systemctl mask haproxy.service; sudo systemctl kill --signal SIGUSR1 haproxy.service; while [ $(systemctl is-active haproxy.service) != "inactive" ] && [ ${i:=1} -lt 150 ]; do sleep 2; i=$((i + 1)); done; sudo systemctl stop haproxy.service; systemctl status haproxy.service; sleep 60'
```
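Once the run finishes, a quick spot check along these lines (a sketch reusing the same Chef search) should report `inactive` for every node in the zone:

```
# Sketch: confirm HAProxy is stopped on all nodes in the zone.
# `|| true` keeps knife from flagging the non-zero exit code of "is-active" on stopped units.
knife ssh 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
  'systemctl is-active haproxy.service || true'
```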
4. Replace the GKE cluster
4.a. Terraform
This is a two-step process: the first step replaces the cluster and node pools, and the second updates the Kubernetes authentication method and secret engine in Vault with the new cluster IP and CA certificate.
- Open a new merge request with the desired changes to the zonal GKE cluster (example MR). If the cluster is to be rebuilt without changes, add a comment in the environment's Terraform configuration.
- Get approval for the merge request, but do not merge it yet.
- Perform a `terraform plan` via Atlantis in the targeted environment to recreate the cluster by commenting in the MR:

  ```
  atlantis plan -p gstg -- -replace module.gke-us-east1-b.google_container_cluster.cluster
  ```

  You should see something like:

  ```
  Terraform will perform the following actions:

  # module.gke-us-east1-b.google_container_cluster.cluster will be replaced, as requested
  ...
  # module.gke-us-east1-b.google_container_node_pool.node_pool["generic-1"] will be replaced due to changes in replace_triggered_by
  ```

- Then `apply` without automerging by commenting in the MR:

  ```
  atlantis apply -p gstg --auto-merge-disabled
  ```

  Once applied, the new cluster and all its node pools should be up (a gcloud spot check is sketched after this list).

- Perform a `terraform plan` via Atlantis in the `vault-production` environment to update the Kubernetes authentication method and secret engine by commenting in the MR:

  ```
  atlantis plan -p vault-production
  ```

- Then `apply` by commenting in the MR:

  ```
  atlantis apply -p vault-production
  ```

  Once applied, Atlantis will merge the MR automatically.
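If you want an out-of-band confirmation that the replacement cluster is healthy before continuing, something along these lines should work; the GCP project is a placeholder for the environment's actual project:

```
# Sketch: the new zonal cluster should report RUNNING (<gcp-project> is a placeholder).
gcloud container clusters describe gstg-us-east1-b \
  --zone us-east1-b --project <gcp-project> \
  --format='value(status,currentMasterVersion)'
```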
4.b. New cluster configuration setup
At this point we have a brand new cluster and we need to point our tooling at it.
- Install `glsh` if it's not installed already.
- Run `glsh kube setup` to set up `kubectl` with the new cluster configuration.
- Validate that we can use the new context and that `kubectl` works with the cluster:

  ```
  glsh kube setup
  glsh kube use-cluster gstg-us-east1-b
  kubectl get pods --all-namespaces
  ```

- Create a new JWT token to re-enable authentication to the cluster via Vault (a verification sketch follows this list):
  - From the `gitlab-helmfiles` repository, pull the latest changes and install the `vault-k8s-secrets` release:

    ```
    git pull
    cd releases/vault-k8s-secrets
    helmfile -e gstg-us-east1-b apply
    ```

  - Get the new JWT token that was just provisioned with the release and save it into Vault:

    ```
    kubectl --namespace vault-k8s-secrets get secret vault-k8s-secrets-token -o jsonpath='{.data.token}' | base64 -d | \
      vault kv put ci/ops-gitlab-net/gitlab-com/gl-infra/config-mgmt/vault-production/kubernetes/clusters/gstg/gstg-us-east1-b service_account_jwt=-
    ```

  - Trigger a new `config-mgmt` pipeline for the `vault-production` environment to update the Kubernetes secrets engine with this new JWT token.
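To double-check that the token actually landed in Vault before triggering the pipeline, a read of the same path should return the JWT (this assumes your Vault token can read that path):

```
# Sketch: verify the freshly stored service account JWT is present in Vault.
vault kv get -field=service_account_jwt \
  ci/ops-gitlab-net/gitlab-com/gl-infra/config-mgmt/vault-production/kubernetes/clusters/gstg/gstg-us-east1-b
```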
4.c. Deploy all workloads
- Add the necessary annotations and labels to the `kube-dns` configmap so that `gitlab-helmfiles` can manage it via Helm:

  ```
  kubectl -n kube-system annotate configmap/kube-dns meta.helm.sh/release-name=kube-dns-extras meta.helm.sh/release-namespace=kube-system
  kubectl -n kube-system label configmap/kube-dns app.kubernetes.io/managed-by=Helm
  ```

- Then deploy all workloads via our existing CI pipelines:
  - From the `gitlab-helmfiles` CI pipelines, find the latest default branch pipeline and re-run the job associated with the rebuilt cluster.
  - From the `tanka-deployments` CI pipelines, trigger a new pipeline for the main branch.
  - After installing the workloads, run `kubectl get pods --all-namespaces` and check that all workloads are working correctly before going to the next step (a quicker filter is sketched after this list).
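As a shortcut for that last check, filtering out healthy pods makes stragglers easier to spot; this is just a convenience sketch:

```
# Sketch: anything listed here is not Running and not Completed, so it needs a closer look.
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
```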
4.d. Deploy gitlab-com
- Remove the `CLUSTER_SKIP` variable from the `ops` mirror CI variables.
- Find the latest pipeline which performed a configuration change to the targeted environment and re-run the job associated with the rebuilt cluster.
- Deploy the correct version of GitLab by running the latest successful `auto-deploy` job:
  - Go to the `#announcements` channel and check the latest successful job for the targeted environment.
  - Re-run the Kubernetes job for the targeted cluster.
- Spot check the cluster to validate that all pods are coming online and remain in a running state:

  ```
  glsh kube use-cluster gstg-us-east1-b
  kubectl get pods --namespace gitlab
  ```

- Verify that we run the same version of GitLab on all clusters (a per-cluster loop is sketched after this list):

  ```
  glsh kube use-cluster gstg-us-east1-b
  kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"
  glsh kube use-cluster gstg-us-east1-c
  kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"
  ```

  The version from both clusters should match.
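When the environment has more clusters than the two shown above, a small loop keeps the comparison readable. This is a sketch only; the cluster names below are illustrative and should be replaced with the environment's actual zonal clusters:

```
# Sketch: print the deployed GitLab version for each cluster (cluster names are examples).
for cluster in gstg-us-east1-b gstg-us-east1-c gstg-us-east1-d; do
  glsh kube use-cluster "$cluster"
  printf '%s: ' "$cluster"
  kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"
  echo
done
```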
5. Resume monitoring
- In this dashboard we should see the pod and container counts for the cluster.
- Remove any silences that were created earlier.
- Validate that no alerts are firing related to this replacement cluster in Alertmanager.
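Silences can be removed in the alerts.gitlab.net UI, or via `amtool` if that is how they were created; the URL below is the same assumption as in step 2:

```
# Sketch: find silences matching the rebuilt cluster, then expire them by ID.
amtool --alertmanager.url=https://alerts.gitlab.net silence query cluster="gstg-us-east1-b"
amtool --alertmanager.url=https://alerts.gitlab.net silence expire <silence-id>
```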
6. Re-enable traffic to the cluster
We now need to restart all HAProxy nodes to re-enable traffic to the cluster.
We want to start them 1 at a time with a 1 minute pause between each one to give the cluster some time to scale up so that we don’t over-saturate it:
```
knife ssh -C 1 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
  'sudo systemctl unmask haproxy.service; sudo systemctl start haproxy.service; sleep 60'
```
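As a final spot check, the same Chef search can confirm HAProxy is serving again on every node in the zone (sketch only):

```
# Sketch: every node should report "active" once traffic is re-enabled.
knife ssh 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
  'systemctl is-active haproxy.service'
```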