
Rebuilding a GKE cluster

Deployments to the cluster must be skipped while it is being replaced so that Auto Deploy is not disrupted.

  1. Make sure the Auto Deploy pipeline is not active and no active deployment is happening for the targeted environment.
  2. Identify the name of the cluster we need to skip. Use the full name of the GKE cluster, for example gstg-us-east1-b.
  3. Set the variable CLUSTER_SKIP to the name of the cluster, gstg-us-east1-b for instance, in the ops mirror CI variables, from which deployment pipelines are run.
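
To illustrate why the full cluster name matters, here is a minimal sketch of how a deployment job might honor CLUSTER_SKIP. The function name and the exact-match comparison are assumptions made for illustration; the runbook does not show how the real pipeline evaluates the variable.

```shell
# Hypothetical guard, not the actual pipeline implementation.
# $1: cluster the job is about to deploy to
# $2: value of CLUSTER_SKIP (may be empty or unset)
should_skip_cluster() {
  # Assumption: the comparison is an exact string match, so a short
  # name such as "us-east1-b" would not match "gstg-us-east1-b".
  [ -n "$2" ] && [ "$1" = "$2" ]
}

# Example use inside a job script:
#   if should_skip_cluster "$CLUSTER" "$CLUSTER_SKIP"; then
#     echo "Skipping $CLUSTER"; exit 0
#   fi
```

Under that assumption, a partial name would leave the cluster in the deployment rotation without any warning.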

Create silences on alerts.gitlab.net using the following example filters:

  • env="gstg" cluster="gstg-us-east1-b"
  • env="gstg" region="us-east1-b" alert_class="traffic_cessation"
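
If you prefer the command line over the web UI, the same filters can be expressed as amtool matchers. The sketch below only prints the commands so they can be reviewed first. The Alertmanager URL, the duration and the comment are assumptions, and whether amtool can reach alerts.gitlab.net directly is not confirmed by this runbook.

```shell
# Assumption: the Alertmanager API is reachable at this URL.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-https://alerts.gitlab.net}"

# Print (do not run) the amtool commands matching the two example filters.
# $1: environment, $2: zone, $3: duration, $4: comment
silence_cmds() {
  environment="$1"
  zone="$2"
  duration="$3"
  comment="$4"
  base="amtool silence add --alertmanager.url=${ALERTMANAGER_URL} --duration=${duration} --comment=\"${comment}\""
  echo "${base} env=${environment} cluster=${environment}-${zone}"
  echo "${base} env=${environment} region=${zone} alert_class=traffic_cessation"
}

# Example: silence_cmds gstg us-east1-b 4h "Rebuilding gstg-us-east1-b"
```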

Stop all HAProxy nodes in the zone to cut off traffic to the zonal cluster promptly without over-saturating Canary, which does not have the capacity to handle the full main stage traffic of a single zone.

The following command triggers a graceful stop of all HAProxy nodes in the zone, 5 at a time, with a forced stop after 5 minutes and a 1 minute pause after each node:

knife ssh -C 5 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
'sudo systemctl mask haproxy.service; sudo systemctl kill --signal SIGUSR1 haproxy.service; while [ $(systemctl is-active haproxy.service) != "inactive" ] && [ ${i:=1} -lt 150 ]; do sleep 2; i=$((i + 1)); done; sudo systemctl stop haproxy.service; systemctl status haproxy.service; sleep 60'
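
The polling loop inside that one-liner is easier to follow when written out. The sketch below mirrors its logic. STATUS_CMD, SLEEP_SECS and MAX_POLLS are hypothetical knobs added here so the loop can be tried without systemd; the original command hardcodes the equivalent values.

```shell
# Defaults match the values used in the knife command.
STATUS_CMD="${STATUS_CMD:-systemctl is-active haproxy.service}"
SLEEP_SECS="${SLEEP_SECS:-2}"
MAX_POLLS="${MAX_POLLS:-150}"

# Poll until the unit reports "inactive" or the poll budget is exhausted.
# Prints the number of sleeps that were performed.
wait_until_inactive() {
  i=1
  while [ "$($STATUS_CMD)" != "inactive" ] && [ "$i" -lt "$MAX_POLLS" ]; do
    sleep "$SLEEP_SECS"
    i=$((i + 1))
  done
  echo $((i - 1))
}
```

With the defaults, the loop sleeps at most 149 times for 2 seconds each, which is where the roughly 5 minute limit before the forced `systemctl stop` comes from.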

This is a two-step process: first replace the cluster and node pools, then update the Kubernetes authentication method and secrets engine in Vault with the new cluster IP and CA certificate.

  1. Open a new merge request with the desired changes to the zonal GKE cluster (example MR). If the cluster is to be rebuilt without changes, add a comment in the environment’s Terraform configuration.

  2. Get approval for the merge request but do not merge it yet.

  3. Perform a terraform plan via Atlantis in the targeted environment to recreate the cluster by commenting in the MR:

    atlantis plan -p gstg -- -replace module.gke-us-east1-b.google_container_cluster.cluster

    You should see something like:

    Terraform will perform the following actions:
    # module.gke-us-east1-b.google_container_cluster.cluster will be replaced, as requested
    ...
    # module.gke-us-east1-b.google_container_node_pool.node_pool["generic-1"] will be replaced due to changes in replace_triggered_by
  4. Then apply without automerging by commenting in the MR:

    atlantis apply -p gstg --auto-merge-disabled

    Once applied, the new cluster and all its node pools should be up.

  5. Perform a terraform plan via Atlantis in the vault-production environment to update the Kubernetes authentication method and secrets engine by commenting in the MR:

    atlantis plan -p vault-production
  6. Then apply by commenting in the MR:

    atlantis apply -p vault-production

    Once applied, Atlantis will merge the MR automatically.
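
The Atlantis comments for the cluster replacement follow a pattern that can be derived from the full cluster name. The helpers below are a sketch: the function names are hypothetical, and the module address pattern is inferred from the single gstg-us-east1-b example in the steps above, so verify it against the Terraform configuration before relying on it for another zone.

```shell
# Print the plan and apply comments for a zonal cluster.
# $1: full cluster name, e.g. gstg-us-east1-b
atlantis_replace_comments() {
  cluster="$1"
  environment="${cluster%%-*}"   # text before the first dash, e.g. gstg
  zone="${cluster#*-}"           # text after the first dash, e.g. us-east1-b
  echo "atlantis plan -p ${environment} -- -replace module.gke-${zone}.google_container_cluster.cluster"
  echo "atlantis apply -p ${environment} --auto-merge-disabled"
}

# Check plan output (read from stdin) for the expected replacement line.
# $1: zone, e.g. us-east1-b
plan_replaces_cluster() {
  grep -Fq "module.gke-$1.google_container_cluster.cluster will be replaced"
}
```

Pasting the plan output into a file and piping it through `plan_replaces_cluster us-east1-b` gives a quick confirmation that the plan targets the intended cluster before you apply.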

At this point we have a brand new cluster and we need to orient our tooling to use it.

  1. Install glsh if it’s not installed already.

  2. Run glsh kube setup to set up kubectl with the new cluster configuration.

  3. Validate we can use the new context and kubectl works with the cluster:

    glsh kube setup
    glsh kube use-cluster gstg-us-east1-b
    kubectl get pods --all-namespaces
  4. Create a new JWT token to re-enable authentication to the cluster via Vault:

    1. From the gitlab-helmfiles repository, pull the latest changes and install the vault-k8s-secrets release:
    git pull
    cd releases/vault-k8s-secrets
    helmfile -e gstg-us-east1-b apply
    2. Get the new JWT token that was just provisioned with the release and save it into Vault:
    kubectl --namespace vault-k8s-secrets get secret vault-k8s-secrets-token -o jsonpath='{.data.token}' | base64 -d | \
    vault kv put ci/ops-gitlab-net/gitlab-com/gl-infra/config-mgmt/vault-production/kubernetes/clusters/gstg/gstg-us-east1-b service_account_jwt=-
    3. Trigger a new config-mgmt pipeline for the vault-production environment to update the Kubernetes secrets engine with this new JWT token.
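
Because the token is piped straight from kubectl into Vault, an empty or malformed secret would be stored without any warning. A light format check before the `vault kv put` can catch that. This is a sketch with a hypothetical helper name: it only checks that the value has three dot-separated base64url segments and does not validate the signature or claims.

```shell
# Read a candidate token on stdin and succeed only if it has the
# three-segment shape of a signed JWT (header.payload.signature).
looks_like_jwt() {
  grep -Eq '^[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$'
}

# Example use, capturing the token first:
#   token=$(kubectl --namespace vault-k8s-secrets get secret vault-k8s-secrets-token \
#             -o jsonpath='{.data.token}' | base64 -d)
#   printf '%s' "$token" | looks_like_jwt || echo "token looks wrong, not saving"
```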
  1. Add the necessary annotations and labels to the kube-dns configmap so that gitlab-helmfiles can manage it via Helm:

    kubectl -n kube-system annotate configmap/kube-dns meta.helm.sh/release-name=kube-dns-extras meta.helm.sh/release-namespace=kube-system
    kubectl -n kube-system label configmap/kube-dns app.kubernetes.io/managed-by=Helm
  2. Then deploy all workloads via our existing CI pipelines:

    1. From gitlab-helmfiles CI pipelines, find the latest default branch pipeline, and re-run the job associated with the rebuilt cluster.
    2. From tanka-deployments CI pipelines, trigger a new pipeline for the main branch.
    3. After installing the workloads, run kubectl get pods --all-namespaces and check that all workloads are working correctly before going to the next step.
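
Scanning the full pod list by eye is error-prone on a freshly deployed cluster. The filter below is a sketch, with a hypothetical function name, that reads `kubectl get pods --all-namespaces --no-headers` output on stdin and prints only the pods that need attention. It assumes the default column order (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Print pods that are neither Completed nor fully ready and Running.
# Output format: <namespace>/<name> <status> <ready>
unhealthy_pods() {
  awk '
    $4 == "Completed" { next }
    {
      split($3, ready, "/")            # READY column looks like 1/2
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2 " " $4 " " $3
    }
  '
}

# Example use:
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

Empty output means every pod is either Completed or Running with all containers ready. Pods that are still starting will show up for a while, so rerun the check before moving on.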
  1. Remove the CLUSTER_SKIP variable from the ops mirror CI variables.

  2. Find the latest pipeline which performed a configuration change to the targeted environment and re-run the job associated with the rebuilt cluster.

  3. Deploy the correct version of GitLab by running the latest successful auto-deploy job:

    1. Go to the #announcements channel and check the latest successful job for the targeted environment.
    2. Re-run the Kubernetes job for the targeted cluster.
  4. Spot check the cluster to validate that all pods are coming online and remain in a running state:

    glsh kube use-cluster gstg-us-east1-b
    kubectl get pods --namespace gitlab
  5. Verify that we run the same version of GitLab on all clusters:

    glsh kube use-cluster gstg-us-east1-b
    kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"
    glsh kube use-cluster gstg-us-east1-c
    kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"

    The version from both clusters should match.
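
When more than two clusters are involved, comparing the versions by eye gets tedious. The helper below is a sketch with a hypothetical name. It takes the version strings you captured with the kubectl command above and reports the first mismatch.

```shell
# $1: reference version; every further argument must equal it.
versions_match() {
  reference="$1"
  [ -n "$reference" ] || { echo "reference version is empty"; return 1; }
  shift
  for v in "$@"; do
    if [ "$v" != "$reference" ]; then
      echo "mismatch: '$v' != '$reference'"
      return 1
    fi
  done
  echo "all clusters on $reference"
}

# Example use, with version_b and version_c captured from each cluster:
#   versions_match "$version_b" "$version_c"
```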

  1. In this dashboard, we should see the pod and container counts for the cluster.
  2. Remove any silences that were created earlier.
  3. Validate that no alerts are firing related to this replacement cluster in Alertmanager.

We now need to restart all HAProxy nodes to re-enable traffic to the cluster.

Start them one at a time, with a 1 minute pause after each one, to give the cluster time to scale up so that we don’t over-saturate it:

knife ssh -C 1 'chef_environment:gstg AND roles:gstg-base-haproxy AND zone:projects\/65580314219\/zones\/us-east1-b' \
'sudo systemctl unmask haproxy.service; sudo systemctl start haproxy.service; sleep 60'
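
It helps to know in advance roughly how long the restart will take. Because each node's session ends with a 60 second sleep and `-C` limits how many sessions run at once, a lower bound is the number of waves multiplied by the pause. The sketch below works through that arithmetic. The function name and the node count in the example are made up for illustration.

```shell
# Lower bound on rollout time in seconds.
# $1: number of nodes, $2: knife ssh concurrency (-C), $3: pause per node
min_rollout_seconds() {
  nodes="$1"
  concurrency="$2"
  pause="$3"
  # Integer ceiling of nodes / concurrency, times the pause.
  echo $(( (nodes + concurrency - 1) / concurrency * pause ))
}

# Example with a hypothetical 12 nodes in the zone:
#   min_rollout_seconds 12 1 60   # restart, one at a time
#   min_rollout_seconds 12 5 60   # stop, five at a time
```

For 12 nodes this gives at least 720 seconds for the restart. For the earlier stop it gives at least 180 seconds, plus however long each node takes to drain.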