Rebuilding a Kubernetes cluster
This page covers replacing a zonal cluster. If you are looking to create a new cluster in a different region or zone, refer to the create new Kubernetes cluster page.
1- Skipping cluster deployment
It's necessary to skip deploying to the cluster while replacing it, so that we don't disrupt Auto Deploy.
- Make sure the Auto Deploy pipeline is not active and no deployment is in progress on the environment.
- Identify the name of the cluster we need to skip; use the full name of the GKE cluster, for example gstg-us-east1-b.
- Set the environment variable CLUSTER_SKIP to the name of the cluster (gstg-us-east1-b for instance). This needs to be set on the ops instance where the pipelines run.
- Don't forget to remove the variable after the maintenance window closes and the cluster is replaced.
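For reference, a minimal sketch of managing the variable through the GitLab CI/CD variables API instead of the UI; the project path and token variable below are hypothetical and must be adjusted to whichever project actually runs the deployer pipelines on the ops instance:
# hypothetical project path and token; point these at the deployer project on ops.gitlab.net
OPS_PROJECT="gitlab-com%2Fgl-infra%2Fk8s-workloads%2Fgitlab-com"
# set CLUSTER_SKIP for the maintenance window
curl --request POST --header "PRIVATE-TOKEN: ${OPS_API_TOKEN}" \
  "https://ops.gitlab.net/api/v4/projects/${OPS_PROJECT}/variables" \
  --form "key=CLUSTER_SKIP" --form "value=gstg-us-east1-b"
# remove it again once the cluster has been replaced
curl --request DELETE --header "PRIVATE-TOKEN: ${OPS_API_TOKEN}" \
  "https://ops.gitlab.net/api/v4/projects/${OPS_PROJECT}/variables/CLUSTER_SKIP"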
2- Removing traffic
- Create a silence on alerts.gitlab.net using the following example filter: cluster=gstg-us-east1-b
- Pause alerts from Dead Man's Snitch:
  - Find the alert named Prometheus - GKE <cluster_name> and hit the pause button
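If you prefer the CLI over the alerts.gitlab.net UI, a silence with the same filter can be created with amtool; this is a sketch and assumes you have amtool configured and authenticated against that Alertmanager:
# create a silence matching the cluster label for the duration of the maintenance window
amtool silence add cluster=gstg-us-east1-b \
  --alertmanager.url=https://alerts.gitlab.net \
  --author="$(whoami)" \
  --comment="gstg-us-east1-b cluster rebuild" \
  --duration="4h"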
2.a Removing traffic from canary
We do this so we don't over-saturate canary when the gstg cluster goes down; canary doesn't have the same capacity as the main stage.
- Start by setting all the canary backends to MAINT mode:
$> declare -a CNY=(`./bin/get-server-state -z $ZONE gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in "${CNY[@]}"; do ./bin/set-server-state -f -z $ZONE gstg maint $server; done
- Fetch all canary backends to validate that they are set to MAINT:
$> ./bin/get-server-state -z $ZONE gstg | grep -E 'cny|canary'
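Optionally, filter for anything that has not reached MAINT yet; this is a sketch that assumes the state column printed by get-server-state contains the literal string MAINT:
# should print nothing once every canary backend is in MAINT
$> ./bin/get-server-state -z $ZONE gstg | grep -E 'cny|canary' | grep -v MAINT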
2.b Removing traffic from main stage
- We now want to remove traffic targeting our main stage for this zone. The command below targets the HAProxy nodes that live in the same zone and sets the backends to MAINT:
$> declare -a MAIN=(`./bin/get-server-state -z b gstg | grep -I -v -E 'cny|canary' | grep 'us-east1-b' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in "${MAIN[@]}"; do ./bin/set-server-state -f -z b -s 60 gstg maint $server; done
- Fetch all main stage backends to validate that they are set to MAINT:
$> ./bin/get-server-state -z b gstg | grep -I -v -E 'cny|canary' | grep 'us-east1-b'
3- Replacing cluster using Terraform
3.a Setting up the tooling
To work with Terraform and the config-mgmt repo, refer to the getting started guide for setting up the needed tooling and for a quick overview of the steps involved.
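As a rough sketch of what the getting started guide walks through (the repository URL and environment directory below are assumptions; follow the guide for the authoritative steps):
# clone config-mgmt and initialise the gstg environment; the tf wrapper comes from the getting started guide
git clone https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt.git
cd config-mgmt/environments/gstg
tf init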
3.b Pull latest changes
- Make sure we have pulled the latest changes from the config-mgmt repository before executing any command.
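For example (assuming the repository's default branch is named master):
cd config-mgmt
git checkout master   # or whatever the default branch is
git pull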
3.c Executing terraform
This is a two-step process: first we replace the cluster, then we recreate the node pools that were removed.
- Perform a terraform plan to validate the cluster recreation. The executor should review the plan and validate the changes before running the apply:
tf plan -replace="module.gke-us-east1-b.google_container_cluster.cluster"
You should see:
Terraform will perform the following actions:
# module.gke-us-east1-b.google_container_cluster.cluster will be replaced, as requested
- Then apply the cluster recreation change:
tf apply -replace="module.gke-us-east1-b.google_container_cluster.cluster"
- Perform an unconstrained terraform plan to validate the creation of the node pools:
tf plan
We should see the addition of various node pools. Refer to the config-mgmt repository for details on the node pools we configure at that moment in time.
If the plan is dirty, consider a targeted apply as necessary to avoid changes outside of the Change Request related to the cluster rebuild work.
- Perform a terraform apply, leveraging a -target if required:
tf apply [-target=<path/to/node_pool{s}>]
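For illustration, a targeted apply might look like the following; the node pool resource address is hypothetical and should be copied from the tf plan output:
# hypothetical resource address; use the real one shown in the plan
tf apply -target='module.gke-us-east1-b.google_container_node_pool.node_pool["default"]'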
3.d New cluster configuration setup
After the Terraform commands have executed we will have a brand new cluster, and we need to point our tooling at it.
- Start with glsh: run glsh kube setup to fetch the cluster configs.
- Validate we can use the new context and kubectl works with the cluster:
glsh kube setup
glsh kube use-cluster gstg-us-east1-b
kubectl get pods --all-namespaces
- Update the new cluster's apiServer IP in the Tanka repository (see the sketch after this list for retrieving the endpoint).
- Configure the Vault Secrets responsible for CI configurations within config-mgmt:
CONTEXT_NAME="$(kubectl config current-context)"
KUBERNETES_HOST="$(kubectl config view -o jsonpath="{.clusters[?(@.name == \"${CONTEXT_NAME}\")].cluster.server}")"
CA_CERT="$(kubectl config view --raw -o jsonpath="{.clusters[?(@.name == \"${CONTEXT_NAME}\")].cluster.certificate-authority-data}" | base64 -d)"
vault kv put ci/ops-gitlab-net/config-mgmt/vault-production/kubernetes/${CONTEXT_NAME##*_} host="${KUBERNETES_HOST}" ca_cert="${CA_CERT}"
- From the config-mgmt/environments/vault-production repo, run tf apply; it will show a config change for the cluster we replaced. Apply this change so Vault knows about the new cluster:
cd config-mgmt/environments/vault-production
tf init -upgrade
tf apply
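A sketch for retrieving the new control-plane endpoint referenced in the Tanka step above; the gcloud invocation assumes you are authenticated against the GCP project that hosts the gstg clusters:
# print the apiServer (control plane) endpoint of the rebuilt cluster
gcloud container clusters describe gstg-us-east1-b --zone us-east1-b --format='value(endpoint)'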
4.a Deploying Workloads
First we bootstrap our cluster with the required CI configurations:
- From the gitlab-helmfiles repo, pull the latest changes, then:
cd releases/00-gitlab-ci-accounts
helmfile -e gstg-us-east1-b apply
- We then need to tend to our Calico management. Execute the following against the new cluster:
kubectl -n kube-system annotate cm calico-node-vertical-autoscaler meta.helm.sh/release-name=calico-node-autoscaler
kubectl -n kube-system annotate cm calico-node-vertical-autoscaler meta.helm.sh/release-namespace=kube-system
kubectl -n kube-system label cm calico-node-vertical-autoscaler app.kubernetes.io/managed-by=Helm
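A quick sanity check that the annotations and label landed before re-running the CI jobs below:
# both the helm annotations and the managed-by label should show up here
kubectl -n kube-system get cm calico-node-vertical-autoscaler \
  -o jsonpath='{.metadata.labels}{"\n"}{.metadata.annotations}{"\n"}'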
Then we can complete setup via our existing CI pipelines:
- From gitlab-helmfiles CI Pipelines, find the latest default branch pipeline, and rerun the job associated with the cluster rebuild
- From tanka-deployments CI Pipelines, find the latest default branch pipeline, and rerun the job associated with the cluster rebuild
- After installing the workloads, run kubectl get pods --all-namespaces and check that all workloads are running correctly before moving to the next step.
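A quick way to surface anything unhealthy instead of eyeballing the full list; this is optional and uses a standard kubectl field selector:
# list only Pods that are not Running or Completed
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded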
4.b Deploy Prometheus Rules
Prometheus has a wide array of recording and alerting rules; these must be deployed, otherwise we may fly blind with some of our metrics.
- Browse to Runbooks CI Pipelines
- Find the latest Pipeline executed against the default branch
- Retry the deploy-rules-{non-}production job
4.c Deploying gitlab-com
- Remove the CLUSTER_SKIP variable from the ops instance
- Find the latest pipeline which performed a configuration change to the staging environment and rerun the job associated with this cluster
  - Note that this will install all releases and configurations but will not deploy the correct version of GitLab; that comes in the following step
  - It'll be easiest to find a pipeline from the most recent MR merge vs surfing through the Pipeline pages
  - MRs are on: gitlab-com
- Deploy the correct version of GitLab
  - Run the latest successful auto-deploy job: go to the announcements channel, check the latest successful job, and re-run the Kubernetes job for the targeted cluster.
- Spot check the cluster to validate the Pods are coming online and remain in a Running state
  - Connect to our replaced cluster:
glsh kube use-cluster gstg-us-east1-b
kubectl get pods --namespace gitlab
4.d Verify we run the same version on all clusters
glsh kube use-cluster gstg-us-east1-b
- In a separate window:
kubectl get configmap gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"
glsh kube use-cluster gstg-us-east1-c
- In a separate window:
kubectl get configmap gitlab-gitlab-chart-info -o jsonpath="{.data.gitlabVersion}"
- Verify the version from cluster c matches the version from cluster b
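A small sketch to compare both clusters in one go, assuming glsh kube use-cluster switches the current kubectl context non-interactively:
for cluster in gstg-us-east1-b gstg-us-east1-c; do
  glsh kube use-cluster "$cluster"
  # the chart info configmap carries the running GitLab version
  echo "$cluster: $(kubectl get configmap gitlab-gitlab-chart-info -o jsonpath='{.data.gitlabVersion}')"
done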
4.e Monitoring
- In this dashboard we should see the number of Pods and containers in the cluster.
- Remove any silences that were created earlier
- Unpause any alerts from Dead Man's Snitch
- Validate no alerts are firing related to this replacement cluster on the Alertmanager
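The silence and alert state can also be checked from the CLI with amtool, assuming it is configured and authenticated against alerts.gitlab.net:
# any silences still scoped to the rebuilt cluster
amtool silence query cluster=gstg-us-east1-b --alertmanager.url=https://alerts.gitlab.net
# any alerts still firing for it
amtool alert query cluster=gstg-us-east1-b --alertmanager.url=https://alerts.gitlab.net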
5- Pushing traffic back to the cluster
- We start with the main stage; we chose zone b as an example:
$> declare -a MAIN=(`./bin/get-server-state -z b gstg | grep -I -v -E 'cny|canary' | grep 'us-east1-b' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in "${MAIN[@]}"; do ./bin/set-server-state -f -z b -s 60 gstg ready $server; done
- Validate all main stage backends are in the READY state:
$> ./bin/get-server-state -z b gstg | grep -I -v -E 'cny|canary' | grep 'us-east1-b'
- Then we set the canary backends to the READY state:
$> declare -a CNY=(`./bin/get-server-state -z b gstg | grep -E 'cny|canary' | awk '{ print substr($3,1,length($3)-1) }' | tr '\n' ' '`)
$> for server in "${CNY[@]}"; do ./bin/set-server-state -f -z b -s 60 gstg ready $server; done
- Validate all canary backends are in the READY state:
$> ./bin/get-server-state -z b gstg | grep -E 'cny|canary'