GKE Cluster Upgrade Procedure
All of our GKE clusters are now set to upgrade automatically. They all use the Regular release channel and will only upgrade themselves within the maintenance windows documented below:
| Environment | Cluster | Upgrade Window 1 | Upgrade Window 2 |
| --- | --- | --- | --- |
| pre | pre-gitlab-gke | 02:00:00 - 08:00:00 MON | 02:00:00 - 08:00:00 TUE |
| gstg | gstg-gitlab-gke | 02:00:00 - 08:00:00 MON | 02:00:00 - 08:00:00 TUE |
| gstg | gstg-us-east1-b | 02:00:00 - 08:00:00 MON | 02:00:00 - 08:00:00 TUE |
| gstg | gstg-us-east1-c | 12:00:00 - 18:00:00 MON | 12:00:00 - 18:00:00 TUE |
| gstg | gstg-us-east1-d | 12:00:00 - 18:00:00 MON | 12:00:00 - 18:00:00 TUE |
| gprd | gprd-gitlab-gke | 02:00:00 - 08:00:00 WED | 02:00:00 - 08:00:00 THU |
| gprd | gprd-us-east1-b | 02:00:00 - 08:00:00 WED | 02:00:00 - 08:00:00 THU |
| gprd | gprd-us-east1-c | 02:00:00 - 08:00:00 THU | 02:00:00 - 08:00:00 FRI |
| gprd | gprd-us-east1-d | 02:00:00 - 08:00:00 THU | 02:00:00 - 08:00:00 FRI |
| ops | gitlab-ops | 02:00:00 - 08:00:00 MON | 02:00:00 - 08:00:00 TUE |
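If you need to double-check what a given cluster is actually configured with, the release channel and maintenance windows are visible on the cluster object. A minimal sketch, using the same `<PROJECT>` and `<CLUSTER>` placeholders as the rest of this document:

```shell
# Show the configured release channel and maintenance windows for a cluster.
gcloud --project <PROJECT> container clusters describe <CLUSTER> \
  --region us-east1 \
  --format "yaml(releaseChannel, maintenancePolicy)"
```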
We have a cloud function called gke-notifications which will add annotations to Grafana every time a GKE auto upgrade takes place.
Our production clusters are currently the only clusters which need to be upgraded manually.
Rollback Procedure (or lack thereof)
:warning: Please make sure to read and understand the following :warning:
Due to the nature of GKE upgrades, there is unfortunately no way for us to roll back. For zonal cluster upgrades, if something goes wrong we can stop specific services from sending traffic to the affected GKE cluster by draining that cluster's service backends from HAProxy, as sketched below.
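For illustration only, draining a cluster's backends at the HAProxy layer looks roughly like the following. The backend name, server name, and socket path here are made up for this sketch; use the HAProxy runbooks for the real service backends and hosts.

```shell
# Hedged sketch: mark one GKE-backed server as draining via the HAProxy
# runtime API (backend/server names and socket path are illustrative).
echo "set server websockets/gke-us-east1-b state drain" | \
  sudo socat stdio /run/haproxy/admin.sock

# Verify the server's state afterwards.
echo "show servers state websockets" | sudo socat stdio /run/haproxy/admin.sock
```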
If we do ever hit issues which would warrant a rollback, the first step is to reach out to Google support with a severity 1 issue to attempt to recover the cluster. In the case of a catastrophic failure, we can destroy the cluster and recreate it using terraform (and bootstrap it following the instructions at https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/k8s-new-cluster.md).
Notes about auto-upgrades being “cancelled”
The short take is that it’s not a problem when this happens.
If a node-pool upgrade doesn’t finish by the time our “maintenance window” is over, GCP “cancels” the upgrade, which sounds a lot more serious than it is. Basically it finishes the node it was upgrading, then leaves the node pool in a state where some nodes are the old version, some are the new, and it will continue the upgrade next maintenance window.
An example
operation-1617690426743-bb7cc7db UPGRADE_NODES us-east1 sidekiq-catchall-1 Operation was aborted:
Timing for that operation (note 08:00 is the time maintenance window stops)
operation-1617690426743-bb7cc7db. DONE 2021-04-06T06:27:06.743928908Z 2021-04-06T08:05:29.438407193Z
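If you want to find these yourself, the upgrade operations (including aborted ones) are visible via `gcloud`; a sketch using the document's usual placeholders:

```shell
# List recent node-pool upgrade operations with their status and timing.
# Aborted upgrades show a statusMessage of "Operation was aborted: ...".
gcloud --project <PROJECT> container operations list \
  --region us-east1 \
  --filter 'operationType=UPGRADE_NODES' \
  --format 'table(name, status, startTime, endTime, statusMessage)'
```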
Now if we look at what happens to the node pool with auto-scaling after an aborted upgrade we see
$ kubectl get nodes | grep sidekiq-catchall
gke-gprd-gitlab-gke-sidekiq-catchall--4055e82f-dm0m   Ready   <none>   23h   v1.18.16-gke.302
gke-gprd-gitlab-gke-sidekiq-catchall--4055e82f-gmjm   Ready   <none>   14h   v1.18.16-gke.302
gke-gprd-gitlab-gke-sidekiq-catchall--4055e82f-kvlb   Ready   <none>   23h   v1.18.16-gke.302
gke-gprd-gitlab-gke-sidekiq-catchall--4055e82f-lsv0   Ready   <none>   40h   v1.18.16-gke.302
gke-gprd-gitlab-gke-sidekiq-catchall--8adf5714-0hou   Ready   <none>   41h   v1.18.16-gke.302
gke-gprd-gitlab-gke-sidekiq-catchall--8adf5714-f4ub   Ready   <none>   41h   v1.18.16-gke.302
gke-gprd-gitlab-gke-sidekiq-catchall--8adf5714-qg4r   Ready   <none>   41h   v1.18.16-gke.302
gke-gprd-gitlab-gke-sidekiq-catchall--8adf5714-wln2   Ready   <none>   41h   v1.18.16-gke.302
gke-gprd-gitlab-gke-sidekiq-catchall--9cb3bfc4-2158   Ready   <none>   14h   v1.18.12-gke.1210
gke-gprd-gitlab-gke-sidekiq-catchall--9cb3bfc4-g0gh   Ready   <none>   36h   v1.18.12-gke.1210
gke-gprd-gitlab-gke-sidekiq-catchall--9cb3bfc4-ps05   Ready   <none>   14h   v1.18.12-gke.1210
New nodes are spun up with the old version (note this was a patch upgrade within the same minor version, from v1.18.12-gke.1210 to v1.18.16-gke.302).
Notes about forced upgrades across minor versions
You can look at the release notes for the Regular release channel at https://cloud.google.com/kubernetes-engine/docs/release-notes-regular. This is important to follow because when all releases of a specific minor version (e.g. 1.16) are removed from a channel, the clusters will be automatically upgraded to the next minor release (e.g. 1.17) during the next maintenance period. This is typically flagged in the release notes with a note similar to:
Auto-upgrading control planes upgrade from versions 1.16 and 1.17 to version 1.17.9-gke.1504 during this release.
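A hedged way to see whether your current minor is still offered in the channel, and what the channel default is (i.e. what a forced upgrade would move you to):

```shell
# Show the REGULAR channel's default version and the versions still offered.
gcloud --project gitlab-pre container get-server-config --region us-east1 --format json \
  | jq '.channels[] | select(.channel == "REGULAR") | {defaultVersion, validVersions}'
```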
Things to take note of when expecting a minor version upgrade
The first thing to do is to check the Kubernetes release notes for the version in question here. In particular, you should carefully read everything under the following sections:
- Known Issues
- Urgent Upgrade Notes
- Deprecations and Removals
- Metrics Changes
Look for anything that might impact APIs, services, or metrics we currently consume.
After a minor upgrade has taken place on a cluster, you should look at all the dashboards in https://dashboards.gitlab.net that have the Kubernetes tag and check that they still work in the upgraded environment (e.g. no missing metrics).
Procedure
The following is the procedure to undertake for the GKE cluster in question, and includes the steps for upgrading both the masters and the individual node pools. It is safe to use as a basis for the steps in the change request, but might need to be altered to suit the environment (e.g. steps duplicated for each node pool).
Step 0.1
The first step is to determine what version of Kubernetes you wish to upgrade your cluster to. To do so, find the highest patch version of the minor release you're upgrading to inside the REGULAR release channel:
gcloud --project gitlab-pre container get-server-config --region us-east1 --format json | jq '.channels[] | select(.channel == "REGULAR")'
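For example, to pull out the highest available patch release of a given minor (1.18 here is just an illustration), you can filter the same output:

```shell
# Highest available patch release of the 1.18 minor in the REGULAR channel.
gcloud --project gitlab-pre container get-server-config --region us-east1 --format json \
  | jq -r '.channels[] | select(.channel == "REGULAR") | .validVersions[]' \
  | grep '^1\.18\.' \
  | sort -V \
  | tail -1
```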
- Copy and paste the below procedure into a Change Request (summary through rollback procedure)
- Fill out the necessary details of the Change Request following our [Change Management Guidelines]
- Modify any `<Merge Request>` with a link to the merge request associated with that step
- Modify `<VERSION>` with the desired version we will be upgrading the GKE cluster to
Step 0.2
- Copy and paste the below sections into a new change request at
- Fill out the necessary details of the Change Request following our [Change Management Guidelines]
- Modify any references to `<CLUSTER>` with the name of the cluster you are upgrading
- Modify any `<Merge Request>` with a link to the merge request associated with that step
- Modify `<VERSION>` with the desired version we will be upgrading the GKE cluster to
- Note for zonal clusters you will need to replace all references to `--region us-east1` with `--zone us-east1-b` (if, for example, upgrading the zonal cluster in `us-east1-b`)
Summary
To upgrade our GKE Cluster `<CLUSTER>` to `<VERSION>`.
Part of <INSERT LINK TO GKE Upgrade Issue>
Step 1: Upgrade masters
- Use the gcloud CLI to upgrade the Kubernetes masters only. The masters must be upgraded before any of the node pools.
gcloud --project <PROJECT> container clusters upgrade <CLUSTER> --cluster-version=<VERSION> --master --region us-east1
This operation can take up to 40 minutes or so. Once it has completed, you can confirm the new version is running on the masters by pointing your `kubectl` at the cluster and running `kubectl version`, specifically looking for `Server Version`. Remember to be connected to the target cluster; instructions for this are here.
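If you want just the server version, a quick check (assuming `jq` is available and your context points at the upgraded cluster):

```shell
# Print only the control plane (Server) version reported by the cluster.
kubectl version -o json | jq -r '.serverVersion.gitVersion'
```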
Step 2: Determine Node Pools to Upgrade
- First, list all the node pools of the cluster:
gcloud --project <PROJECT> container node-pools list --cluster <CLUSTER> --region us-east1
And make a note of the names of all the node pools. Each node pool will need its own step documented as below; the sketch after this paragraph can help generate those steps.
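A small helper sketch (placeholders as above) that emits one upgrade command per node pool, ready to paste into the change request as individual steps:

```shell
# Generate one "upgrade node pool" command per pool in the cluster.
for pool in $(gcloud --project <PROJECT> container node-pools list \
    --cluster <CLUSTER> --region us-east1 --format 'value(name)'); do
  echo "gcloud --project <PROJECT> container clusters upgrade <CLUSTER>" \
       "--cluster-version=<VERSION> --node-pool ${pool} --region us-east1"
done
```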
OPTIONAL Step 3: Upgrade Node Pool
Note that with auto-upgrades enabled on all our clusters, this step really is optional. The default and best solution is to just let nodes auto-upgrade at their leisure.
- Upgrade the node pool by running the following command
gcloud --project <PROJECT> container clusters upgrade <CLUSTER> --cluster-version=<VERSION> --node-pool <NODE POOL NAME> --region us-east1
Note this operation can take multiple hours for a node pool, depending on the size and workloads running on it.
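While a node pool upgrade is running, you can watch progress from the Kubernetes side; GKE nodes carry a `cloud.google.com/gke-nodepool` label, so (assuming your kubectl context is the cluster being upgraded):

```shell
# Watch the nodes of a single pool; the VERSION column moves to the target
# version as nodes are drained and recreated.
kubectl get nodes -l cloud.google.com/gke-nodepool=<NODE POOL NAME> --watch
```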
To confirm the node pool has been upgraded, use `gcloud` to list all the node pools, look at the `NODE_VERSION` column, and confirm the version is correct.
gcloud --project <PROJECT> container node-pools list --cluster <CLUSTER> --region us-east1
Step 4: Update terraform references to new minimum version
Now that the cluster and node pools have been upgraded, we need to make an update in terraform to set the minimum Kubernetes version for that cluster via our GKE module's `kubernetes_version` parameter. This ensures that, should the cluster need to be rebuilt for any reason, it will be built running at least the version we have upgraded to.
Open an MR against the terraform repo for the cluster in question (it's either in `gke-regional.tf` or `gke-zonal.tf`) to bump the `kubernetes_version` parameter to the Kubernetes minor version we just upgraded to. E.g. if we just upgraded to `1.18.12-gke.1206`, change the parameter in terraform to `1.18`.
- MR `<Merge Request>` to be applied via terraform to lock the cluster to the new version as its minimum
Post upgrade Considerations
Once all clusters have been upgraded to a new version, we should look at all our Kubernetes deployment tooling repositories and open issues/MRs against them to upgrade the version of `kubectl` they are using to match the new minor version.