GKE
Overview
Container-Optimized OS (COS) versions are directly tied to the deployed GKE version. Google manages the patching and release of security fixes for these images, and the fixes are applied as nodes are upgraded to newer GKE versions. All of our clusters have automatic node pool upgrades enabled, so we should monitor that our Kubernetes major and minor versions stay within Google’s support window, and allow automatic updates to proceed within each cluster’s release channel. This ensures that nodes remain up to date with security patches.
A non-evictable, or critical, workload may be scheduled on a node and prevent that node from being replaced and upgraded. We should monitor for this situation and plan to move the affected Pods to newer instances when it is safe to do so. Although it should be rare, workloads that are not deployed in a highly available manner (Zoekt being one example) may incur service disruptions while they are being evicted from nodes.
Google publishes a JSON mapping of COS versions to GKE versions.
Example of what to expect during a security update made by Google to a cluster running 1.28.9
Skew detection
Nodes can fall behind on their GKE versions because of the previously mentioned constraints around workload eviction. To detect this, a Prometheus query can be used to find the latest GKE node version on the cluster and to check whether any nodes are older than that.
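The comparison behind skew detection can also be illustrated outside Prometheus. The following is a minimal Python sketch, not the Prometheus query itself. It assumes the node versions have already been collected into a mapping of node name to GKE version string (for example from the VERSION column of `kubectl get nodes`). The function names `parse_gke_version` and `find_skewed_nodes`, the node names, and the version values are hypothetical and used only for illustration.

```python
from __future__ import annotations

import re


def parse_gke_version(version: str) -> tuple[int, ...]:
    """Turn a string such as 'v1.28.9-gke.1000000' into (1, 28, 9, 1000000)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised GKE version: {version!r}")
    # Compare numerically so that 1.28.10 sorts after 1.28.9.
    return tuple(int(part) for part in match.groups())


def find_skewed_nodes(node_versions: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Return the newest node version seen and the nodes running an older one."""
    if not node_versions:
        raise ValueError("no node versions supplied")
    latest = max(node_versions.values(), key=parse_gke_version)
    latest_key = parse_gke_version(latest)
    behind = {
        node: version
        for node, version in node_versions.items()
        if parse_gke_version(version) < latest_key
    }
    return latest, behind


# Hypothetical input: node names and versions are made up for illustration.
example_nodes = {
    "node-a": "v1.28.9-gke.1000000",
    "node-b": "v1.28.8-gke.1095000",
    "node-c": "v1.28.9-gke.1000000",
}
latest_version, lagging = find_skewed_nodes(example_nodes)
print(f"latest node version: {latest_version}")
for node, version in sorted(lagging.items()):
    print(f"{node} is behind at {version}")
```

Parsing the version into integers rather than comparing strings matters here, because a plain string comparison would rank `1.28.10` below `1.28.9`.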
Manually initiated COS upgrades
A vulnerability may be discovered in a COS image for which Google’s remediation propagates through the release channels more slowly than desired. In these cases, refer to the JSON mapping to determine which GKE version contains the fix for the vulnerability, and then initiate cluster upgrades to that version. This may require changing the cluster release channel.
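The lookup described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not a parser for Google's actual file: the record shape (a list of objects with `gke_version` and `cos_image` keys) is hypothetical, as are the function names and all sample values. The sketch only compares COS images within the same milestone as the fixed image, since build numbers from different milestones are not directly comparable.

```python
from __future__ import annotations

import re


def parse_cos_image(image: str) -> tuple[int, ...]:
    """Turn 'cos-109-17800-147-54' into (109, 17800, 147, 54)."""
    match = re.fullmatch(r"cos-(\d+)-(\d+)-(\d+)-(\d+)", image)
    if match is None:
        raise ValueError(f"unrecognised COS image name: {image!r}")
    return tuple(int(part) for part in match.groups())


def parse_gke_version(version: str) -> tuple[int, ...]:
    """Turn '1.28.9-gke.1000000' into (1, 28, 9, 1000000)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def gke_versions_with_fix(mapping: list[dict[str, str]], fixed_image: str) -> list[str]:
    """List GKE versions whose COS image is at or past the fixed image.

    Only entries in the same COS milestone as `fixed_image` are considered.
    The result is sorted from oldest to newest GKE version.
    """
    fixed = parse_cos_image(fixed_image)
    candidates = []
    for entry in mapping:
        image = parse_cos_image(entry["cos_image"])
        if image[0] == fixed[0] and image >= fixed:
            candidates.append(entry["gke_version"])
    return sorted(candidates, key=parse_gke_version)


# Hypothetical mapping: the schema and every value below are assumptions.
example_mapping = [
    {"gke_version": "1.28.9-gke.1209000", "cos_image": "cos-109-17800-218-14"},
    {"gke_version": "1.28.8-gke.1095000", "cos_image": "cos-109-17800-147-38"},
    {"gke_version": "1.28.9-gke.1000000", "cos_image": "cos-109-17800-147-54"},
    {"gke_version": "1.27.13-gke.1000000", "cos_image": "cos-105-17412-294-36"},
]
for version in gke_versions_with_fix(example_mapping, "cos-109-17800-147-54"):
    print(version)
```

The first entry in the returned list is the smallest upgrade that picks up the fix. Whether that version is offered in the cluster's current release channel still has to be checked separately.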
Automation
For day-to-day operations, no action is generally required from SREs to keep nodes up to date with security patches. Google automatically initiates node pool replacements when new versions that address security vulnerabilities become available, within the specified release channel and the defined maintenance windows.
To ensure these updates remain consistently available, however, the infrastructure teams are responsible for keeping the deployed Kubernetes versions within their support window. Upgrades of the Kubernetes version may not always be initiated automatically.
A tool like Renovate could also help initiate Kubernetes version upgrades by automatically creating MRs against the config-mgmt repository, which contains the Terraform used to provision the clusters.