GitLabZonalComponentVersionsOutOfSync

Overview

GitLab components are running different versions across zonal clusters in the same environment for longer than expected.

Rolling deploys transiently produce this state, so the alert only fires when the drift persists past the normal deploy window. A sustained firing usually means one of:

An auto-deploy pipeline is stuck or has failed.
Maintenance on a zonal cluster paused deployments to it (for example a rebuild or scheduled work).
A CI job that applies to one cluster failed while others succeeded.

Services

Kubernetes Service Overview
Owner: Fleet Management

Metrics

The alert is defined in mimir-rules/gitlab-{gprd,gstg}/kube/kubernetes.yml and compares the gitlab_version_info metric across zonal clusters in the same environment.

Alert Behavior

Non-paging severity. Visible in Alertmanager and Slack.
Expected during normal auto-deploys; suppresses itself once versions converge.
Sustained firings past 1 hour warrant investigation.

Severities

Default: S4.
Consider S3 if drift persists past a full deploy cycle and one cluster is stuck on an older release.

Verification

Check the alert graph to identify which cluster is out of sync.

Confirm which version each cluster is running:

for cluster in gprd-us-east1-{b,c,d}; do
  glsh kube use-cluster "$cluster"
  echo -n "$cluster: "
  kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath='{.data.gitlabVersion}'
  echo
done

Compare against the latest successful deploy in the #announcements Slack channel.

Recent changes

Troubleshooting

Check whether auto-deploy is stuck or a CI job failed on the lagging cluster.
If a cluster is intentionally skipped (CLUSTER_SKIP set in the ops mirror CI vars), silence the alert and remove CLUSTER_SKIP when maintenance completes.
If Helm is stuck on a cluster, follow Helm Upgrade is Stuck.
Re-run the Kubernetes deploy job for the lagging cluster from the latest successful auto-deploy pipeline.

Escalation

Delivery / Release Managers if auto-deploy is stuck.
#g_fleet_management for cluster-level issues.

Definitions

Alert rule: mimir-rules/gitlab-gprd/kube/kubernetes.yml
Edit this playbook