GitLabZonalComponentVersionsOutOfSync
Overview
Section titled “Overview”GitLab components are running different versions across zonal clusters in the same environment for longer than expected.
Rolling deploys transiently produce this state, so the alert only fires when the drift persists past the normal deploy window. A sustained firing usually means one of:
- An auto-deploy pipeline is stuck or has failed.
- Maintenance on a zonal cluster paused deployments to it (for example a rebuild or scheduled work).
- A CI job that applies to one cluster failed while others succeeded.
Services
Section titled “Services”Metrics
Section titled “Metrics”The alert is defined in mimir-rules/gitlab-{gprd,gstg}/kube/kubernetes.yml and compares the gitlab_version_info metric across zonal clusters in the same environment.
Alert Behavior
Section titled “Alert Behavior”- Non-paging severity. Visible in Alertmanager and Slack.
- Expected during normal auto-deploys; suppresses itself once versions converge.
- Sustained firings past 1 hour warrant investigation.
Severities
Section titled “Severities”- Default:
S4. - Consider
S3if drift persists past a full deploy cycle and one cluster is stuck on an older release.
Verification
Section titled “Verification”-
Check the alert graph to identify which cluster is out of sync.
-
Confirm which version each cluster is running:
Terminal window for cluster in gprd-us-east1-{b,c,d}; doglsh kube use-cluster "$cluster"echo -n "$cluster: "kubectl get configmap --namespace gitlab gitlab-gitlab-chart-info -o jsonpath='{.data.gitlabVersion}'echodone -
Compare against the latest successful deploy in the
#announcementsSlack channel.
Recent changes
Section titled “Recent changes”Troubleshooting
Section titled “Troubleshooting”- Check whether auto-deploy is stuck or a CI job failed on the lagging cluster.
- If a cluster is intentionally skipped (
CLUSTER_SKIPset in theopsmirror CI vars), silence the alert and removeCLUSTER_SKIPwhen maintenance completes. - If Helm is stuck on a cluster, follow Helm Upgrade is Stuck.
- Re-run the Kubernetes deploy job for the lagging cluster from the latest successful
auto-deploypipeline.
Escalation
Section titled “Escalation”- Delivery / Release Managers if auto-deploy is stuck.
#g_fleet_managementfor cluster-level issues.