Host clock skew / clock not synchronising
HostClockSkew or HostClockNotSynchronising is firing. A host’s clock has
drifted from true time, or its NTP client has stopped synchronising. This can
break TLS validation, token/expiry checks, and database replication, and leave
logs from different hosts misaligned, so it’s worth chasing down even at s3.
GitLab hosts sync time with chrony (chronyd), configured by Chef via the
gitlab-server::ntp-client recipe. Config lives at /etc/chrony/chrony.conf.
1. Find the host and confirm it’s real
The alert labels carry the instance. Check the offending host’s metrics in
the relevant tenant (substitute <host>):

    node_timex_offset_seconds{instance="<host>"}    # offset from true time; > ~0.05s sustained is abnormal
    node_timex_sync_status{instance="<host>"}       # 0 = kernel clock NOT synchronised
    node_timex_maxerror_seconds{instance="<host>"}  # grows as sync is lost

The Node Exporter Full dashboard (rYdddlPWk/node-exporter-full, “Time”
panels) shows the same series; the alert annotation links straight to it.
A single host with a small, recovering offset is usually noise. A host stuck at
sync_status == 0, or offset climbing and not coming back, is real.
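
The triage rule above can be sketched as a small shell helper. This is only an illustration: `classify_skew` is a hypothetical name, it assumes you have read `node_timex_sync_status` and two `node_timex_offset_seconds` samples (an older and a newer one) off the dashboard by hand, and it reuses the rough 0.05s figure from the metric comment as its threshold.

```shell
# classify_skew SYNC_STATUS OLDER_OFFSET NEWER_OFFSET
# Hypothetical helper: applies the "noise vs real" rule of thumb to values
# copied from the node_timex_* series. Offsets are in seconds and may be negative.
classify_skew() {
  awk -v s="$1" -v a="$2" -v b="$3" 'BEGIN {
    # Compare magnitudes, since the clock can be fast or slow.
    if (a < 0) a = -a
    if (b < 0) b = -b
    if (s == 0) { print "real: kernel clock not synchronised"; exit }
    if (b > 0.05 && b >= a) { print "real: offset above threshold and not recovering"; exit }
    print "likely noise: offset small or recovering"
  }'
}
```

For example, `classify_skew 1 0.10 0.20` reports a real problem (the offset is growing), while `classify_skew 1 0.20 0.01` reports likely noise. Two samples are a crude stand-in for reading the trend on the graph, so treat the result as a hint, not a verdict.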
2. Is this a Chef VM or a GKE node?
The remediation is different, so check the instance/fqdn first:
- Chef-managed VM (e.g. *.c.gitlab-production.internal): you can SSH in and fix chrony directly. Go to step 3.
- GKE node (e.g. gke-*): these run Container-Optimized OS and are managed by GCP. You don’t hand-fix time on them. Skip to step 4.
If multiple hosts are skewing at once, don’t bother per-host; that points at an upstream NTP or network problem. Go to step 4.
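
The routing in this step can be sketched as a shell `case` over the instance name. `remediation_step` is a hypothetical helper, the two patterns are just the examples given above, and the `:port` stripping assumes the instance label may carry an exporter port; adjust all of these to what the alert actually shows.

```shell
# remediation_step INSTANCE [INSTANCE...]
# Hypothetical helper: maps alerting instance name(s) to the runbook step to follow.
remediation_step() {
  # Several hosts skewing together points upstream, whatever kind of hosts they are.
  if [ "$#" -gt 1 ]; then
    echo "step 4: multiple hosts, suspect upstream NTP or network"
    return 0
  fi
  host=${1%%:*}   # drop a trailing :port if the instance label has one
  case "$host" in
    gke-*)                          echo "step 4: GKE node, escalate" ;;
    *.c.gitlab-production.internal) echo "step 3: Chef VM, fix chrony on the host" ;;
    *)                              echo "unknown: check the fqdn by hand" ;;
  esac
}
```

The `gke-*` pattern is checked first so that a GKE node is never routed to the SSH path, even if its name also matches the second pattern.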
3. Fix chrony on a Chef VM
SSH to the host and check the daemon:

    systemctl status chronyd.service
    chronyc tracking     # look at System time, Leap status (should be "Normal"), Stratum
    chronyc sources -v   # are upstream sources reachable and selected?

Common cases:
- chronyd is dead or wedged → sudo systemctl restart chronyd.service, then re-check chronyc tracking. The offset should start converging.
- Daemon up but sources unreachable → likely a network/firewall issue reaching the upstream NTP servers (defined in /etc/chrony/chrony.conf). This is not a host-local fix, so go to step 4.
- Host recently rebuilt/restored/cloned with a bad clock → a chronyd restart usually lets it step the clock back into line.
If a restart fixes it and it stays fixed, you’re done. If it drifts again, escalate. The config or the upstream is the problem, not the daemon.
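
To make the before/after comparison less error-prone, the two fields that matter in `chronyc tracking` can be pulled out with a short filter. This is a sketch, not part of the runbook's tooling: `tracking_summary` is a hypothetical name, and it assumes the usual `Field name : value` layout of `chronyc tracking` output.

```shell
# tracking_summary: reads `chronyc tracking` output on stdin and prints one line.
# Hypothetical helper; only checks Leap status and reports the System time offset.
tracking_summary() {
  awk -F' *: ' '
    $1 == "Leap status" { leap = $2; sub(/[ \t]+$/, "", leap) }
    $1 == "System time" { split($2, f, " "); offset = f[1]; dir = f[3] }
    END {
      if (leap == "Normal")
        print "ok leap=" leap " offset=" offset "s " dir
      else
        print "bad leap=" (leap == "" ? "unknown" : leap)
    }'
}
```

Run it as `chronyc tracking | tracking_summary` before the restart and again a few minutes after. A summary that stays at `bad`, or an offset that is not shrinking between runs, is the "drifts again" case that should be escalated.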
4. When you can’t fix it host-side
Escalate to the owning team / SRE on-call (these alerts are severity: s3,
team: sre_reliability) when:
- It’s a GKE node: flag it; the node is GCP-managed. If it’s causing real impact, cordon/drain and let the node be recycled rather than trying to fix time in place.
- Multiple hosts are skewing together: treat it as an upstream NTP or network incident, not a per-host fault.
- A Chef VM’s chronyd is healthy but the clock won’t converge, or it drifts again after a restart. The upstream source or chrony.conf needs attention, which goes through Chef.
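
For the GKE case, the cordon/drain mentioned above might look like the following. This is a hedged sketch: `node_recycle_plan` is a hypothetical helper that only prints the commands for review, it assumes you have `kubectl` access to the affected cluster, and the drain flags shown are common choices, not ones this runbook prescribes. Follow the owning team's own node-recycling procedure if one exists.

```shell
# node_recycle_plan NODE
# Hypothetical helper: prints (does not run) the kubectl commands to take a
# skewed GKE node out of service so it can be recycled.
node_recycle_plan() {
  node=$1
  # Stop new pods from being scheduled onto the node.
  printf 'kubectl cordon %s\n' "$node"
  # Evict existing pods; DaemonSet pods stay, emptyDir data is discarded.
  printf 'kubectl drain %s --ignore-daemonsets --delete-emptydir-data\n' "$node"
}
```

Printing the commands instead of running them keeps a human in the loop: review the output of `node_recycle_plan <node>`, confirm the node name against the alert, then run the commands yourself.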