
GitalyShardWeightsAssignerStale

This alert is triggered when the gitaly-shard-weights-assigner (the shard rebalancer) has not completed a successful run for longer than its staleness threshold. The assigner runs daily as a scheduled pipeline on ops.gitlab.net and reassigns Gitaly storage shard weights based on each shard’s available disk space, so that new repositories are placed on shards with capacity.

When the assigner stops succeeding, shard weights go stale: new repositories continue to be placed according to the last weights written, even as those shards fill up. Over time this erodes disk headroom on the active shards and can leave the fleet without capacity for new repositories. Known causes include:

  • An expired or invalid TARGET_API_PRIVATE_TOKEN (INC-11283).
  • The Prometheus pushgateway being unreachable, so that even a successful run fails to publish its completion metric and the deadman goes stale (a past failure was caused by a changed pushgateway IP).
  • The Mimir query API being unavailable, so the job cannot read available disk space and aborts.

This alert is a deadman switch: it fires on the absence of a recent success rather than on an explicit failure, so it also catches the case where the job stops running entirely.

Refer to the service catalogue for the service owners and escalation.

This alert is based on the job-completion (deadman) metric gitlab_job_success_timestamp_seconds, which the assigner publishes to a Prometheus pushgateway with labels resource="assign_weights", tier="stor", and exported_type="gitaly-shard-weight-assigner". The metric records the Unix timestamp of the last successful run; the alert measures how long ago that was, in seconds.
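As a rough illustration of the deadman calculation, the sketch below derives the age of the last success from the Unix timestamp the metric carries and compares it with a threshold. The function names and the 36-hour value are assumptions made for this example only; the authoritative expression and threshold are the ones defined in code.

```python
import time
from typing import Optional

# Hypothetical threshold for illustration; the real value is defined in code.
ASSUMED_THRESHOLD_SECONDS = 36 * 3600


def last_success_age_seconds(last_success_ts: float, now: Optional[float] = None) -> float:
    """Seconds elapsed since the last successful run (both values are Unix timestamps)."""
    if now is None:
        now = time.time()
    return now - last_success_ts


def is_stale(last_success_ts: float, now: Optional[float] = None,
             threshold: float = ASSUMED_THRESHOLD_SECONDS) -> bool:
    """True when the last success is older than the staleness threshold."""
    return last_success_age_seconds(last_success_ts, now) > threshold
```

Because the check depends only on the timestamp of the last success, it gives the same answer whether the job failed, could not publish its metric, or never ran at all.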

The alert is generated for both production (env="gprd", mimir-gitlab-gprd datasource) and staging (env="gstg", mimir-gitlab-gstg datasource), since a scheduled assigner job runs in each environment.

The threshold is set above the job’s daily cadence so that normal scheduling jitter or a single slow run does not page, while a genuinely missed daily run is still caught promptly. Under normal conditions the last-success timestamp advances every ~24 hours, so the measured age stays well below the threshold.

Explore the underlying metric in Grafana:

The exact query and threshold value are defined in code and may change over time; see the Definitions section for the authoritative values.

This alert fires when the age of the last successful run exceeds the staleness threshold for the rule’s for duration. At that point the daily rebalancing has effectively stopped and weights are no longer being updated.

There are no automated silencing rules. Silence only if the staleness is known and expected, for example while the schedule is intentionally deactivated during a Gitaly fleet expansion (the assigner is paused while new storages are added).

The alert expression and for duration are defined in code and may change over time; see the Definitions section for the authoritative values.

Severity is s4; the alert does not page. Production (gprd) notifications go to #g_tenant_services; staging (gstg) notifications are blackholed.

This alert is intentionally s4 for an initial bake-in period to confirm it is accurate and not noisy. The intention is to promote it to a paging alert (severity: s2 plus pager: pagerduty) once it has proven reliable, because the failure mode it guards against caused a real incident (INC-11283) and warrants paging.

There is no immediate end-user impact when the alert first fires, but prolonged staleness directly contributed to a disk-saturation incident on the Gitaly fleet. To gauge severity, check how long the job has been stale (a single missed run versus weeks) and the current disk headroom on the active Gitaly shards.

  • The assigner is intentionally deactivated during Gitaly fleet expansions while new storages are provisioned, then reactivated; an expansion in progress is a common benign reason for staleness. See the related production change requests in gitlab-com/gl-infra/production.
  • The job’s schedule and credentials are configured in the gitaly-shard-weights-assigner project on ops.gitlab.net (pipeline schedule variables and CI/CD settings).

Under normal operation the daily run keeps the last-success age well below the threshold. Staleness almost always traces back to the job no longer being able to run to completion. Prioritize finding why the most recent run did not succeed.

  1. Open the latest scheduled pipeline and read the most recent job log for the failure or stall reason.
  2. Check the token first: an expired or invalid TARGET_API_PRIVATE_TOKEN was the INC-11283 root cause. Confirm the token is present and valid, and rotate it if needed.
  3. Check pushgateway reachability. If the run succeeds but the completion metric is not published (e.g. a changed pushgateway IP or a firewall change), the deadman will still go stale. Verify the gitlab_job_success_timestamp_seconds{resource="assign_weights"} series is advancing.
  4. Check the Mimir query path. If the job cannot read shard disk-space metrics it will abort before writing weights.
  5. Rule out an intentional pause: confirm the schedule has not been deactivated for an in-progress fleet expansion.
  6. Once the cause is fixed, re-run the schedule manually and confirm the run succeeds and the last-success timestamp advances.
  • INC-11283: a bad/expired TARGET_API_PRIVATE_TOKEN caused the job to silently stop running, contributing to Gitaly disk saturation. Resolved by fixing the token and restoring the daily run. Related: production#22346.
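
To verify that the last-success series is advancing (steps 3 and 6 of the troubleshooting list), one option is to compare the metric value before and after a manual re-run. The sketch below is a minimal example that assumes the value is fetched as a Prometheus-compatible instant-query response for the selector shown; how the response is fetched (endpoint, authentication) is left out because it is environment-specific, and the helper names are hypothetical.

```python
import json
from typing import Optional

# Selector taken from the troubleshooting step above.
QUERY = 'gitlab_job_success_timestamp_seconds{resource="assign_weights"}'


def newest_success_timestamp(response_body: str) -> Optional[float]:
    """Return the newest sample value from an instant-query response, or None."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return None
    results = payload.get("data", {}).get("result", [])
    # Each vector sample looks like {"metric": {...}, "value": [eval_time, "value_as_string"]}.
    values = [float(sample["value"][1]) for sample in results]
    return max(values) if values else None


def has_advanced(previous_ts: Optional[float], current_ts: Optional[float]) -> bool:
    """True when the current last-success timestamp is newer than the previous one."""
    if current_ts is None:
        return False
    return previous_ts is None or current_ts > previous_ts
```

An empty result (no series returned) is treated as "not advanced", which matches the case where the job ran but could not reach the pushgateway.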

Internal and external dependencies which could potentially cause this alert:

  • ops.gitlab.net scheduled pipelines: the runner and schedule that execute the daily job; a deactivated schedule or runner outage stops the job entirely.
  • TARGET_API_PRIVATE_TOKEN: the credential used to read and write repository storage weights via the GitLab API; an expired or invalid token is the most common cause.
  • Prometheus pushgateway: receives the job-completion metric; if unreachable, the deadman goes stale even on otherwise successful runs.
  • Mimir/Prometheus query API: provides the shard available-disk-space metrics the job needs to compute weights.

If the issue cannot be resolved quickly, escalate to the appropriate engineering or operations team for further investigation. Refer to the service catalogue for the service owners and escalation.

The definition for this alert can be found at:

The staleness threshold assumes the job’s daily cadence; keep it above ~24h so normal jitter does not flap the alert, and below ~48h so a single missed run is still caught promptly.
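
The reasoning behind the ~24h and ~48h bounds can be checked with simple arithmetic. The sketch below is illustrative only: the function names and the example thresholds are assumptions, and it ignores the rule's for duration, which delays firing further.

```python
DAILY_CADENCE_HOURS = 24.0


def peak_age_hours(missed_runs: int, jitter_hours: float = 0.0) -> float:
    """Largest last-success age reached just before the next success lands.

    With no missed runs the age peaks at one cadence plus jitter; each missed
    run adds another full cadence.
    """
    return DAILY_CADENCE_HOURS * (1 + missed_runs) + jitter_hours


def fires(threshold_hours: float, missed_runs: int, jitter_hours: float = 0.0) -> bool:
    """True when the peak age would exceed the threshold."""
    return peak_age_hours(missed_runs, jitter_hours) > threshold_hours
```

With a hypothetical 36h threshold, `fires(36, 0, 2)` is false (two hours of jitter is tolerated) and `fires(36, 1)` is true (one missed run is caught). A 24h threshold would flap on any jitter, since `fires(24, 0, 1)` is true, while a 50h threshold would let a single missed run pass, since `fires(50, 1)` is false.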