
GitalyShardWeightsAssignerFailed

This alert is triggered when the gitaly-shard-weights-assigner (the shard rebalancer) reports a failure on its most recent run. The assigner runs daily as a scheduled pipeline on ops.gitlab.net and reassigns Gitaly storage shard weights based on each shard’s available disk space, so that new repositories are placed on shards with capacity.

When a run fails, shard weights are not updated. A single transient failure is usually harmless because the next daily run corrects it, but persistent failures mean weights stop tracking real disk usage: new repositories continue to be placed according to stale weights even as shards fill up, eroding disk headroom on the active shards. Known causes include:

  • An expired or invalid TARGET_API_PRIVATE_TOKEN (INC-11283).
  • Mimir query authentication errors, so the job cannot read available disk space.
  • The Prometheus pushgateway being unreachable.

Unlike its companion GitalyShardWeightsAssignerStale deadman alert (which fires on the absence of a recent success), this alert fires on an explicit failure signal from the job.

Refer to the service catalogue for the service owners and escalation.

This alert is based on the job-failure metric gitlab_job_failed, which the assigner publishes to a Prometheus pushgateway with labels resource="assign_weights", tier="stor", and exported_type="gitaly-shard-weight-assigner". It is a boolean gauge: a run sets it to 1 on failure, and a subsequent successful run clears it back to 0.
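
As a rough illustration of what a boolean failure gauge looks like on the wire, the sketch below renders `gitlab_job_failed` in the Prometheus text exposition format. It is not the assigner's actual publishing code, which this runbook does not show. The helper names are hypothetical, and only two of the labels are included for brevity.

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_failure_flag(failed: bool, labels: dict) -> str:
    """Render gitlab_job_failed as a gauge: 1 on failure, 0 on success."""
    label_str = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return (
        "# TYPE gitlab_job_failed gauge\n"
        f"gitlab_job_failed{{{label_str}}} {1 if failed else 0}\n"
    )


# A failed run sets the flag to 1; a later successful run would render 0.
print(render_failure_flag(True, {"resource": "assign_weights", "tier": "stor"}))
```

A body like this is normally sent to a pushgateway with an HTTP PUT or POST to its `/metrics/job/<job_name>` path; the job name the assigner uses is not stated here.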

The alert is generated for both production (env="gprd", mimir-gitlab-gprd datasource) and staging (env="gstg", mimir-gitlab-gstg datasource), since a scheduled assigner job runs in each environment.

The for duration is set so that a same-day failure has a chance to self-recover on the next scheduled (daily) run before the alert fires, while a failure that persists across runs still escalates. Under normal conditions a successful run keeps the flag at 0.

Explore the underlying metric in Grafana:

  • Job failure flag (gprd): the raw gitlab_job_failed{resource="assign_weights", env="gprd"} series this alert evaluates; values flapping between 1 and 0 indicate runs failing and then recovering.
  • Job failure flag (gstg): the same flag in staging.
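
To check the flag outside Grafana, the same selector can be sent to a Prometheus-compatible instant-query endpoint. The sketch below only builds the request URL. The base URL is a placeholder, `/api/v1/query` is the standard Prometheus HTTP API route (assumed here to be served under a `/prometheus` prefix), and authentication is deliberately left out.

```python
from urllib.parse import urlencode


def build_selector(env: str) -> str:
    """Return the PromQL selector for the failure flag in one environment."""
    return f'gitlab_job_failed{{resource="assign_weights", env="{env}"}}'


def instant_query_url(base_url: str, env: str) -> str:
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    query = urlencode({"query": build_selector(env)})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


# Hypothetical base URL: substitute the real query endpoint and add auth headers.
url = instant_query_url("https://mimir.example.internal/prometheus", "gprd")
print(url)
```

A returned value of 1 means the most recent run reported a failure; 0 means the last run cleared the flag.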

The exact query and for duration are defined in code and may change over time; see the Definitions section for the authoritative values.

This alert fires when the job-failure flag stays at 1 for the rule’s for duration, i.e. a failure that is not cleared by a subsequent successful run within that window.
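
The "stays at 1 for the for duration" condition can be modelled with a small simulation. This is a simplified sketch of how a `for` clause behaves, not the rule evaluator itself: it ignores missing series and staleness, and the 6-hour window and hourly evaluation interval are made-up values, not the rule's real settings.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Sample = Tuple[datetime, int]  # (evaluation time, flag value: 0 or 1)


def first_firing_time(samples: Iterable[Sample], for_duration: timedelta) -> Optional[datetime]:
    """Return the first evaluation time at which the flag has been 1
    continuously for at least for_duration, or None if that never happens."""
    pending_since: Optional[datetime] = None
    for ts, value in samples:
        if value == 1:
            if pending_since is None:
                pending_since = ts  # condition just became true: start pending
            if ts - pending_since >= for_duration:
                return ts
        else:
            pending_since = None  # a successful run cleared the flag: reset
    return None


# Illustrative values only: hourly evaluations and a made-up 6-hour window.
start = datetime(2024, 1, 1, 0, 0)
hourly = [start + timedelta(hours=h) for h in range(12)]

# Flag set at 02:00 and cleared at 05:00 by a successful re-run: never fires.
recovered = [(t, 1 if 2 <= i < 5 else 0) for i, t in enumerate(hourly)]
# Flag set at 02:00 and never cleared: fires once it has held for 6 hours.
persistent = [(t, 1 if i >= 2 else 0) for i, t in enumerate(hourly)]

print(first_firing_time(recovered, timedelta(hours=6)))   # None
print(first_firing_time(persistent, timedelta(hours=6)))  # 2024-01-01 08:00:00
```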

There are no automated silencing rules. Silence only if the failure is known and a fix is already in flight.

The alert expression and for duration are defined in code and may change over time; see the Definitions section for the authoritative values.

Severity is s4, which does not page. Production (gprd) notifications go to #g_tenant_services; staging (gstg) notifications are blackholed.

This alert is intentionally s4 for an initial bake-in period to confirm it is accurate and not noisy. The intention is to promote it to a paging alert (severity: s2 plus pager: pagerduty) once it has proven reliable, because the failure mode it guards against caused a real incident (INC-11283) and warrants paging.

There is no immediate end-user impact from a single failed run, but persistent failures stop weight rebalancing and have already contributed to a disk-saturation incident on the Gitaly fleet (INC-11283). To gauge severity, determine whether the failure is transient or recurring and check the current disk headroom on the active Gitaly shards.

Under normal operation the daily run succeeds and the failure flag stays at 0. A failure that persists past the for window almost always points to a broken credential or dependency rather than a transient blip. Read the job log first.

  1. Open the latest pipeline job log via the pipeline schedules and read the failure reason.
  2. Map the failure to its likely cause:
    • Authorization errors against the GitLab API → expired/invalid TARGET_API_PRIVATE_TOKEN (the INC-11283 root cause); rotate the token.
    • Mimir query auth/connection errors → the job cannot read shard disk-space metrics; check the Mimir query credentials and endpoint.
    • Pushgateway connection errors → the completion/failure metric cannot be published (e.g. a changed pushgateway IP or firewall change).
  3. Once the cause is fixed, re-run the schedule manually and confirm the next run succeeds and clears the failure flag.
  • INC-11283: a bad/expired TARGET_API_PRIVATE_TOKEN caused the job to silently stop running, contributing to Gitaly disk saturation. Resolved by fixing the token and restoring the daily run. Related: production#22346.
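
The log-to-cause mapping in the troubleshooting steps above can be sketched as a first-pass triage helper. The patterns are hypothetical: the job's real log messages are not quoted in this runbook, so adjust them after reading an actual failing job log.

```python
import re

# Hypothetical patterns, checked in order. Pushgateway comes first because
# "Prometheus pushgateway" would otherwise match the Mimir/Prometheus rule.
CAUSE_PATTERNS = [
    ("pushgateway unreachable: check its address and firewall rules",
     re.compile(r"pushgateway", re.IGNORECASE)),
    ("Mimir query auth/connection error: check query credentials and endpoint",
     re.compile(r"mimir|prometheus", re.IGNORECASE)),
    ("GitLab API authorization error: rotate TARGET_API_PRIVATE_TOKEN",
     re.compile(r"\b(401|403)\b|unauthorized|forbidden", re.IGNORECASE)),
]


def likely_cause(log_text: str) -> str:
    """Return the first matching cause for a job log, or a fallback hint."""
    for cause, pattern in CAUSE_PATTERNS:
        if pattern.search(log_text):
            return cause
    return "unrecognized failure: read the full job log"


print(likely_cause("PUT /api/v4/application/settings: 401 Unauthorized"))
```

Keyword matching like this only narrows the search; the job log remains the source of truth for the actual failure reason.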

Internal and external dependencies which could potentially cause this alert:

  • TARGET_API_PRIVATE_TOKEN: the credential used to read and write repository storage weights via the GitLab API; an expired or invalid token is the most common cause of failure.
  • Mimir/Prometheus query API: provides the shard available-disk-space metrics the job needs to compute weights; auth or connectivity errors fail the run.
  • Prometheus pushgateway: receives the job-failure/job-completion metrics; connectivity errors fail the publish step.
  • ops.gitlab.net scheduled pipelines: the runner and schedule that execute the daily job.

If the issue cannot be resolved quickly, escalate to the appropriate engineering or operations team for further investigation. Refer to the service catalogue for the service owners and escalation.

The definition for this alert can be found at:

The for duration lets a same-day failure self-recover on the next daily run; keep it below the daily cadence so that failures persisting across runs still fire the alert.