GitalyShardWeightsAssignerFailed

Overview

This alert is triggered when the gitaly-shard-weights-assigner (the shard rebalancer) reports a failure on its most recent run. The assigner runs daily as a scheduled pipeline on ops.gitlab.net and reassigns Gitaly storage shard weights based on each shard’s available disk space, so that new repositories are placed on shards with capacity.

When a run fails, shard weights are not updated. A single transient failure is usually harmless because the next daily run corrects it, but persistent failures mean weights stop tracking real disk usage: new repositories continue to be placed according to stale weights even as shards fill up, eroding disk headroom on the active shards. Known causes include:

An expired or invalid TARGET_API_PRIVATE_TOKEN (INC-11283).
Mimir query authentication errors, so the job cannot read available disk space.
The Prometheus pushgateway being unreachable.

Unlike its companion GitalyShardWeightsAssignerStale deadman alert (which fires on the absence of a recent success), this alert fires on an explicit failure signal from the job.

Services

Refer to the service catalogue for the service owners and escalation.

Metrics

This alert is based on the job-failure metric gitlab_job_failed, which the assigner publishes to a Prometheus pushgateway with labels resource="assign_weights", tier="stor", and exported_type="gitaly-shard-weight-assigner". It is a boolean gauge: a run sets it to 1 on failure, and a subsequent successful run clears it back to 0.
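To make the boolean-gauge behaviour concrete, the following is a minimal sketch of what a run could send to a pushgateway. It only renders the Prometheus text exposition body and the pushgateway URL path. The job name, the helper names, and the choice to include only the resource and tier labels are assumptions for illustration; this is not the assigner's actual publishing code, and the exported_type label is left out because how it is set on the pushing side is not described here.

```python
from urllib.parse import quote

METRIC = "gitlab_job_failed"


def render_failure_flag(failed: bool, labels: dict[str, str]) -> str:
    """Render the gauge in Prometheus text exposition format.

    Label values are assumed not to need escaping (no quotes or backslashes).
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    value = 1 if failed else 0  # 1 on failure, 0 once a later run succeeds
    return f"# TYPE {METRIC} gauge\n{METRIC}{{{label_str}}} {value}\n"


def push_path(job: str) -> str:
    """Pushgateway grouping path of the form /metrics/job/<job>."""
    return f"/metrics/job/{quote(job, safe='')}"


# Hypothetical label set and job name, for illustration only.
LABELS = {"resource": "assign_weights", "tier": "stor"}
body_failed = render_failure_flag(True, LABELS)
body_ok = render_failure_flag(False, LABELS)
path = push_path("gitaly_shard_weights_assigner")
```

A failed run would send body_failed to that path and a later successful run would send body_ok, which is what clears the flag back to 0.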

The alert is generated for both production (env="gprd", mimir-gitlab-gprd datasource) and staging (env="gstg", mimir-gitlab-gstg datasource), since a scheduled assigner job runs in each environment.

The for duration is set so that a same-day failure has a chance to self-recover on the next scheduled (daily) run before the alert fires, while a failure that persists across runs still escalates. Under normal conditions a successful run keeps the flag at 0.

Explore the underlying metric in Grafana:

Job failure flag (gprd): the raw gitlab_job_failed{resource="assign_weights", env="gprd"} series this alert evaluates; values flapping between 1 and 0 indicate runs failing and then recovering.
Job failure flag (gstg): the same flag in staging.

The exact query and for duration are defined in code and may change over time; see the Definitions section for the authoritative values.

Alert Behavior

This alert fires when the job-failure flag stays at 1 for the rule’s for duration, i.e. a failure that is not cleared by a subsequent successful run within that window.
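The effect of the for duration can be sketched as a small simulation. The 6-hour for value, the hourly evaluation interval, and the function name below are hypothetical and chosen only to keep the example short; the real values live in the rule definition.

```python
from datetime import datetime, timedelta


def firing_since(samples, for_duration):
    """Return the evaluation time at which the alert would start firing, or None.

    samples: (timestamp, value) pairs sorted by time, one per rule evaluation.
    The flag must stay at 1 continuously for `for_duration` before firing.
    """
    pending_since = None
    for ts, value in samples:
        if value == 1:
            if pending_since is None:
                pending_since = ts  # alert enters the pending state
            if ts - pending_since >= for_duration:
                return ts
        else:
            pending_since = None  # a successful run resets the pending timer
    return None


start = datetime(2024, 1, 1)
FOR = timedelta(hours=6)  # hypothetical, not the real rule value


def at(hour):
    return start + timedelta(hours=hour)


# Flag is set for three hours, then a successful run clears it.
recovered = [(at(h), 1 if h < 3 else 0) for h in range(12)]
# Flag stays set because no run succeeds.
persistent = [(at(h), 1) for h in range(12)]

print(firing_since(recovered, FOR))   # None: cleared before the window elapsed
print(firing_since(persistent, FOR))  # 2024-01-01 06:00:00
```

The first series never fires because the successful run resets the pending timer; the second fires once the flag has been at 1 for the whole window.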

There are no automated silencing rules. Silence only if the failure is known and a fix is already in flight.

The alert expression and for duration are defined in code and may change over time; see the Definitions section for the authoritative values.

Severities

s4, does not page. Production (gprd) notifications go to #g_tenant_services; staging (gstg) notifications are blackholed.

This alert is intentionally s4 for an initial bake-in period to confirm it is accurate and not noisy. The intention is to promote it to a paging alert (severity: s2 plus pager: pagerduty) once it has proven reliable, because the failure mode it guards against caused a real incident (INC-11283) and warrants paging.

There is no immediate end-user impact from a single failed run, but persistent failures stop weight rebalancing, which contributed to a disk-saturation incident on the Gitaly fleet (INC-11283). To gauge severity, determine whether the failure is transient or recurring and check the current disk headroom on the active Gitaly shards.

Verification

Inspect the most recent run on the gitaly-shard-weights-assigner pipeline schedules and confirm whether the latest run failed.
Open the failing pipeline’s job log to read the exact failure reason.
Query gitlab_job_failed{resource="assign_weights", env="gprd"} to confirm the flag is set and to see whether prior runs have been flapping between success and failure.
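The flapping check in the last step can also be scripted against a Prometheus-compatible query_range endpoint. This is a minimal sketch: the base URL, the helper names, and the sample payload are made up for illustration, and any authentication or tenant headers that the real datasource requires are not shown.

```python
import json
from urllib.parse import urlencode

QUERY = 'gitlab_job_failed{resource="assign_weights", env="gprd"}'


def range_query_url(base_url, query, start, end, step="1h"):
    """Build a query_range URL for the Prometheus HTTP API."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def summarize_flag(response_json):
    """Return (currently_failed, transitions) for the first series, or None if empty.

    transitions counts how often the flag changed value, so a high number
    indicates runs flapping between failure and success.
    """
    result = json.loads(response_json)["data"]["result"]
    if not result:
        return None
    values = [int(float(v)) for _, v in result[0]["values"]]
    transitions = sum(1 for a, b in zip(values, values[1:]) if a != b)
    return (values[-1] == 1, transitions)


# Hypothetical response in the matrix shape returned by query_range.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "matrix", "result": [{
        "metric": {"__name__": "gitlab_job_failed", "env": "gprd"},
        "values": [[1700000000, "0"], [1700003600, "1"],
                   [1700007200, "0"], [1700010800, "1"]],
    }]},
})

print(summarize_flag(sample))  # (True, 3)
```

In the sample, the flag is currently set and has changed value three times, which is the flapping pattern described above.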

Recent changes

The assigner is intentionally deactivated during Gitaly fleet expansions while new storages are provisioned, then reactivated. The job’s schedule and credentials are configured in the gitaly-shard-weights-assigner project on ops.gitlab.net.
Related production change requests are tracked in gitlab-com/gl-infra/production.

Troubleshooting

Under normal operation the daily run succeeds and the failure flag stays at 0. A failure that persists past the for window almost always points to a broken credential or dependency rather than a transient blip. Read the job log first.

Open the latest pipeline job log via the pipeline schedules and read the failure reason.
Map the failure to its likely cause:
- Authorization errors against the GitLab API → expired/invalid TARGET_API_PRIVATE_TOKEN (the INC-11283 root cause); rotate the token.
- Mimir query auth/connection errors → the job cannot read shard disk-space metrics; check the Mimir query credentials and endpoint.
- Pushgateway connection errors → the completion/failure metric cannot be published (e.g. a changed pushgateway IP or firewall change).
Once the cause is fixed, re-run the schedule manually and confirm the next run succeeds and clears the failure flag.
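The mapping from failure reason to likely cause can be sketched as a small triage helper. The log substrings matched here (HTTP 401/403, "connection refused", and the words "mimir" and "pushgateway") are hypothetical; the real job log may word its errors differently, so treat this as an illustration of the decision order rather than a working parser for the assigner's output.

```python
import re

TOKEN = "expired/invalid TARGET_API_PRIVATE_TOKEN: rotate the token"
MIMIR = "Mimir query auth/connection error: check query credentials and endpoint"
PUSHGATEWAY = "pushgateway unreachable: check its address and firewall rules"
UNKNOWN = "no known pattern matched: read the full job log"

# Hypothetical error wording; adjust to what the job log actually prints.
AUTH_RE = re.compile(r"\b(401|403)\b|unauthorized|forbidden")
CONN_RE = re.compile(r"connection refused|timed out|no route to host")


def likely_cause(log_text: str) -> str:
    """Return the first likely cause found in the job log."""
    for line in log_text.splitlines():
        lower = line.lower()
        is_auth = bool(AUTH_RE.search(lower))
        is_conn = bool(CONN_RE.search(lower))
        # Dependency-specific lines are checked first, because an auth
        # error can come from either Mimir or the GitLab API.
        if "mimir" in lower and (is_auth or is_conn):
            return MIMIR
        if "pushgateway" in lower and is_conn:
            return PUSHGATEWAY
        if is_auth:
            return TOKEN
    return UNKNOWN


print(likely_cause("api request failed: 401 Unauthorized"))
```

Checking the dependency-specific patterns before the generic authorization pattern keeps a Mimir auth error from being misread as a token problem.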

Possible Resolutions

INC-11283: a bad or expired TARGET_API_PRIVATE_TOKEN caused the job to silently stop running, which contributed to Gitaly disk saturation. Resolved by fixing the token and restoring the daily run. Related: production#22346.

Dependencies

Internal and external dependencies which could potentially cause this alert:

TARGET_API_PRIVATE_TOKEN: the credential used to read and write repository storage weights via the GitLab API; an expired or invalid token is the most common cause of failure.
Mimir/Prometheus query API: provides the shard available-disk-space metrics the job needs to compute weights; auth or connectivity errors fail the run.
Prometheus pushgateway: receives the job-failure/job-completion metrics; connectivity errors fail the publish step.
ops.gitlab.net scheduled pipelines: the runner and schedule that execute the daily job.

Escalation

If the issue cannot be resolved quickly, escalate to the appropriate engineering or operations team for further investigation. Refer to the service catalogue for the service owners and escalation.

Definitions

The definition for this alert can be found at:

The for duration lets a same-day failure self-recover on the next daily run; keep it below the daily cadence so that failures persisting across runs still fire the alert.

Related Links

Gitaly service
GitalyShardWeightsAssignerStale: the companion deadman alert that fires when no successful run is seen
GitLab Job Completion: the deadman/job-completion metric convention this alert is built on
Related alerts