GitalyRequestsDropped

Overview

Gitaly is dropping incoming requests on a node because its concurrency queue is full. When the per-RPC concurrency limiter (or the adaptive limiter) has reached its maximum queue size, new requests are rejected immediately with GRPC::ResourceExhausted. This is measured by the gitaly_requests_dropped_total counter.

Each dropped request fails with a user-facing error.

Services

Service Overview
Team that owns the service: Tenant Scale:Gitaly Team

Metrics

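The alert is based on the following query, which returns the rate of dropped requests by node and reason. Replace <gitaly-node-here> with a node's FQDN to inspect that node: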

sum by (fqdn, reason) (rate(gitaly_requests_dropped_total{env="gprd", fqdn="<gitaly-node-here>"}[5m]))

The alert fires when rate(gitaly_requests_dropped_total[5m]) > 0 for 5 minutes on any node.

Alert Behavior

Under normal conditions this metric is zero. Any sustained non-zero rate means requests are actively being dropped.

Severities

The alert is currently configured as s3 to allow a period of monitoring and tuning before it is upgraded to s2. Even so, dropped requests indicate active user impact and can escalate into a wider incident.

Verification

- Gitaly Service Overview dashboard
- Gitaly host detail dashboard
  - See the “gitaly per-RPC metrics” and “Adaptive limit metrics” panels.
- Identify the affected nodes from the fqdn label and the affected RPCs from the grpc_method label.
- Check node saturation (CPU, memory, disk) on the affected node.
- Use Elasticsearch to check whether a single repository or traffic source is driving the load.
- Check whether the adaptive limiter is backing off (see GitalyAdaptiveLimiterBackoff). A backoff usually precedes dropped requests.

Escalation

Gitaly has a Tier 2 rotation. Follow the How to Escalate guidance to page the team.

Several Slack channels are also available: