
GitalyServiceGoserverTrafficCessationSingleNode

  • The GitalyServiceGoserverTrafficCessationSingleNode alert is based on an SLI that monitors Gitaly gRPC requests in aggregate, excluding the OperationService. gRPC failures that are considered to be the “server’s fault” are counted as errors. The apdex score is based on a subset of gRPC methods that are expected to be fast.

This alert signifies that the SLI is reporting a cessation of traffic to the goserver component of the gitaly service on a single node: the signal is present, but its value is zero. Since the service is not fully redundant, an SLI violation on a single node may represent a user-impacting service degradation.

The following conditions must be met to trigger this alert:

  • `gitlab_component_node_ops:rate_30m{component="goserver",env="[env]",monitor="global",type="gitaly"} == 0`: checks whether the rate of operations for goserver in the gitaly service is zero over the past 30 minutes. This condition confirms that we are not currently seeing any traffic.

  • `gitlab_component_node_ops:rate_30m{component="goserver",env="[env]",monitor="global",type="gitaly"} offset 1h >= 0.16666666666666666`: checks whether the rate of operations for goserver in the gitaly service one hour ago was greater than or equal to approximately 0.167 (1/6) requests per second. This condition confirms that we saw some traffic an hour ago.
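
Taken together, the two conditions amount to an expression along these lines. This is a minimal sketch; the authoritative definition lives in the alert's metrics catalog entry, and `[env]` is a placeholder for the environment label value.

```promql
# Sketch of the combined trigger: the node reports no goserver traffic now,
# but was serving at least ~1/6 requests per second an hour ago.
  gitlab_component_node_ops:rate_30m{component="goserver",env="[env]",monitor="global",type="gitaly"} == 0
and
  gitlab_component_node_ops:rate_30m{component="goserver",env="[env]",monitor="global",type="gitaly"} offset 1h >= 0.16666666666666666
```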

  • A stuck process on a Gitaly node may cause this issue.

  • The Gitaly node might not be able to serve traffic.

  • The recipient needs to determine the impact of the service outage and validate whether the Gitaly node is serving any traffic at all. If the root cause appears to be linked to Gitaly, it is suggested to reach out to the Gitaly team so that they can help with the investigation.

  • To figure out the impact, it is important to note that Gitaly does not replicate any data. If a Gitaly server goes down, its clients cannot read or write to the repositories stored on that server.

  • The alert is based on the metric `gitaly_service_client_requests_total`, which tracks the total number of gRPC requests made to the Gitaly service. Specifically, it monitors the rate of these requests over a specified time window, excluding the OperationService. The alert calculates the rate of requests over a 5-minute window to determine whether there has been a cessation of traffic. Link to Metrics Catalog (a spot-check query is sketched after this list).
  • Check the log file for new log entries (e.g. `tail -f /var/opt/gitlab/gitaly/current`); if new entries are appearing, the alert is a false positive.
  • Dashboard when the alert is firing
  • This alert might create S2 incidents.
  • Some gitlab.com users might be impacted; to figure out the number of repositories that cannot be accessed, a query like this gives a good estimate.
  • Review the Incident Severity Handbook page to identify the required severity level.
  • Check the log file for new log entries (e.g. `tail -f /var/opt/gitlab/gitaly/current`); if new entries are appearing, the alert is a false positive.
  • Check whether more than one Gitaly instance is running (e.g. `ps faux | grep "gitaly serve"`). Multiple Gitaly instances could indicate that Prometheus is scraping metrics from an old process that is exiting and no longer serving requests, hence the alert.
  • Is a Gitaly server running at all? Check the logs for misconfiguration or node-specific errors (bad permissions, insufficient memory or disk space, etc.); a sketch of these node-level checks appears after this list.
  • Has the node been removed from the Rails config and is thus no longer receiving traffic from Rails? This is a rare situation, and it would be obvious from other alerts (usually 500s), as accessing repositories would simply fail.
  • Has the node been recently created? If so, the creator likely forgot to add an alert silence.
  • If the Gitaly nodes are unreachable, for example in an incident like this, a solution might be to increase the number of Ansible forks.
  • Internal dependencies, such as migrations, a stuck process on a Gitaly node, or an insufficient number of Ansible forks, may cause this alert.
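
To spot-check the source metric from the Metrics Catalog item above, a rate query can be run in Prometheus. This is a minimal sketch: the grouping label (`fqdn`) is an assumption, and the canonical label set and aggregation are defined in the metrics catalog.

```promql
# Underlying request rate over a 5-minute window, grouped per node.
# The fqdn grouping label is an assumption; adjust to the labels the SLI uses.
sum by (fqdn) (rate(gitaly_service_client_requests_total[5m]))
```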
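
For the node-level checks above (logs, duplicate processes, misconfiguration, disk and memory), the following is a minimal sketch of commands to run on the affected node, assuming an Omnibus-managed installation as the log path above suggests:

```shell
# Is the Gitaly service up, and is exactly one "gitaly serve" process running?
gitlab-ctl status gitaly
ps faux | grep 'gitaly serve'

# Are new log entries still appearing? If so, the alert is likely a false positive.
tail -f /var/opt/gitlab/gitaly/current

# Node-specific problems: disk space and memory pressure.
df -h /var/opt/gitlab
free -m
```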

  • External dependencies, such as a regional outage, may cause this alert.

  • Please use /devoncall <incident_url> on Slack for any escalation that meets the criteria.

  • A PagerDuty escalation policy for Gitaly incidents will be available soon; view it here.

For escalation, contact the following channels:

Alternative Slack channels: