AiGatewayServiceRunwayIngressTrafficCessationRegional

Overview

This alert is designed to detect an abnormal cessation of traffic to the runway_ingress component of the ai-gateway service in any US region. The conditions ensure that the service was receiving a non-zero amount of traffic one hour ago and has since dropped to zero traffic over the last 30 minutes.
This could be caused by a change to the metrics used in the SLI, or by the service not receiving traffic.
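The firing condition described above can be sketched as a small predicate. This is only an illustration of the logic, not the actual alerting rule: the function name `traffic_ceased`, its parameters, and the handling of missing samples are assumptions made for this example.

```python
from typing import Optional


def traffic_ceased(
    rate_one_hour_ago: Optional[float],
    rate_last_30m: Optional[float],
) -> bool:
    """Return True when traffic was non-zero an hour ago and is zero now.

    Both arguments are request rates in requests per second.
    Assumption: a missing sample (None) never triggers the condition,
    because missing data is a different failure mode from zero traffic.
    """
    if rate_one_hour_ago is None or rate_last_30m is None:
        return False
    return rate_one_hour_ago > 0 and rate_last_30m == 0


if __name__ == "__main__":
    # Region that had traffic an hour ago and has none now: condition holds.
    print(traffic_ceased(12.5, 0.0))
    # Region that never had traffic: condition does not hold.
    print(traffic_ceased(0.0, 0.0))
```

Requiring non-zero traffic one hour ago is what keeps regions that never receive traffic from matching the condition.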

Services

AI-Gateway service overview
Team that owns the service: AI Framework

Metrics

In the alert AiGatewayServiceRunwayIngressTrafficCessationRegional, the metric used is gitlab_regional_sli_ops:rate_30m. This metric measures the rate of HTTP requests to the runway_ingress component of the ai-gateway service. It is measured in average number of HTTP requests per second over the last 5 minutes, averaged over 30 minutes. Link to metric catalogue
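To make the "averaged over 30 minutes" wording concrete, here is a minimal sketch of how a 30-minute average could be derived from 5-minute request rates. It assumes one sample every 5 minutes (so six samples span 30 minutes). The helper name `rate_30m_from_samples` is hypothetical, and the real recording rule may be computed differently.

```python
from statistics import fmean


def rate_30m_from_samples(five_minute_rates: list[float]) -> float:
    """Average the most recent six 5-minute rates (requests per second).

    Assumption: samples are ordered oldest to newest and spaced 5 minutes
    apart, so the last six cover the most recent 30 minutes.
    """
    window = five_minute_rates[-6:]
    if not window:
        raise ValueError("at least one sample is required")
    return fmean(window)


if __name__ == "__main__":
    # Traffic stopped 15 minutes ago: the 30-minute average is still non-zero.
    print(rate_30m_from_samples([12.0, 12.0, 12.0, 0.0, 0.0, 0.0]))
    # Traffic has been absent for the full 30 minutes: the average is zero.
    print(rate_30m_from_samples([12.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

Under these assumptions the average only reaches zero once every sample in the window is zero, which is why a 30-minute rate responds to a full cessation with some delay.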

Alert Behavior

To silence the alert, visit the Alertmanager dashboard.
Until recently this was a high-volume alert, mainly because zero traffic in non-US regions is not unusual. Now that only US regions are tracked, the alert is expected to fire rarely.
Historical trends of the alert firing here

Severities

This alert might create S3 incidents.
Some GitLab.com users might be impacted.
Review Incident Severity Handbook page to identify the required Severity Level

Verification

Recent changes

Troubleshooting

The first step is to verify whether this is a false alarm. If it is not, the cessation could be caused by the service not receiving traffic due to saturation of

It can also help to look for recent changes made to the service or for ongoing issues. A quick look at the dashboard can show whether a recent deployment caused the cessation.

If a recent deployment or change caused this issue, consider rolling back: either revert the MR or re-run the previous deployment job.

AI Gateway uses the capacity planning provided by Runway for long-term forecasting of saturation resources. To view forecasts, refer to the Tamland page.

Possible Resolutions

Consider rolling back to a previous working version of the AI Gateway

Dependencies

Anthropic API
GCP Vertex
If the outage is due to a Google Cloud issue, open a support ticket via the web console.
If the outage is due to an Anthropic issue, reach out in #ext-anthropic on Slack.
For investigation and resolution assistance, reach out in #g_ai_framework on Slack.
Review the alert here
Tune the alert here
Update the template used to format this playbook
Related alerts
AI Gateway Runbook docs