AiGatewayServiceRunwayIngressTrafficCessationRegional
Overview
- This alert detects an abnormal cessation of traffic to the runway_ingress component of the ai-gateway service in any US region. It fires when the component was receiving non-zero traffic one hour ago and has received zero traffic over the last 30 minutes.
- This could be caused by a change to the metrics used in the SLI, or by the service genuinely not receiving traffic.
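The firing condition described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual alert rule: the function names, the `us-` region prefix, and the choice to treat a missing region as zero traffic are all assumptions made for the example.

```python
from typing import Dict, List


def traffic_ceased(rate_one_hour_ago: float, rate_now: float) -> bool:
    """Return True when traffic was non-zero an hour ago and is zero now."""
    return rate_one_hour_ago > 0 and rate_now == 0


def regions_with_cessation(
    rates_one_hour_ago: Dict[str, float],
    rates_now: Dict[str, float],
) -> List[str]:
    """List US regions whose request rate dropped from non-zero to zero.

    Assumption: a region absent from rates_now is treated as zero traffic.
    """
    ceased = []
    for region, past_rate in sorted(rates_one_hour_ago.items()):
        # Assumption: US regions are identified by a "us-" name prefix.
        if not region.startswith("us-"):
            continue
        if traffic_ceased(past_rate, rates_now.get(region, 0.0)):
            ceased.append(region)
    return ceased


if __name__ == "__main__":
    before = {"us-east1": 3.0, "europe-west1": 2.0, "us-central1": 1.0}
    now = {"us-east1": 0.0, "europe-west1": 0.0, "us-central1": 0.5}
    print(regions_with_cessation(before, now))  # ['us-east1']
```

Requiring non-zero traffic an hour earlier is what keeps a region that never had traffic from triggering the alert.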
Services
- AI-Gateway service overview
- Team that owns the service: AI Framework
Metrics
- The alert AiGatewayServiceRunwayIngressTrafficCessationRegional uses the metric gitlab_regional_sli_ops:rate_30m, which measures the rate of HTTP requests to the runway_ingress component of the ai-gateway service. It is expressed as the average number of HTTP requests per second over the last 5 minutes, averaged over 30 minutes. Link to metric catalogue
Alert Behavior
- To silence the alert, visit the Alert Manager Dashboard
- Until recently this was a high-volume alert, mainly because zero traffic in non-US regions is not unusual. Now that only US regions are tracked, the alert is expected to be rare
- Historical trends of the alert firing here
Severities
- This alert might create S3 incidents.
- There might be some impact on gitlab.com users
- Review Incident Severity Handbook page to identify the required Severity Level
Verification
- Prometheus link to query that triggered the alert
- AI Gateway Service Overview Dashboard
- MLOps logging
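When the linked query is not at hand, the same check can be made by hand against the Prometheus HTTP API. The sketch below only builds the request URL for the instant-query endpoint (`/api/v1/query`) and makes no network call. The `type` and `component` label names and the host `prometheus.example.com` are assumptions for illustration; take the real selector from the alert definition.

```python
from urllib.parse import urlencode


def traffic_rate_query(offset: str = "") -> str:
    """Build a PromQL selector for the traffic-rate metric.

    Assumption: the series carries `type` and `component` labels with
    these values; verify against the actual alert rule.
    """
    selector = (
        "gitlab_regional_sli_ops:rate_30m"
        '{type="ai-gateway", component="runway_ingress"}'
    )
    # The PromQL `offset` modifier shifts the evaluation time into the past.
    return f"{selector} offset {offset}" if offset else selector


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    base = "https://prometheus.example.com"  # hypothetical host
    print(build_instant_query_url(base, traffic_rate_query()))      # rate now
    print(build_instant_query_url(base, traffic_rate_query("1h")))  # rate one hour ago
```

Comparing the two results per region shows whether the drop to zero is real or whether the series has stopped being recorded.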
Recent changes
- Recent AI-gateway Production Change/Incident Issues
- Recent chef-repo Changes
- Recent k8s-workloads Changes
Troubleshooting
- First, verify whether it is a false alarm. If it is not, the cessation could be caused by the service not receiving traffic due to saturation of
It might also help to look for recent changes made to the service or ongoing issues; a quick look at the dashboard can show whether a recent deployment caused it.
If a recent deployment or change caused this issue, consider rolling back: revert the MR or re-run the previous deployment job.
- AI Gateway uses capacity planning provided by Runway for long-term forecasting of saturation resources. To view forecasts, refer to Tamland page.
Possible Resolutions
- Consider rolling back to a previous working version of the AI Gateway
Dependencies
- If the outage is due to a Google Cloud issue, you will need to open a support ticket via the web console
- If the outage is due to an Anthropic issue, reach out to #ext-anthropic on Slack
- For investigation and resolution assistance, reach out to #g_ai_framework on Slack