AI Gateway Service
- Service Overview
- Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22ai-gateway%22%2C%20tier%3D%22sv%22%7D
- Label: gitlab-com/gl-infra/production~"Service::AIGateway"
Logging
Summary
The AI-gateway is a standalone service that gives all GitLab users access to AI features, no matter which instance they are using: self-managed, Dedicated, or GitLab.com.
The AI Gateway was formerly known as Model Gateway and Code Suggestions.
Operational Roles and Responsibilities
- Regional deployment management - The AI-Gateway team is responsible for selecting, provisioning, and de-provisioning regional deployments. Selection and provisioning are self-service via the Runway config (multi-region docs). De-provisioning currently has to be requested by contacting the Runway team.
- Quota Saturation Monitoring and Response - The AI-Gateway team is responsible for monitoring saturation warnings and responding when they are raised.
Architecture
See the AI Gateway architecture blueprint at https://docs.gitlab.com/ee/architecture/blueprints/ai_gateway/
For a higher level view of how the AI Gateway fits into our AI Architecture, see https://docs.gitlab.com/ee/development/ai_architecture.html
Example API call graph
For context, here is a typical call graph for a Code Suggestions API request from an IDE on an end-user's laptop. This call graph is current as of 2023-12-15 but may change in the future.
```mermaid
sequenceDiagram
    box User laptop
        actor User
        participant IDE as VSCode
        participant LS as LanguageServer process
    end
    box Cloudflare POP nearest to User
        participant CFGL as gitlab.com
    end
    box GitLab Rails AI-assisted infrastructure
        participant WH as Workhorse
        participant RB as Rails
    end
    box Cloudflare POP nearest to Rails
        participant CFCS as codesuggestions.gitlab.com
    end
    box GitLab AI-gateway infrastructure (GCP)
        participant GW as AI gateway
    end
    box Model engine service
        participant ML as ML model engine (Vertex or Anthropic)
    end

    IDE ->>+ LS: Request code completion for cursor context
    LS ->>+ CFGL: /api/v4/code_suggestions/completions
    CFGL ->>+ WH: /api/v4/code_suggestions/completions
    WH ->>+ RB: /api/v4/code_suggestions/completions
    RB -->>- WH: composes request for ai-gateway and delegates to workhorse
    WH ->>+ CFCS: /v2/code/completions
    CFCS ->>+ GW: /v2/code/completions
    GW ->>+ ML: model params
    ML -->>- GW: model response
    GW -->>- CFCS: ai-gateway API response
    CFCS -->>- WH: ai-gateway API response
    WH -->>- CFGL: workhorse API response
    CFGL -->>- LS: workhorse API response
    LS -->>- IDE: Code completion suggestion
```
Notes:
- Over the last few months, the endpoints and control flow have evolved, sometimes in non-backward-compatible ways.
  - e.g. Prior to GitLab 16.3, clients directly accessed a now-deprecated request endpoint `/v2/completions`. Some self-managed GitLab deployments running older versions from while Code Suggestions was still in beta release may still be using those now-broken endpoints.
- Transits Cloudflare twice: once from the end-user to Rails AI Assisted, and later from Rails to `ai-gateway`.
  - Typically at least one of those is fairly low latency: only 10 ms RTT between GCP's `us-east1` region and Cloudflare's `ATL` POP.
  - Cloudflare tools (logs, analytics, rules, etc.) are available for both of those API calls.
- The requests to `ai-gateway` are expected to be slow, so Rails composes the request headers and then delegates to Workhorse to send the request to `ai-gateway`. (Workhorse can handle slow requests much more efficiently than Rails; this conserves Puma worker threads.)
- Caching and reuse of TCP and TLS sessions allow most requests to avoid extra round-trips for connection setup.
- Currently `ai-gateway` containers run as a GCP Cloud Run service.
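To trace this path by hand, you can exercise the Rails entry point directly. A minimal sketch, assuming a personal access token with the `api` scope and the request schema used by recent GitLab versions; the payload fields are illustrative, so check the current API docs before relying on them:

```shell
# Hypothetical smoke test of the Rails-side Code Suggestions endpoint
curl --request POST "https://gitlab.com/api/v4/code_suggestions/completions" \
  --header "Authorization: Bearer $GITLAB_PAT" \
  --header "Content-Type: application/json" \
  --data '{
    "current_file": {
      "file_name": "app.py",
      "content_above_cursor": "def hello_world():\n    ",
      "content_below_cursor": ""
    }
  }'
```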
Starter `gcloud` commands:

```shell
gcloud run services describe --project gitlab-runway-production --region us-east1 --format yaml ai-gateway
gcloud run revisions list --project gitlab-runway-production --region us-east1 --service ai-gateway
```
Deployment
AI Gateway is deployed through Runway. For more details, refer to the Runway runbook.
Pausing AI Gateway Deployments
Runway handles the multi-stage deployments of AI Gateway. In certain situations (for example, during incident mitigation or maintenance), it may be beneficial to temporarily disable these continuous deployments. To do this, the pipeline can be configured to require manual intervention before deploying changes.
Temporarily disabling automatic deployments:
- Open the AI Gateway project in GitLab: go to the project's Settings > CI/CD page.
- Add a CI/CD variable: under Variables, define a new variable named `RUNWAY_DEPLOYMENT_ACTION`.
- Set the value to `manual`: enter `manual` as the value for `RUNWAY_DEPLOYMENT_ACTION`, mark the variable as protected, and save the changes.
- Confirm pipeline behavior: with this variable set, any new AI Gateway deployment pipeline will pause before deploying. The deployment job will not proceed to staging or production until manually triggered.
This configuration effectively pauses all continuous deployments. To resume normal automated deployments, remove the `RUNWAY_DEPLOYMENT_ACTION` variable. Once this variable is removed, Runway will revert to automatically deploying new changes on pipeline runs as usual.
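If the UI is unavailable, the same toggle can be flipped through GitLab's project-level CI/CD variables API. A minimal sketch, assuming a token with the `api` scope; `<project-id>` is a placeholder for the AI Gateway project's numeric ID:

```shell
# Pause Runway deployments by creating the protected variable
curl --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/<project-id>/variables" \
  --form "key=RUNWAY_DEPLOYMENT_ACTION" \
  --form "value=manual" \
  --form "protected=true"

# Resume automatic deployments by deleting it again
curl --request DELETE --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/<project-id>/variables/RUNWAY_DEPLOYMENT_ACTION"
```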
Environments
Regions
AI Gateway is currently deployed in the following regions:
- us-east4
- asia-northeast1
- asia-northeast3
- europe-west2
- europe-west3
- europe-west9
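To check which regions the service is actually serving from, you can list the regional Cloud Run deployments. A sketch, assuming the `gitlab-runway-production` project used in the `gcloud` examples above:

```shell
# List every regional ai-gateway Cloud Run service in the Runway production project
gcloud run services list --project gitlab-runway-production \
  --filter "metadata.name=ai-gateway"
```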
When the decision is made to provision a new region, the following steps should be taken:
- Request a quota increase in the new region (for instructions see this section below)
- Follow the Runway documentation to set up the new region
- Notify the testing team that a new region has been set up so that they can run the necessary tests. Our current contact is Abhinaba Ghosh and the active Slack channel is #ai-gateway-testing.
Services and Accounts
Section titled “Services and Accounts”The Cloud Run service accounts are managed by Runway and have aiplatform.user
role set, granting the service accounts the aiplatform.endpoints.predict
permission. Other permissions granted by this role are unused. To set additional roles, update ai-gateway
entry in Runway provisioner.
This IAM membership is managed via the gl-infra/config-mgmt
repository, using Terraform.
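To confirm which identity a regional deployment runs as, and which roles that identity holds, something like the following works (a sketch; the service-account email is a placeholder you fill in from the first command's output):

```shell
# Find the service account the Cloud Run service runs as
gcloud run services describe ai-gateway --project gitlab-runway-production \
  --region us-east1 --format "value(spec.template.spec.serviceAccountName)"

# List the roles granted to that account on the project
gcloud projects get-iam-policy gitlab-runway-production \
  --flatten "bindings[].members" \
  --filter "bindings.members:serviceAccount:<sa-email>" \
  --format "table(bindings.role)"
```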
Performance
AI Gateway includes the following SLIs/SLOs:
Service degradation can be the result of saturation of the following resources:
- Memory Utilization
- CPU Utilization
- Instance Utilization
- Concurrency Utilization
- Vertex AI API Quota Limit
Determining Affected Components
As the AI Gateway serves multiple features, you can use the above dashboards to determine whether the degradation is related to a specific feature.
For more information about the features covered by the `server_code_generations` and `server_code_completions` SLIs, refer to the Code Suggestions Runbook.
Scalability
AI Gateway autoscales with traffic. To scale manually, update `runway.yml` based on the documentation.
It is also possible to directly edit the tunables for the `ai-gateway` service via the Cloud Run console's Edit YAML interface. This takes effect faster, but be sure to make the equivalent updates to `runway.yml` as described above; otherwise the next deploy will revert your manual changes to the service YAML.
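The same tunables can also be adjusted from the command line rather than the console. A sketch, assuming you want to change the autoscaling bounds in `us-east1` (the instance counts are placeholders); the same caveat about mirroring the change into `runway.yml` applies:

```shell
# Manually override autoscaling bounds on the running service
gcloud run services update ai-gateway \
  --project gitlab-runway-production \
  --region us-east1 \
  --min-instances 2 \
  --max-instances 50
```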
Capacity Planning
AI Gateway uses capacity planning provided by Runway for long-term forecasting of saturation resources. To view forecasts, refer to the Tamland page.
GCP Quotas
Apart from our quota monitoring in our usual GCP projects, the AI Gateway relies on resources that live in the following projects:
- `gitlab-ai-framework-dev`
- `gitlab-ai-framework-stage`
- `gitlab-ai-framework-prod`
Refer to https://gitlab-com.gitlab.io/gl-infra/tamland/saturation.html?highlight=code_suggestions#other-utilization-and-saturation-forecasting-non-horizontally-scalable-resources for quota usage trends and projections.
Many of our AI features use GCP's Vertex AI service. Vertex AI consists of various base models that implement different types of AI functionality (such as code generation or chat). Each model has its own usage quota, which can be viewed in the GCP Console.
To request a quota alteration:
- Visit the following page in the GCP Console: Quotas by Base Model
- Select each base model that requires a quota increase or decrease
- Click ‘EDIT QUOTAS’
- Input the desired quota limit for each service and submit the request.
- Existing/previous requests can be viewed here
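Quota usage can also be checked without the console. A sketch using the alpha Service Usage surface of `gcloud` (availability of this alpha command group is an assumption; verify before relying on it):

```shell
# List Vertex AI quota metrics for the production AI framework project
gcloud alpha services quota list \
  --service aiplatform.googleapis.com \
  --consumer projects/gitlab-ai-framework-prod
```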
If you do not have access to the GCP console, please file an access request asking for the `Quota Administrator` role on the `gitlab-ai-framework-prod` project.
Fireworks Capacity / Usage
Fireworks capacity is based on the hardware configuration of our endpoints, which we control directly, either through the deployment dashboard or through the `firectl` command.
You can view current usage at the usage dashboard.
To increase the hardware available to a given endpoint in a specific region, contact the Fireworks team via the `#ext-gitlab-fireworks` channel and tag `@Shaunak Godbole` for visibility.
If you do not have access to the GitLab Fireworks account, please file an access request or reach out to `@Allen Cook` or `@bob` for Fireworks access.
Anthropic Rate Limits
Anthropic applies per-model limits to concurrency, requests per minute, input tokens per minute, and output tokens per minute. You can see the current limits set for the GitLab account at https://console.anthropic.com/settings/limits.
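The response headers of any API call also report the remaining budget, which is handy during an incident. A minimal sketch, assuming a valid key in `ANTHROPIC_API_KEY` (the model name is a placeholder):

```shell
# Print the anthropic-ratelimit-* response headers from a one-token request
curl -sS -D - -o /dev/null "https://api.anthropic.com/v1/messages" \
  --header "x-api-key: $ANTHROPIC_API_KEY" \
  --header "anthropic-version: 2023-06-01" \
  --header "content-type: application/json" \
  --data '{"model": "claude-3-5-haiku-latest", "max_tokens": 1,
           "messages": [{"role": "user", "content": "ping"}]}' \
  | grep -i "anthropic-ratelimit"
```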
To request a rate limit increase, contact the Anthropic team via the `#ext-anthropic` channel and tag `@wortschi` for visibility.
If you do not have access to the GitLab Anthropic account, please file an access request.
Monitoring/Alerting
AI Gateway uses both custom metrics scraped from the application and default metrics provided by Runway. Right now, alerts are routed to `#g_mlops-alerts` in Slack. To route to a different channel, refer to the documentation.
- AiGatewayServiceRunwayIngressTrafficCessationRegional alert playbook
- AI Gateway Service Overview Dashboard
- AI Gateway Logs
- AI Gateway Alerts
- AI Gateway Logs Overview Dashboard
- AI Gateway Logs Errors Dashboard
- Runway Service Metrics
- Runway Service Logs
Troubleshooting
How do I rotate `ANTHROPIC_API_KEY`?
AI Gateway uses secrets management for the Anthropic API key. To rotate a secret, refer to the documentation.
For troubleshooting deployment pipelines, refer to the Runway runbook.