
AI Gateway Service

The AI Gateway is a standalone service that gives all users of GitLab access to AI features, regardless of which instance they are using: self-managed, Dedicated, or GitLab.com.

The AI Gateway was formerly known as Model Gateway and Code Suggestions.

  1. Regional deployment management - The AI Gateway team is responsible for selecting, provisioning, and deprovisioning regional deployments. Selection and provisioning can be self-served via the Runway config (multi region docs). Currently, deprovisioning should be requested by contacting the Runway team.
  2. Quota Saturation Monitoring and Response - The AI Gateway team is responsible for monitoring saturation warnings and responding when they are raised.

See the AI Gateway architecture blueprint at https://docs.gitlab.com/ee/architecture/blueprints/ai_gateway/

For a higher level view of how the AI Gateway fits into our AI Architecture, see https://docs.gitlab.com/ee/development/ai_architecture.html

For context, here is a typical call graph for a Code Suggestions API request from an IDE on an end-user’s laptop. This call graph is current as of 2023-12-15 but may change in the future.

sequenceDiagram
    box User laptop
    actor User
    participant IDE as VSCode
    participant LS as LanguageServer process
    end
    box Cloudflare POP nearest to User
    participant CFGL as gitlab.com
    end
    box GitLab Rails AI-assisted infrastructure
    participant WH as Workhorse
    participant RB as Rails
    end
    box Cloudflare POP nearest to Rails
    participant CFCS as codesuggestions.gitlab.com
    end
    box GitLab AI-gateway infrastructure (GCP)
    participant GW as AI gateway
    end
    box Model engine service
    participant ML as ML model engine (Vertex or Anthropic)
    end

    IDE ->>+ LS: Request code completion for cursor context
    LS ->>+ CFGL: /api/v4/code_suggestions/completions
    CFGL ->>+ WH: /api/v4/code_suggestions/completions

    WH ->>+ RB: /api/v4/code_suggestions/completions
    RB -->>- WH: composes request for ai-gateway and delegates to workhorse

    WH ->>+ CFCS: /v2/code/completions
    CFCS ->>+ GW: /v2/code/completions
    GW ->>+ ML: model params

    ML -->>- GW: model response
    GW -->>- CFCS: ai-gateway API response
    CFCS -->>- WH: ai-gateway API response
    WH -->>- CFGL: workhorse API response
    CFGL -->>- LS: workhorse API response
    LS -->>- IDE: Code completion suggestion

Notes:

  • Over the last few months, the endpoints and control flow have evolved, sometimes in non-backward-compatible ways.
    • e.g., prior to GitLab 16.3, clients directly accessed the now-deprecated request endpoint /v2/completions. Some self-managed GitLab deployments running older versions from when Code Suggestions was still in beta may still be using those now-broken endpoints.
  • The request transits Cloudflare twice: once from the end user to the Rails AI-assisted infrastructure, and again from Rails to the ai-gateway.
    • Typically at least one of those is fairly low latency: only 10 ms RTT between GCP’s us-east1 region and Cloudflare’s ATL POP.
    • Cloudflare tools (logs, analytics, rules, etc.) are available for both of those API calls.
  • Requests to the ai-gateway are expected to be slow, so Rails composes the request headers and then delegates to Workhorse, which sends the request to the ai-gateway. (Workhorse can handle slow requests much more efficiently than Rails; this conserves Puma worker threads.)
  • Caching and reuse of TCP and TLS sessions allows most requests to avoid extra round-trips for connection setup.
  • Currently ai-gateway containers run as a GCP Cloud Run service.
    • See the Cloud Run docs and console.
    • Those containers are not accessible via the tools we use for GKE-based services (kubectl, etc.).
    • The gcloud CLI tool exposes specs for the containers and their revisions (deployments).

Starter gcloud commands:

$ gcloud run services describe --project gitlab-runway-production --region us-east1 --format yaml ai-gateway
$ gcloud run revisions list --project gitlab-runway-production --region us-east1 --service ai-gateway
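
Because the containers are not reachable with kubectl, logs can instead be read through Cloud Logging. A minimal sketch using gcloud (the project and limit mirror the commands above and are illustrative; adjust as needed):

$ gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="ai-gateway"' --project gitlab-runway-production --limit 20 --format json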

AI Gateway is deployed through Runway. For more details, refer to the Runway runbook.

Runway handles the multi-stage deployments of AI Gateway. In certain situations (for example, during incident mitigation or maintenance), it may be beneficial to temporarily disable these continuous deployments. To do this, the pipeline can be configured to require manual intervention before deploying changes.

For incident response or critical changes, the AI Gateway CI process can be expedited through MR pipeline configurations.

To fully expedite an AI Gateway deployment, you must complete both of the following steps:

  1. Pre-merge Acceleration:

    • Apply the pipeline::expedited label to your MR
    • This skips non-essential CI jobs, reducing pipeline time from ~30 minutes to ~3-4 minutes
  2. Post-merge Acceleration:

    • When merging the MR to main, include pipeline_expedited: true in the commit body (not the subject line)
    • You can use any conventional commit format for the subject line (e.g., feat: add new feature)
    • This will trigger an expedited deployment pipeline on the main branch that only runs the stages required for Runway deployments

Important: Both steps are required for complete acceleration. MR labels don’t carry over to the main branch pipeline, which is why the commit body configuration is necessary for the second half of the process.
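
As a sketch of the post-merge step, the pipeline_expedited: true line can be added as a separate paragraph of the commit message. For example, when committing locally before merge (the subject line here is illustrative):

$ git commit -m "feat: add new feature" -m "pipeline_expedited: true"

If the MR is merged through the GitLab UI instead, edit the merge/squash commit message and place pipeline_expedited: true on its own line below the subject.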

Temporarily disabling automatic deployments:

  1. Open the AI Gateway project in GitLab: go to the project's Settings > CI/CD page.
  2. Add a CI/CD variable: Under Variables, define a new variable named RUNWAY_DEPLOYMENT_ACTION.
  3. Set the value to "manual": enter manual as the value for RUNWAY_DEPLOYMENT_ACTION, mark the variable as protected, and save the changes.
  4. Confirm pipeline behavior: With this variable set, any new AI Gateway deployment pipeline will pause before deploying. The deployment job will not proceed to staging or production until manually triggered.

This configuration effectively pauses all continuous deployments. To resume normal automated deployments, remove the RUNWAY_DEPLOYMENT_ACTION variable. Once this variable is removed, Runway will revert to automatically deploying new changes on pipeline runs as usual.
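
If you prefer to script this rather than use the UI, the same variable can be created and removed through the GitLab CI/CD variables API. A minimal sketch, where <project-id> is a placeholder for the AI Gateway project ID (or URL-encoded path) and $GITLAB_TOKEN is a token with Maintainer access:

$ curl --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" --form "key=RUNWAY_DEPLOYMENT_ACTION" --form "value=manual" --form "protected=true" "https://gitlab.com/api/v4/projects/<project-id>/variables"
$ curl --request DELETE --header "PRIVATE-TOKEN: $GITLAB_TOKEN" "https://gitlab.com/api/v4/projects/<project-id>/variables/RUNWAY_DEPLOYMENT_ACTION"

The first command pauses deployments; the second resumes them by removing the variable.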

AI Gateway is currently deployed in the following regions:

  1. us-east4
  2. asia-northeast1
  3. asia-northeast3
  4. europe-west2
  5. europe-west3
  6. europe-west9
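
To confirm which regions the service is actually running in, you can list the Cloud Run services in the Runway project. A minimal sketch (the output includes the region for each service):

$ gcloud run services list --project gitlab-runway-production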

When the decision is made to provision a new region, the following steps should be taken:

  1. Request a quota increase in the new region (for instructions see this section below)
  2. Follow the Runway documentation to set up the new region
  3. Notify the testing team that a new region has been set up so that they can run the necessary tests. Our current contact is Abhinaba Ghosh and the active Slack channel is #ai-gateway-testing.

The Cloud Run service accounts are managed by Runway and have the aiplatform.user role set, granting the service accounts the aiplatform.endpoints.predict permission. Other permissions granted by this role are unused. To set additional roles, update the ai-gateway entry in the Runway provisioner. This IAM membership is managed via the gl-infra/config-mgmt repository, using Terraform.
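
To verify the binding from the command line, you can inspect the project IAM policy. A minimal sketch, assuming the production Runway project (the exact service account members vary per environment):

$ gcloud projects get-iam-policy gitlab-runway-production --flatten="bindings[].members" --filter="bindings.role:roles/aiplatform.user" --format="table(bindings.role,bindings.members)"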

AI Gateway includes the following SLIs/SLOs:

Service degradation could be the result of the following saturation resources:

As the AI Gateway serves multiple features, you can use the above dashboards to determine if the degradation is related to a specific feature.

For more information about the features covered by the server_code_generations and server_code_completions SLIs, refer to the Code Suggestions Runbook.

AI Gateway autoscales with traffic. To scale manually, update runway.yml as described in the documentation.

It is also possible to directly edit the tunables for the ai-gateway service via the Cloud Run console’s Edit YAML interface. This takes effect faster, but be sure to make the equivalent updates to the runway.yml as described above; otherwise the next deploy will revert your manual changes to the service YAML.
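
As an illustrative sketch of that kind of manual change via the CLI instead of the console (the instance counts and region below are placeholders, and such changes will likewise be reverted by the next Runway deploy unless runway.yml is updated):

$ gcloud run services update ai-gateway --project gitlab-runway-production --region us-east4 --min-instances 2 --max-instances 16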

AI Gateway uses capacity planning provided by Runway for long-term forecasting of saturation resources. To view forecasts, refer to Tamland page.

Apart from our quota monitoring in our usual GCP projects, the AI Gateway relies on resources that live in the following projects:

  • gitlab-ai-framework-dev
  • gitlab-ai-framework-stage
  • gitlab-ai-framework-prod

Refer to https://gitlab-com.gitlab.io/gl-infra/tamland/saturation.html?highlight=code_suggestions#other-utilization-and-saturation-forecasting-non-horizontally-scalable-resources for quota usage trends and projections.

Many of our AI features use GCP’s Vertex AI service. Vertex AI consists of various base models that represent logic for different types of AI models (such as code generation, or chat bots). Each model has its own usage quota, which can be viewed in the GCP Console.

To request a quota alteration:

  • Visit the following page in the GCP Console: Quotas by Base Model
  • Select each base model that requires a quota decrease/increase
  • Click ‘EDIT QUOTAS’
  • Input the desired quota limit for each service and submit the request.
  • Existing/previous requests can be viewed here

If you do not have access to the GCP console, please file an access request asking for the Quota Administrator role on the gitlab-ai-framework-prod project.

Fireworks capacity is based on the hardware configuration of our endpoints, which are directly controlled by us either through the deployment dashboard or through the firectl command.

You can view current usage at the usage dashboard.

To increase the hardware available to a given endpoint in a specific region, contact the Fireworks team via the #ext-gitlab-fireworks channel and tag @Shaunak Godbole for visibility.

If you do not have access to the GitLab Fireworks account, please file an access request or reach out to @Allen Cook or @bob for Fireworks access.

Anthropic applies per-model limits to concurrency, requests per minute, input tokens per minute, and output tokens per minute. You can see the current limits set for the GitLab account in https://console.anthropic.com/settings/limits.

To request a rate limit increase, contact the Anthropic team via the #ext-anthropic channel and tag @wortschi for visibility.

If you do not have access to the GitLab Anthropic account, please file an access request.

AI Gateway uses both custom metrics scraped from the application and default metrics provided by Runway. Right now, alerts are routed to #g_mlops-alerts in Slack. To route alerts to a different channel, refer to the documentation.

AI Gateway uses secrets management for the Anthropic API key. To rotate a secret, refer to the documentation.

For troubleshooting deployment pipelines, refer to Runway runbook.