AI Gateway Service

The AI Gateway is a standalone service that provides access to AI features for all GitLab users, regardless of which instance they are using: self-managed, Dedicated, or GitLab.com.

The AI Gateway was formerly known as Model Gateway and Code Suggestions.

  1. Regional deployment management - The AI Gateway team is responsible for selecting, provisioning, and de-provisioning regional deployments. Selection and provisioning can be self-served via the Runway config (multi-region docs). Currently, deprovisioning must be requested by contacting the Runway team.
  2. Quota Saturation Monitoring and Response - The AI Gateway team is responsible for monitoring saturation warnings and responding to them when they are raised.

See the AI Gateway architecture blueprint at https://docs.gitlab.com/ee/architecture/blueprints/ai_gateway/

For a higher level view of how the AI Gateway fits into our AI Architecture, see https://docs.gitlab.com/ee/development/ai_architecture.html

For context, here is a typical call graph for a Code Suggestions API request from an IDE on an end-user’s laptop. This call graph is current as of 2023-12-15 but may change in the future.

sequenceDiagram
box User laptop
actor User
participant IDE as VSCode
participant LS as LanguageServer process
end
box Cloudflare POP nearest to User
participant CFGL as gitlab.com
end
box GitLab Rails AI-assisted infrastructure
participant WH as Workhorse
participant RB as Rails
end
box Cloudflare POP nearest to Rails
participant CFCS as codesuggestions.gitlab.com
end
box GitLab AI-gateway infrastructure (GCP)
participant GW as AI gateway
end
box Model engine service
participant ML as ML model engine (Vertex or Anthropic)
end
IDE ->>+ LS: Request code completion for cursor context
LS ->>+ CFGL: /api/v4/code_suggestions/completions
CFGL ->>+ WH: /api/v4/code_suggestions/completions
WH ->>+ RB: /api/v4/code_suggestions/completions
RB -->>- WH: composes request for ai-gateway and delegates to workhorse
WH ->>+ CFCS: /v2/code/completions
CFCS ->>+ GW: /v2/code/completions
GW ->>+ ML: model params
ML -->>- GW: model response
GW -->>- CFCS: ai-gateway API response
CFCS -->>- WH: ai-gateway API response
WH -->>- CFGL: workhorse API response
CFGL -->>- LS: workhorse API response
LS -->>- IDE: Code completion suggestion

Notes:

  • Over the last few months, the endpoints and control flow have evolved, sometimes in non-backward-compatible ways.
    • e.g. prior to GitLab 16.3, clients directly accessed the now-deprecated endpoint /v2/completions. Self-managed GitLab deployments still running versions from the Code Suggestions beta may still be calling those now-broken endpoints.
  • The request transits Cloudflare twice: once from the end user to the Rails AI-assisted fleet, and again from Rails (via Workhorse) to the ai-gateway.
    • Typically at least one of those is fairly low latency: only 10 ms RTT between GCP’s us-east1 region and Cloudflare’s ATL POP.
    • Cloudflare tools (logs, analytics, rules, etc.) are available for both of those API calls.
  • Requests to the ai-gateway are expected to be slow, so Rails composes the request headers and then delegates the request to Workhorse, which sends it to the ai-gateway. (Workhorse handles slow requests much more efficiently than Rails; this conserves Puma worker threads.)
  • Caching and reuse of TCP and TLS sessions allows most requests to avoid extra round-trips for connection setup.
  • Currently ai-gateway containers run as a GCP Cloud Run service.
    • See the Cloud Run docs and console.
    • Those containers are not accessible via the tools we use for GKE-based services (kubectl, etc.).
    • The gcloud CLI tool exposes specs for the containers and their revisions (deployments).

Starter gcloud commands:

$ gcloud run services describe --project gitlab-runway-production --region us-east1 --format yaml ai-gateway
$ gcloud run revisions list --project gitlab-runway-production --region us-east1 --service ai-gateway

AI Gateway is deployed through Runway.

For more details, refer to the Runway runbook.

Runway handles the multi-stage deployments of AI Gateway. In certain situations (for example, during incident mitigation or maintenance), it may be beneficial to temporarily disable these continuous deployments. To do this, the pipeline can be configured to require manual intervention before deploying changes.

Temporarily disabling automatic deployments:

  1. Open the AI Gateway project in GitLab: Go to the project’s Settings > CI/CD page.
  2. Add a CI/CD variable: Under Variables, define a new variable named RUNWAY_DEPLOYMENT_ACTION.
  3. Set the value to “manual”: Enter manual as the value for RUNWAY_DEPLOYMENT_ACTION, mark the variable as protected, and save the changes.
  4. Confirm pipeline behavior: With this variable set, any new AI Gateway deployment pipeline will pause before deploying. The deployment job will not proceed to staging or production until manually triggered.

This configuration effectively pauses all continuous deployments. To resume normal automated deployments, remove the RUNWAY_DEPLOYMENT_ACTION variable. Once this variable is removed, Runway will revert to automatically deploying new changes on pipeline runs as usual.
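
If you prefer to automate this rather than use the UI, the same variable can be created and removed through the GitLab project-level CI/CD variables API. The following is a minimal sketch, assuming a token with Maintainer access exported as GITLAB_TOKEN; <PROJECT_ID> is a placeholder for the AI Gateway project’s numeric ID.

# Sketch: pause Runway deployments by creating the variable via the API.
$ curl --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
    "https://gitlab.com/api/v4/projects/<PROJECT_ID>/variables" \
    --form "key=RUNWAY_DEPLOYMENT_ACTION" --form "value=manual" --form "protected=true"

# Sketch: resume automatic deployments by deleting the variable again.
$ curl --request DELETE --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
    "https://gitlab.com/api/v4/projects/<PROJECT_ID>/variables/RUNWAY_DEPLOYMENT_ACTION"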

AI Gateway is currently deployed in the following regions (a command to list the live deployments follows this list):

  1. us-east4
  2. asia-northeast1
  3. asia-northeast3
  4. europe-west2
  5. europe-west3
  6. europe-west9
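
To cross-check this list against what is actually running, the deployed Cloud Run services and their regions can be listed with gcloud; a small sketch:

# Sketch: list Cloud Run services in the Runway production project, including their regions.
$ gcloud run services list --project gitlab-runway-production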

When the decision is made to provision a new region, the following steps should be taken:

  1. Request a quota increase in the new region (for instructions see this section below)
  2. Follow the Runway documentation to set up the new region (a quick verification command follows this list)
  3. Notify the testing team that a new region has been set up so that they can run the necessary tests. Our current contact is Abhinaba Ghosh and the active Slack channel is #ai-gateway-testing.
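
Once the new region is provisioned, a quick sanity check is to describe the service in that region with the same starter command shown earlier (the region name below is a placeholder):

# Sketch: confirm the ai-gateway Cloud Run service is live in the newly provisioned region.
$ gcloud run services describe ai-gateway --project gitlab-runway-production --region <new-region> --format yaml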

The Cloud Run service accounts are managed by Runway and have aiplatform.user role set, granting the service accounts the aiplatform.endpoints.predict permission. Other permissions granted by this role are unused. To set additional roles, update ai-gateway entry in Runway provisioner. This IAM membership is managed via the gl-infra/config-mgmt repository, using Terraform.
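
To verify which members currently hold that role, for example after changing the Runway provisioner entry, the IAM policy can be inspected with gcloud. A sketch, assuming the binding lives in the gitlab-runway-production project:

# Sketch: list members bound to roles/aiplatform.user in the Runway production project.
$ gcloud projects get-iam-policy gitlab-runway-production \
    --flatten="bindings[].members" \
    --filter="bindings.role:roles/aiplatform.user" \
    --format="value(bindings.members)"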

AI Gateway includes the following SLIs/SLOs:

Service degradation could be a result of the following saturation resources:

As the AI Gateway serves multiple features, you can use the above dashboards to determine whether the degradation is related to a specific feature.

For more information about the features covered by the server_code_generations and server_code_completions SLIs, refer to the Code Suggestions Runbook.

AI Gateway autoscales with traffic. To scale manually, update runway.yml as described in the documentation.

It is also possible to directly edit the tunables for the ai-gateway service via the Cloud Run console’s Edit YAML interface. This takes effect faster, but be sure to make the equivalent updates to the runway.yml as described above; otherwise the next deploy will revert your manual changes to the service YAML.
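
For reference, the same scaling bounds can also be adjusted from the CLI instead of the console; the instance counts below are illustrative only, and the same caveat about syncing runway.yml applies:

# Sketch: manually adjust Cloud Run autoscaling bounds for ai-gateway (values are illustrative).
$ gcloud run services update ai-gateway --project gitlab-runway-production --region us-east4 \
    --min-instances 2 --max-instances 100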

AI Gateway uses capacity planning provided by Runway for long-term forecasting of saturation resources. To view forecasts, refer to the Tamland page.

Apart from our quota monitoring in our usual GCP projects, the AI Gateway relies on resources that live in the following projects:

  • gitlab-ai-framework-dev
  • gitlab-ai-framework-stage
  • gitlab-ai-framework-prod

Refer to https://gitlab-com.gitlab.io/gl-infra/tamland/saturation.html?highlight=code_suggestions#other-utilization-and-saturation-forecasting-non-horizontally-scalable-resources for quota usage trends and projections.

Many of our AI features use GCP’s Vertex AI service. Vertex AI consists of various base models that represent logic for different types of AI models (such as code generation, or chat bots). Each model has its own usage quota, which can be viewed in the GCP Console.
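
If you prefer the CLI to the console, current Vertex AI quota metrics can also be listed through the Service Usage API; a sketch using the alpha gcloud surface (availability may vary with your gcloud version):

# Sketch: list aiplatform.googleapis.com quota metrics for the production AI framework project.
$ gcloud alpha services quota list \
    --service=aiplatform.googleapis.com \
    --consumer=projects/gitlab-ai-framework-prod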

To request a quota alteration:

  • Visit the following page in the GCP Console: Quotas by Base Model
  • Select each base model that requires a quota increase or decrease
  • Click ‘EDIT QUOTAS’
  • Input the desired quota limit for each service and submit the request.
  • Existing/previous requests can be viewed here

If you do not have access to the GCP console, please file an access request asking for the Quota Administrator role on the gitlab-ai-framework-prod project.

Fireworks capacity is based on the hardware configuration of our endpoints, which are directly controlled by us either through the deployment dashboard or through the firectl command.

You can view current usage on the usage dashboard.

To increase the hardware available to a given endpoint in a specific region, contact the Fireworks team via the #ext-gitlab-fireworks channel and tag @Shaunak Godbole for visibility.

If you do not have access to the GitLab Fireworks account, please file an access request or reach out to @Allen Cook or @bob for Fireworks access.

Anthropic applies per-model limits to concurrency, requests per minute, input tokens per minute, and output tokens per minute. You can see the current limits for the GitLab account at https://console.anthropic.com/settings/limits.

To request a rate limit increase, contact the Anthropic team via the #ext-anthropic channel and tag @wortschi for visibility.

If you do not have access to the GitLab Anthropic account, please file an access request.

AI Gateway uses both custom metrics scraped from the application and default metrics provided by Runway. Right now, alerts are routed to #g_mlops-alerts in Slack. To route them to a different channel, refer to the documentation.

AI Gateway uses secrets management for the Anthropic API key. To rotate a secret, refer to the documentation.

For troubleshooting deployment pipelines, refer to the Runway runbook.