
Code Suggestions

GitLab Duo Code Suggestions provides two distinct AI-powered coding assistance functions:

  • Code Completion:
    • Powered by Vertex AI-hosted Codestral (GitLab 17.11 and earlier) and Fireworks AI-hosted Codestral (GitLab 18.0 and later)
    • Response time: Satisfied < 1s, Tolerated < 10s
    • Activated automatically while typing
  • Code Generation:
    • Powered by Anthropic Claude 3.7 Sonnet
    • Response time: Satisfied < 5s, Tolerated < 30s (complex generations can exceed 5 seconds)
    • Triggered by natural language comments followed by Enter, or by empty functions (see the example below)
    • Supports streaming in JetBrains and Visual Studio IDEs
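
As an illustration of the comment-driven trigger: a descriptive comment followed by Enter (or an empty function body) prompts Code Generation to propose an implementation. The snippet below is illustrative only; the function body shown is one plausible completion, not guaranteed model output.

```python
# Write a function that checks whether a string is a palindrome,
# ignoring case and non-alphanumeric characters.
def is_palindrome(s: str) -> bool:
    # One plausible generated implementation:
    cleaned = "".join(ch.lower() for ch in s if ch.isalnum())
    return cleaned == cleaned[::-1]
```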

Available in:

  • VS Code (GitLab Workflow extension v6.2.2+)
  • JetBrains IDEs (GitLab extension v3.6.5+)
  • Visual Studio (GitLab extension v0.51.0+)
  • Neovim (GitLab plugin v1.1.0+)
  • GitLab Web IDE

Primary Connection Method:

  • Direct to AI Gateway: Almost all users (including most GitLab Self-Managed customers) connect directly from their IDE to the AI Gateway at cloud.gitlab.com

Alternative Connection Method:

  • Through Self-Managed Instance: GitLab Self-Managed customers can optionally configure their installation to route Code Suggestions requests through their local GitLab Rails application instead of connecting directly. This alternative is configured by a GitLab administrator but is less commonly used.

Authentication Flow:

  • Users authenticate using personal access tokens for secure API connections (a request sketch follows this list)
  • For Self-Managed users calling the AI Gateway directly, authentication follows the same pattern as SaaS users
  • Detailed authentication and authorization flows are documented in the AI Gateway Architecture Design
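
As a rough illustration of the flow above, here is a minimal sketch of a direct AI Gateway request with bearer-token authentication. The exact URL path, auth header scheme, and payload field names (current_file, content_above_cursor, and so on) are assumptions for illustration; the AI Gateway Architecture Design remains the authoritative reference for the real request schema.

```python
import requests

# Hypothetical sketch of a direct AI Gateway code completion request.
# URL path, header scheme, and payload shape are assumptions; consult
# the AI Gateway Architecture Design for the authoritative details.
GATEWAY_URL = "https://cloud.gitlab.com/ai/v2/code/completions"  # assumed path
TOKEN = "glpat-..."  # personal access token

payload = {
    # Assumed shape: file context split around the cursor position.
    "current_file": {
        "file_name": "app.py",
        "content_above_cursor": "def add(a, b):\n    ",
        "content_below_cursor": "",
    },
}

response = requests.post(
    GATEWAY_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=10,
)
response.raise_for_status()
print(response.json())
```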

Prerequisites:

  • Premium or Ultimate subscription with the GitLab Duo Pro or Enterprise add-on
  • Assigned seat in a GitLab Duo subscription
  • GitLab 17.2 or later for the optimal experience
  • Personal access token for a secure API connection

Traffic Patterns:

  • Usage is typically higher Monday through Friday and lower on weekends.
  • Traffic tends to be highest during traditional working hours in each region.

Our monitoring is built around two key SLIs that align with our core functionality:

Code Completions SLI (server_code_completions)

  • Target: Response time < 1 second
  • Tolerated: Response time < 10 seconds
  • Failure: 5XX errors on /v2/code/completions or /v2/completions endpoints
  • User Impact: When errors occur, users don’t see any completions in their editor. This fails silently; no error is presented. In practice, users can retry fetching completions by continuing to write code.
  • Models Used: Vertex AI-hosted or Fireworks AI-hosted Codestral
    • inference_vertex - Tracks Vertex performance
    • inference_other - Tracks Fireworks performance

Code Generation SLI (server_code_generations)

  • Target: Response time < 5 seconds
  • Tolerated: Response time < 30 seconds
  • Failure: 5XX errors on /v2/code/generation endpoint
  • User Impact: When errors occur, users don’t receive generated code from their comments. This fails silently; no error is presented. In practice, users can retry generating the code by pressing Enter again.
  • Models Used: Anthropic Claude 3.7 Sonnet
    • inference_anthropic - Tracks Anthropic performance

Note: The alerts and dashboards in the Initial Triage section below are organized by these SLIs.
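
For reference, the Apdex arithmetic behind these targets: requests faster than the target threshold count as satisfied, requests under the tolerated threshold count half, and everything else counts zero. A minimal sketch (the production recording rules live in the metrics catalog; this is illustrative only):

```python
def apdex(satisfied: int, tolerated: int, total: int) -> float:
    """Standard Apdex: full credit for satisfied, half credit for tolerated."""
    if total == 0:
        return 1.0
    return (satisfied + tolerated / 2) / total

# server_code_completions example (satisfied < 1s, tolerated < 10s):
# 900 requests under 1s, 80 between 1s and 10s, 20 slower or failed.
print(apdex(satisfied=900, tolerated=80, total=1000))  # 0.94
```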

Code Suggestions alerts are surfaced through AI Gateway alerts. You can find more about that in the AI Gateway Runbook / Monitoring-Alerting Section.

The specific alerts for Code Suggestions are:

  • AiGatewayServiceServerCodeCompletions…
    • AiGatewayServiceServerCodeCompletionsApdexSLOViolation
    • AiGatewayServiceServerCodeCompletionsApdexSLOViolationRegional
    • AiGatewayServiceServerCodeCompletionsErrorSLOViolation
    • AiGatewayServiceServerCodeCompletionsErrorSLOViolationRegional
  • AiGatewayServiceServerCodeGenerations…
    • AiGatewayServiceServerCodeGenerationsApdexSLOViolation
    • AiGatewayServiceServerCodeGenerationsApdexSLOViolationRegional
    • AiGatewayServiceServerCodeGenerationsErrorSLOViolation
    • AiGatewayServiceServerCodeGenerationsErrorSLOViolationRegional

Apdex SLO Violations

This could be caused by an increase in latency or an increase in errors. The user impact is slower response times when generating code suggestions.

Client Behavior During Slow Requests:

  • Loading Indicator: Users will see a loading indicator in their IDE extension while waiting for suggestions
  • If users wait without typing or navigating, slow suggestions will eventually appear in their editor once the request completes
  • If users continue typing, moving the cursor, or navigating to different files while waiting, the delayed suggestion will be discarded when it finally returns, as it’s no longer contextually relevant (see the sketch after this list)
  • User Experience Impact: During SLO violations, users may experience:
    • More frequent “loading” states in their editor
    • Reduced suggestion frequency as they continue working while requests are pending


Error SLO Violations

This is caused by an increase in 5XX errors. When this happens, users will not see code suggestions appear in their IDE.

Step 1: Determine which AI Service is affected

Go to the AI Gateway Dashboard and identify if the issue is related to code completions, code generation, or some other service.

If this is a more general AI Gateway problem, refer to the AI Gateway Runbook.

Step 2: Investigate Code Completions Issues

If the dashboard indicates a code completions problem:

  1. Check SLI Apdex metrics
  2. Analyze Error Rates
  3. Review Latency Issues
    • Check p95 server_code_completions Latency: Grafana Link (see the query sketch after this list)
    • Examine charts in Grafana (AI Gateway Overview) for inference_vertex or inference_other (Fireworks)
    • Consider these factors:
      • Is the issue isolated to one region or affecting all regions?
      • Is it specific to a particular model?
      • Is it related to a specific provider (vertex/other/anthropic)?
      • Has there been an increase in requests (RPS)?
    • Review the log dashboard
      • Has there been an increase in prompt length (also called input tokens)? Longer prompts can lead to slower response times
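
If you need the raw number behind the Grafana panel, the same p95 can be queried from the Prometheus HTTP API. A sketch: the endpoint and the metric/label names below are assumptions, so copy the exact expression from the p95 latency panel itself:

```python
import requests

# Assumed Prometheus endpoint and metric name; take the real expression
# from the "p95 server_code_completions Latency" Grafana panel.
PROM_URL = "https://prometheus.example.internal/api/v1/query"

QUERY = (
    "histogram_quantile(0.95, sum by (le) ("
    "rate(code_suggestions_inference_request_duration_seconds_bucket[5m])))"
)

response = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
response.raise_for_status()
for result in response.json()["data"]["result"]:
    print(result["metric"], result["value"])
```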

Step 3: Investigate Code Generation Issues

If the dashboard indicates a code generation problem:

  1. Check SLI Apdex metrics
  2. Analyze Error Rates
  3. Review Latency Issues
    • Check p95 server_code_generations Latency: Grafana Link
    • Examine charts in Grafana (AI Gateway Overview) for inference_anthropic
    • Consider these factors:
      • Is the issue isolated to one region or affecting all regions?
      • Is it specific to a particular model?
      • Is it specific to a provider (vertex/other/anthropic)?
      • Has there been an increase in requests (RPS)?
    • Review the log dashboard
      • Has there been an increase in prompt length (also called input tokens)? Longer prompts can lead to slower response times

When experiencing high error rates, the most common cause is quota or rate limit issues with our LLM providers. Follow these steps to diagnose and resolve:

Step 1: Check Provider Quota Usage

Different providers have different methods for checking quota usage; start with the affected provider's console.

Step 2: Correlate Quota Issues with Error Patterns


After checking quotas, correlate findings with error logs:

  • Review the Log dashboard for HTTP 429 (rate limit) or 403 (quota exceeded) errors (a counting sketch follows this list)
  • Look for error patterns that align with the provider experiencing quota issues
  • Check if errors are concentrated during peak usage hours
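
One quick way to spot this pattern is to count status codes in an exported slice of the logs. A sketch, assuming a newline-delimited JSON export with a jsonPayload structure like the one referenced in the Kibana notes below; the file name and field layout are assumptions:

```python
import json
from collections import Counter

counts: Counter[str] = Counter()

# Assumed: an NDJSON export from the log dashboard, one record per line.
with open("ai_gateway_logs.ndjson") as fh:
    for line in fh:
        record = json.loads(line)
        status = str(record.get("jsonPayload", {}).get("status", "unknown"))
        counts[status] += 1

# 429 = rate limited, 403 = quota exceeded (per the correlation step above)
for status, count in counts.most_common():
    print(f"{status}: {count}")
```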

If quota/rate limit issues are confirmed:

  1. Immediate: Document the affected provider, quota type, and current utilization percentage
  2. Contact Provider: Reach out through the appropriate channel:
    • Google Cloud/Vertex: #ext-google-cloud Slack channel
    • Anthropic: #ext-anthropic Slack channel
    • Fireworks: #ext-gitlab-fireworks Slack channel (internal access required)
  3. Include Details: When contacting providers, include:
    • Current quota utilization percentage
    • Time range when issues began
    • Expected traffic patterns requiring higher limits
    • Business impact summary

Monitor Recovery:

  • Continue monitoring the AI Gateway Dashboard error rates
  • Verify quota increases take effect by re-checking provider consoles
  • Confirm error rates return to normal baseline levels

If latency is elevated:

  1. An increase in traffic can lead to latency issues; saturation of the LLM causes it to take longer to respond.
  2. An increase in tokens sent per request can make requests take longer.
  3. Check the AI Gateway Scalability Runbook.

If there are problems with a specific provider, we will need to work directly with them to resolve the problem. Ways to reach out in Slack:

  • #ext-google-cloud
  • #ext-anthropic
  • #ext-gitlab-fireworks (not currently public)

Prolonged Provider Outages - Model Failover


When a provider experiences extended outages or degraded performance that cannot be quickly resolved, Code Suggestions has a failover system to switch traffic to alternative model providers using feature flags.

When to Consider Failover:

  • Provider outage expected to last more than 30 minutes
  • Sustained high error rates (>10%) from a specific provider
  • Severe latency issues affecting user experience across a provider
  • Provider communication indicates extended maintenance windows

Failover Process:

  • Failover procedures require coordination with on-call engineers and must follow established protocols
  • Complete failover documentation and procedures: Code Suggestion Failover Runbook
  • Feature flag changes require appropriate approvals and should be coordinated through incident management procedures

Important Notes:

  • Failover is a significant operational change that affects all users
  • Always document the business justification and expected duration before initiating failover
  • Monitor closely after failover to ensure the alternative provider can handle the traffic load
  • Plan for failback once the primary provider issues are resolved

Kibana Logs:

Be sure the data source (data view) is “pubsub-mlops-inf-gprd-*”.

Look for json.jsonPayload.path values like “/v2/code/completions”; the path can vary by version or shape, for example “/v3/code/completions” or “/v4/code/suggestions”.
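
Because the path varies by version, an exact-match filter is brittle; a pattern covering the versioned variants is safer. A sketch of one such pattern (the field name comes from the note above; the surrounding code is illustrative):

```python
import re

# Matches versioned variants of the Code Suggestions request paths,
# e.g. /v2/code/completions, /v2/completions, /v4/code/suggestions.
PATH_PATTERN = re.compile(r"^/v\d+/(code/)?(completions|suggestions|generations?)$")

for path in ("/v2/code/completions", "/v2/completions",
             "/v3/code/completions", "/v4/code/suggestions"):
    print(path, bool(PATH_PATTERN.match(path)))
```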

Code Suggestions Overview Dashboard

Covers both code completion and code generation:

  • Request rates
  • Error counts by error code
  • User counts
  • Latency
  • Prompt lengths

Code Completion Durations

  • Latency for code completion (not code generation)
  • Broken down by region, provider, and model name

There are specific filtered versions as well:

AI Gateway Overview

Since all the Code Suggestions traffic flows through the AI Gateway, this dashboard is the best place to look. It has information about other services too (like Duo Chat).

SLI Details: inference_*

Details on the various model providers:

  • Fireworks can be found in inference_other. We currently use this for code completion with the text-completion-fireworks_ai/codestral-2501 or text-completion-fireworks_ai/qwen2p5-coder-7b models
  • Vertex/GCP can be found in inference_vertex. We currently use vertex_ai/codestral-2501 for code completions
  • Anthropic can be found in inference_anthropic. We use Claude for code generation, but other Duo features use it as well.

SLI Details: server_code_completions or SLI Details: server_code_generations

  • Details on latency, requests per second (RPS), and errors
  • Overall breakdown, per API endpoint, per Region

Code Suggestions Error Budget Details or Code Suggestions Group Dashboard

These have much less valuable information than the AI Gateway Overview dashboard.

This is a good source of historical data, but it is not updated in real time. Most of these charts can be filtered by model, provider, deployment type (SaaS, Self-Managed, and so on), and more.

Limited alerting data can be found in Sentry.