
Code Suggestions

GitLab Duo Code Suggestions provides two distinct AI-powered coding assistance functions:

  • Code Completion:
    • Powered by Vertex AI-hosted Codestral (GitLab 17.11 and earlier) and Fireworks AI-hosted Codestral (GitLab 18.0 and later)
    • Response time: Satisfied < 1s, Tolerated < 10s
    • Activated automatically while typing
  • Code Generation:
    • Powered by Anthropic Claude 3.7 Sonnet
    • Response time: Satisfied < 5s, Tolerated < 30s (complex generations can exceed 5 seconds)
    • Triggered by natural language comments followed by Enter, or by empty functions (see the example below)
    • Supports streaming in JetBrains and Visual Studio IDEs
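
As an illustration of the comment-driven trigger: a descriptive comment followed by Enter (or an empty function body) prompts Code Generation to propose an implementation. The snippet below is illustrative only; the function body shown is one plausible completion, not guaranteed model output.

```python
# Write a function that checks whether a string is a palindrome,
# ignoring case and non-alphanumeric characters.
def is_palindrome(s: str) -> bool:
    # One plausible generated implementation:
    cleaned = "".join(ch.lower() for ch in s if ch.isalnum())
    return cleaned == cleaned[::-1]
```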

Available in:

  • VS Code (GitLab Workflow extension v6.2.2+)
  • JetBrains IDEs (GitLab extension v3.6.5+)
  • Visual Studio (GitLab extension v0.51.0+)
  • Neovim (GitLab plugin v1.1.0+)
  • GitLab Web IDE

Primary Connection Method:

  • Direct to AI Gateway: Almost all users (including most GitLab Self-Managed customers) connect directly from their IDE to the AI Gateway at cloud.gitlab.com

Alternative Connection Method:

  • Through Self-Managed Instance: GitLab Self-Managed customers can optionally configure their installation to route Code Suggestions requests through their local GitLab Rails application instead of connecting directly. This alternative is configured by a GitLab administrator but is less commonly used.

Authentication Flow:

  • Users authenticate using personal access tokens for secure API connections (a request sketch follows this list)
  • For Self-Managed users calling the AI Gateway directly, authentication follows the same pattern as SaaS users
  • Detailed authentication and authorization flows are documented in the AI Gateway Architecture Design
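
As a rough illustration of the flow above, here is a minimal sketch of a direct AI Gateway request with bearer-token authentication. The exact URL path, auth header scheme, and payload field names (current_file, content_above_cursor, and so on) are assumptions for illustration; the AI Gateway Architecture Design remains the authoritative reference for the real request schema.

```python
import requests

# Hypothetical sketch of a direct AI Gateway code completion request.
# URL path, header scheme, and payload shape are assumptions; consult
# the AI Gateway Architecture Design for the authoritative details.
GATEWAY_URL = "https://cloud.gitlab.com/ai/v2/code/completions"  # assumed path
TOKEN = "glpat-..."  # personal access token

payload = {
    # Assumed shape: file context split around the cursor position.
    "current_file": {
        "file_name": "app.py",
        "content_above_cursor": "def add(a, b):\n    ",
        "content_below_cursor": "",
    },
}

response = requests.post(
    GATEWAY_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=10,
)
response.raise_for_status()
print(response.json())
```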

Prerequisites:

  • Premium or Ultimate subscription with the GitLab Duo Pro or Enterprise add-on
  • Assigned seat in a GitLab Duo subscription
  • GitLab 17.2 or later for the optimal experience
  • Personal access token for a secure API connection

Traffic Patterns:

  • Usage is typically higher Monday through Friday and lower on weekends.
  • Traffic tends to be highest during traditional working hours in each region.

Our monitoring is built around two key SLIs that align with our core functionality:

Code Completions SLI (server_code_completions)

  • Target: Response time < 1 second
  • Tolerated: Response time < 10 seconds
  • Failure: 5XX errors on /v2/code/completions or /v2/completions endpoints
  • User Impact: When errors occur, users don’t see any completions in their editor. This fails silently; no error is presented. In practice, users can retry fetching completions by continuing to write code.
  • Models Used: Vertex AI-hosted or Fireworks AI-hosted Codestral
    • inference_vertex - Tracks Vertex performance
    • inference_other - Tracks Fireworks performance

Code Generation SLI (server_code_generations)

  • Target: Response time < 5 seconds
  • Tolerated: Response time < 30 seconds
  • Failure: 5XX errors on /v2/code/generation endpoint
  • User Impact: When errors occur, users don’t receive generated code from their comments. This fails silently; no error is presented. In practice, users can retry generating the code by pressing Enter again.
  • Models Used: Anthropic Claude 3.7 Sonnet
    • inference_anthropic - Tracks Anthropic performance

Note: The alerts and dashboards in the Initial Triage section below are organized by these SLIs.
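
For reference, the Apdex arithmetic behind these targets: requests faster than the target threshold count as satisfied, requests under the tolerated threshold count half, and everything else counts zero. A minimal sketch (the production recording rules live in the metrics catalog; this is illustrative only):

```python
def apdex(satisfied: int, tolerated: int, total: int) -> float:
    """Standard Apdex: full credit for satisfied, half credit for tolerated."""
    if total == 0:
        return 1.0
    return (satisfied + tolerated / 2) / total

# server_code_completions example (satisfied < 1s, tolerated < 10s):
# 900 requests under 1s, 80 between 1s and 10s, 20 slower or failed.
print(apdex(satisfied=900, tolerated=80, total=1000))  # 0.94
```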

Code Suggestions alerts are surfaced through AI Gateway alerts. You can find more about that in the AI Gateway Runbook / Monitoring-Alerting Section.

The specific alerts for Code Suggestions are:

  • AiGatewayServiceServerCodeCompletions…
    • AiGatewayServiceServerCodeCompletionsApdexSLOViolation
    • AiGatewayServiceServerCodeCompletionsApdexSLOViolationRegional
    • AiGatewayServiceServerCodeCompletionsErrorSLOViolation
    • AiGatewayServiceServerCodeCompletionsErrorSLOViolationRegional
  • AiGatewayServiceServerCodeGenerations…
    • AiGatewayServiceServerCodeGenerationsApdexSLOViolation
    • AiGatewayServiceServerCodeGenerationsApdexSLOViolationRegional
    • AiGatewayServiceServerCodeGenerationsErrorSLOViolation
    • AiGatewayServiceServerCodeGenerationsErrorSLOViolationRegional

Apdex SLO Violations

This could be caused by an increase in latency or an increase in errors. The user impact is slower response times when generating code suggestions.

Client Behavior During Slow Requests:

  • Loading Indicator: Users will see a loading indicator in their IDE extension while waiting for suggestions
  • If users wait without typing or navigating, slow suggestions will eventually appear in their editor once the request completes
  • If users continue typing, moving the cursor, or navigating to different files while waiting, the delayed suggestion will be discarded when it finally returns, as it’s no longer contextually relevant (see the sketch after this list)
  • User Experience Impact: During SLO violations, users may experience:
    • More frequent “loading” states in their editor
    • Reduced suggestion frequency as they continue working while requests are pending


Error SLO Violations

This is caused by an increase in 5XX errors. When this happens, users will not see code suggestions appear in their IDE.

Step 1: Determine which AI Service is affected

Go to the AI Gateway Dashboard and identify if the issue is related to code completions, code generation, or some other service.

If this is a more general AI Gateway problem, refer to the AI Gateway Runbook.

Step 2: Investigate Code Completions Issues

If the dashboard indicates a code completions problem:

  1. Check SLI Apdex metrics
  2. Analyze Error Rates
  3. Review Latency Issues
    • Check p95 server_code_completions Latency: Grafana Link (see the query sketch after this list)
    • Examine charts in Grafana (AI Gateway Overview) for inference_vertex or inference_other (Fireworks)
    • Consider these factors:
      • Is the issue isolated to one region or affecting all regions?
      • Is it specific to a particular model?
      • Is it related to a specific provider (vertex/other/anthropic)?
      • Has there been an increase in requests (RPS)?
    • Review the log dashboard
      • Has there been an increase in prompt length (also called input tokens)? Longer prompts can lead to slower response times
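
If you need the raw number behind the Grafana panel, the same p95 can be queried from the Prometheus HTTP API. A sketch: the endpoint and the metric/label names below are assumptions, so copy the exact expression from the p95 latency panel itself:

```python
import requests

# Assumed Prometheus endpoint and metric name; take the real expression
# from the "p95 server_code_completions Latency" Grafana panel.
PROM_URL = "https://prometheus.example.internal/api/v1/query"

QUERY = (
    "histogram_quantile(0.95, sum by (le) ("
    "rate(code_suggestions_inference_request_duration_seconds_bucket[5m])))"
)

response = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
response.raise_for_status()
for result in response.json()["data"]["result"]:
    print(result["metric"], result["value"])
```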

Step 3: Investigate Code Generation Issues

If the dashboard indicates a code generation problem:

  1. Check SLI Apdex metrics
  2. Analyze Error Rates
  3. Review Latency Issues
    • Check p95 server_code_generations Latency: Grafana Link
    • Examine charts in Grafana (AI Gateway Overview) for inference_anthropic
    • Consider these factors:
      • Is the issue isolated to one region or affecting all regions?
      • Is it specific to a particular model?
      • Is it specific to a provider (vertex/other/anthropic)?
      • Has there been an increase in requests (RPS)?
    • Review the log dashboard
      • Has there been an increase in prompt length (also called input tokens)? Longer prompts can lead to slower response times

When experiencing high error rates, the most common cause is quota or rate limit issues with our LLM providers. Follow these steps to diagnose and resolve:

Step 1: Check Provider Quota Usage

Different providers have different methods for checking quota usage; start with the affected provider's console.

Step 2: Correlate Quota Issues with Error Patterns


After checking quotas, correlate findings with error logs:

  • Review the Log dashboard for HTTP 429 (rate limit) or 403 (quota exceeded) errors (a counting sketch follows this list)
  • Look for error patterns that align with the provider experiencing quota issues
  • Check if errors are concentrated during peak usage hours
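
One quick way to spot this pattern is to count status codes in an exported slice of the logs. A sketch, assuming a newline-delimited JSON export with a jsonPayload structure like the one referenced in the Kibana notes below; the file name and field layout are assumptions:

```python
import json
from collections import Counter

counts: Counter[str] = Counter()

# Assumed: an NDJSON export from the log dashboard, one record per line.
with open("ai_gateway_logs.ndjson") as fh:
    for line in fh:
        record = json.loads(line)
        status = str(record.get("jsonPayload", {}).get("status", "unknown"))
        counts[status] += 1

# 429 = rate limited, 403 = quota exceeded (per the correlation step above)
for status, count in counts.most_common():
    print(f"{status}: {count}")
```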

If quota/rate limit issues are confirmed:

  1. Immediate: Document the affected provider, quota type, and current utilization percentage
  2. Contact Provider: Reach out through the appropriate channel:
    • Google Cloud/Vertex: #ext-google-cloud Slack channel
    • Anthropic: #ext-anthropic Slack channel
    • Fireworks: #ext-gitlab-fireworks Slack channel (internal access required)
  3. Include Details: When contacting providers, include:
    • Current quota utilization percentage
    • Time range when issues began
    • Expected traffic patterns requiring higher limits
    • Business impact summary

Monitor Recovery:

  • Continue monitoring the AI Gateway Dashboard error rates
  • Verify quota increases take effect by re-checking provider consoles
  • Confirm error rates return to normal baseline levels

If latency is elevated:

  1. An increase in traffic can lead to latency issues; saturation of the LLM causes it to take longer to respond.
  2. An increase in tokens sent per request can make requests take longer.
  3. Check the AI Gateway Scalability Runbook.

If there are problems with a specific provider, we will need to work directly with them to resolve the problem. Ways to reach out in Slack:

  • #ext-google-cloud
  • #ext-anthropic
  • #ext-gitlab-fireworks (not currently public)

Prolonged Provider Outages - Model Failover


When a provider experiences extended outages or degraded performance that cannot be quickly resolved, Code Suggestions has a failover system to switch traffic to alternative model providers using feature flags.

When to Consider Failover:

  • Provider outage expected to last more than 30 minutes
  • Sustained high error rates (>10%) from a specific provider
  • Severe latency issues affecting user experience across a provider
  • Provider communication indicates extended maintenance windows

Failover Process:

  • Failover procedures require coordination with on-call engineers and must follow established protocols
  • Complete failover documentation and procedures: Code Suggestion Failover Runbook
  • Feature flag changes require appropriate approvals and should be coordinated through incident management procedures

Important Notes:

  • Failover is a significant operational change that affects all users
  • Always document the business justification and expected duration before initiating failover
  • Monitor closely after failover to ensure the alternative provider can handle the traffic load
  • Plan for failback once the primary provider issues are resolved

Kibana Logs:

Be sure the data source (data view) is “pubsub-mlops-inf-gprd-*”.

Look for json.jsonPayload.path values like “/v2/code/completions”; the path can vary by version or shape, for example “/v3/code/completions” or “/v4/code/suggestions”.
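
Because the path varies by version, an exact-match filter is brittle; a pattern covering the versioned variants is safer. A sketch of one such pattern (the field name comes from the note above; the surrounding code is illustrative):

```python
import re

# Matches versioned variants of the Code Suggestions request paths,
# e.g. /v2/code/completions, /v2/completions, /v4/code/suggestions.
PATH_PATTERN = re.compile(r"^/v\d+/(code/)?(completions|suggestions|generations?)$")

for path in ("/v2/code/completions", "/v2/completions",
             "/v3/code/completions", "/v4/code/suggestions"):
    print(path, bool(PATH_PATTERN.match(path)))
```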

Code Suggestions Overview Dashboard

Covers both code completion and code generation:

  • Request rates
  • Error counts by error code
  • User counts
  • Latency
  • Prompt lengths

Code Completion Durations

  • Latency for code completion (not code generation)
  • Broken down by region, provider, and model name

There are specific filtered versions as well:

AI Gateway Overview

Since all the Code Suggestions traffic flows through the AI Gateway, this dashboard is the best place to look. It has information about other services too (like Duo Chat).

SLI Details: inference_*

Details on the various model providers:

  • Fireworks can be found in inference_other. We currently use this for code completion with the text-completion-fireworks_ai/codestral-2501 or text-completion-fireworks_ai/qwen2p5-coder-7b models
  • Vertex/GCP can be found in inference_vertex. We currently use vertex_ai/codestral-2501 for code completions
  • Anthropic can be found in inference_anthropic. We use Claude for code generation, but other Duo features use it as well.

SLI Details: server_code_completions or SLI Details: server_code_generations

  • Details on latency, requests per second (RPS), and errors
  • Overall breakdown, per API endpoint, per Region

Code Suggestions Error Budget Details or Code Suggestions Group Dashboard

These have much less valuable information than the AI Gateway Overview dashboard.

This is a good source of historical data, but it is not updated in real time. Most of these charts can be filtered by model, provider, deployment type (SaaS, Self-Managed, and so on), and more.

Limited alerting data can be found in Sentry.