Duo Chat Runbook
Table of Contents
- Overview
- Quick Links
- Duo Chat specific Error Codes
- Expanded AI logging
- Tracing requests across different services
Overview
This page explains how to investigate Duo Chat issues on production.
Quick Links
- Dashboard:
  - Kibana based dashboard
  - Prometheus based dashboard … TBD
  - Evaluation by Prompt Library
  - Error budget dashboard
- GitLab-Rails:
- GitLab-Sidekiq:
- Redis:
- AI Gateway:
- Anthropic APIs:
  - SLI/SLO dashboard … TBD
  - Anthropic API Status Page (NOTE: this displays the operational status of the entire Anthropic system, which might be unrelated to our workload)
- Others:
Before starting the investigation
NOTE: Do NOT expose customer RED data in public issues. Redact it or make the issue confidential if you're unsure.
Before starting the investigation, please collect the following information:
- User name who encountered the bug (e.g. @johndoe)
- What happened (e.g. the user asked a question in Duo Chat and saw error code A1001)
- When it happened (e.g. Around 2024/09/16 01:00 UTC)
- Where it happened (e.g. VS Code, Web UI)
- How often it happens (e.g. it happens every time)
- Steps to reproduce (e.g. 1. Ask a question “xxx” 2. Click …)
- Whether we can enable Expanded AI logging and retry the bug so that we can collect process-level logging.
- (V2 Chat Agent only) Whether the bug can be reproduced with Chat Agent V1 as well. See how to disable the feature flag for a specific user.
- (Self-managed only) GitLab version (e.g. v17.4)
- (Self-managed only) GitLab host name (e.g. my-org.gitlab.io)
- (Self-managed only) Whether they use a GitLab-managed AI Gateway or a self-managed AI Gateway (if they use custom models, it's likely the latter)
- A link to a Slack discussion, if any.
Then create an issue in the GitLab issue tracker and ping @gitlab-org/ai-powered/duo-chat.
Log links for various environments can be found here.
Different deployments use different indexes. The following indexes are most helpful when debugging Duo Chat:
- AI Gateway logs are in the pubsub-mlops-inf-gprd-* index
- GitLab Rails Sidekiq logs are in the pubsub-sidekiq-inf-gprd* index
  - All LLM Sidekiq traffic is sent to a single Sidekiq shard; filtering on json.shard.keyword: "ai-abstraction-layer" will only return ai-abstraction-layer traffic.
  - When searching this index, filtering on json.subcomponent : "llm" ensures only LLM logs are returned.
- GitLab Rails logs are in the pubsub-rails-inf-gprd-* index
Chat GraphQL request logs for a user can be found with the following Kibana query in the Rails (pubsub-rails-inf-gprd-*) index:
json.meta.user : "your-gitlab-username" and json.meta.caller_id : "graphql:chat"
If you find requests for a user there but do not find any results for them using this Kibana query in the Sidekiq (pubsub-sidekiq-inf-gprd*) index:
json.meta.user : "username-that-received-error" and json.subcomponent : "llm"
That probably indicates a problem with Sidekiq where the job is not being kicked off. Check the #incident-management Slack channel to see if there are any ongoing Sidekiq issues. Chat relies on Sidekiq and should be considered "down" if Sidekiq is backed up. See Duo Chat does not respond or responds very slowly below.
AI Abstraction Layer Sidekiq Traffic
Duo Chat requests and some Duo experimental features go through an isolated urgent-ai-abstraction-layer Sidekiq shard, which provides a centralized platform for handling asynchronous jobs for our external LLM inferences. As part of the AI Framework's resilience objective, we've migrated our Sidekiq traffic onto a single shard to separate LLM requests from the rest of GitLab's Sidekiq jobs.
To find only Duo traffic, open the pubsub-sidekiq-inf-* index in Elasticsearch.
- Filter the logs by selecting json.shard.keyword: "urgent-ai-abstraction-layer" to limit results to logs coming from our respective Sidekiq containers, as in the example below.
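For example, combining the shard filter with the LLM subcomponent filter (both fields are described above) narrows the results to AI abstraction layer jobs only:
json.shard.keyword: "urgent-ai-abstraction-layer" and json.subcomponent : "llm"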
Important Feature Category information:
- Sidekiq feature category: urgent-ai-abstraction-layer
- GKE Deployment: gkeDeployment: 'gitlab-sidekiq-urgent-ai-abstraction-layer-v2'
- Queue Urgency: throttled
See this issue for more information.
Tracing requests across different services
We utilize a correlation_id attribute to track and correlate log entries across different services. This unique identifier serves as a key to tie together logs from different systems and components.
In the Duo Chat case, these components are mainly involved:
- GitLab-Rails … Provides the GraphQL API interface for frontend clients (VS Code extension, Web UI). Because it invokes a Sidekiq job to perform the actual processing in a separate process, this correlation ID can't be used to trace the actual processing.
- GitLab-Sidekiq … Performs the main processing of a Duo Chat request. If you're debugging Duo Chat functionality, you need to grab a correlation ID from the Sidekiq logs.
- AI Gateway … Performs the AI-related processing of a Duo Chat request. You can find the correlated logs from the Sidekiq correlation ID.
Here is an example of how to find correlated logs in the AI Gateway:
- Access the pubsub-mlops-inf-gprd-* index
- Filter for the logs with json.jsonPayload.correlation_id : <correlation_id>
- Optional: click on the expanded logs icon and select "Surrounding documents" to view logs from roughly the same time range.
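For example, if the Sidekiq log line you found contains the correlation ID abc123 (an illustrative value), the matching AI Gateway entries can be found with:
json.jsonPayload.correlation_id : "abc123"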
How to determine global user ID for a user
When troubleshooting requests from self-managed users on the AI Gateway, it may be helpful to find their global user ID to narrow down requests. They should run these commands in their instance's Rails console:
u = User.find_by_username(<USERNAME>)
Gitlab::GlobalAnonymousId.user_id(u)
Then you can filter by json.jsonPayload.gitlab_global_user_id to see requests from that specific user.
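For example, with the value returned by the console commands above (placeholder shown here):
json.jsonPayload.gitlab_global_user_id : "<global-user-id-from-console>"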
You can also attempt to figure it out if you know the gitlab_host_name and an approximate timestamp.
- Go to AI Gateway log (Duo Chat)
- Filter by json.jsonPayload.gitlab_host_name
- Narrow down the requests by the timestamp given by the customer
- Look at the requests in the given time period and try to determine a gitlab_global_user_id that fits.
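For example, using the illustrative host name from earlier in this page:
json.jsonPayload.gitlab_host_name : "my-org.gitlab.io"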
This process involves guesswork, so it is best to ask the customer directly.
Extra Kibana links
You can find other helpful log searches by looking at saved Kibana objects with the group::ai_framework tag.
Duo Chat specific error codes
All GitLab Duo Chat error codes are documented here. The error code prefix letter can help you choose which Kibana logs to search.
Error Code Layer Identifier
Code | Layer |
---|---|
M | Monolith - A network communication error in the monolith layer. |
G | AI Gateway - A data formatting/processing error in the AI gateway layer. |
A | Third-party API - An authentication or data access permissions error in a third-party API. |
Debugging Error Codes A1000-6000
When you receive an error code starting with 'A', the error is coming from the AI Gateway.
This can mean that the AI Gateway service itself is erroring or that a third-party LLM provider is returning an error to the AI Gateway.
- Check for any ongoing outages with our third-party LLM providers.
- Use the Grafana Dashboard to determine the overall impact. The Aggregated Service Level Indicators (SLIs) metric on that page indicates what percentage of users/requests are encountering errors.
- Track down the specific error:
  - Search for any Chat requests with errors for the user in the Sidekiq logs (pubsub-sidekiq-inf-gprd-*): json.meta.user : "username-that-received-error" and json.subcomponent : "llm" and json.error : *. The log line with the json.error value that matches what the user is seeing is the one you want. Copy the json.correlation_id value.
  - Search for the request in the AI Gateway logs (pubsub-mlops-inf-gprd-*): json.jsonPayload.correlation_id : "correlation_id-from-last-result"
  - The json.payload.Message value in the AI Gateway log results should indicate what error message we are receiving from Anthropic, if any.
Debugging Error Codes M3002 - M3004
The issue most likely exists within the Monolith. Look for this error in the Sidekiq logs.
- Filter the JSON logs down to the llm.log subcomponent with json.subcomponent.keyword : "llm"
- Filter to the specific M error code with json.error_code : "<error_code>"
- Check whether the issue occurs for a specific user with json.meta.user : "<user_name>"
- Make sure the calendar icon has the query active for the relevant time range; the default is the last 15 minutes relative to the current time.
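Putting these filters together, a query along the following lines (the error code and username are illustrative) narrows the results to a single user and error code:
json.subcomponent : "llm" and json.error_code : "M3002" and json.meta.user : "johndoe"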
Together, these filters should provide enough information to narrow the error logs down to a specific user and error code, along with the relevant llm log entries. Some common issues that cause these errors are:
- The Rails application having an issue with an access check for the current resource.
- Duo features not being enabled for the group or project.
Debugging Error Codes M3005
This error is fairly straightforward. The M3005 error code indicates that the user is requesting a chat capability that belongs to a higher add-on tier, which the user does not currently have access to. This error occurs when attempting to use features or functionalities that are not included in the user's current subscription level or plan.
Please report this issue to the development team in #g_ai_framework. It most likely indicates an issue with the access control guarding unit primitives.
Debugging Error Codes M4000
The following all relate to slash command issues.
Slash Command | Tool | SME Slack Channel |
---|---|---|
/troubleshoot | | #f_ci_rca |
/explain | | #g_code_creation |
/tests | | #g_code_creation |
/refactor | | #g_code_creation |
/summarize_comments | | #f_plan_ai |
/vulnerability_explain | | #f_ci_rca |
Duo Chat does not respond or responds very slowly
This could be caused by an issue with Sidekiq queues getting backed up. First, check the GitLab status page to see if there are any reported problems with Sidekiq or "background job processing". Then, check this dashboard. If you see that 'scheduling time for the completion worker' values are much higher than normal, it indicates the Sidekiq backup may be the problem.
Expanded AI logging
WARNING: DO NOT ENABLE FOR CUSTOMERS. GitLab does not retain input and output data unless customers provide consent through a GitLab Support Ticket.
We do allow the option to enable enhanced AI logging via the expanded_ai_logging feature flag. The flag will allow you to see the input and output of any of the following AI tools.
To enable expanded AI logging, go to the #production Slack channel and run the following command:
/chatops run feature set --user=$USERNAME expanded_ai_logging true
After the expanded_ai_logging feature flag is enabled for a user, you can view the user input and LLM output for any GitLab Duo Chat requests made by the user. We've extended this support to the AI Gateway as well, so you can get process-level logging, including the actual request parameters and the LLM response, in the AI Gateway logs.
Tips:
- To trace the request across different services, use the correlation ID.
- We only need to enable the flag while we reproduce the bug on production. After we've sampled a couple of problematic requests, we can disable the flag again and continue examining the logs, for example with the command below.
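Disabling uses the same chatops command with the value set to false:
/chatops run feature set --user=$USERNAME expanded_ai_logging false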
When a problem is only identified on staging
Here are the log links for staging:
Make sure you have access to Duo Chat on staging. If not, request access to Duo Enterprise in the #g_provision Slack channel (for non-production environments only).
If there is a problem only on staging but not production, the env variables may be at fault. Compare the default env variables for staging and production to see if you can spot a relevant difference.
How to identify IDE-specific problems
When a customer reports a problem with Duo Chat in the IDE, it can be difficult to tell if the problem is IDE-specific or not.
The first step is to have the customer test their query on the web version of Duo Chat. If they have the same problem on web, it is not IDE-specific.
Then, they can perform these steps to determine if it is a backend (AI Gateway) problem, or client-side on the editor:
- Ask the question in web
- Observe that it works
- Run the /reset command on the web version
- Ask the same question in the IDE plugin
- Go back to the web and refresh the page
- Check if there is a response there
If the response does not show up on web, it is likely an AI Gateway problem. If the response does show up on web, it is likely a client-side IDE problem.
When a Duo Chat specific error code happened on self-managed GitLab
When a Duo Chat specific error code occurs on self-managed GitLab, the following logs are helpful for further investigation:
- LLM log
  - To get the full details, the expanded_ai_logging feature flag needs to be enabled. Please see the admin doc for more information.
- Sidekiq log
Collect the logs from the timestamp at which the user reproduced the error code. A 5-10 minute time range should be enough.
After we've collected the logs, we do the following:
- Filter llm.log by the error code (the LLM log outputs the error code as-is). Extract the correlation ID from the same log line.
- Filter llm.log and sidekiq.log by the extracted correlation ID. This gives us the details of the process flow, which is crucial for identifying where things went wrong.
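As a rough illustration only (the log file names come from the steps above, but the JSON field names error_code and correlation_id are assumptions about the log schema), the filtering could be scripted along these lines:
```ruby
# Rough sketch: find correlation IDs for a given Duo Chat error code in llm.log,
# then print every llm.log and sidekiq.log line that shares those correlation IDs.
# Field names ("error_code", "correlation_id") are assumptions about the log schema.
require 'json'

error_code = 'M3002' # illustrative error code

correlation_ids = File.readlines('llm.log').filter_map do |line|
  entry = JSON.parse(line) rescue nil
  next unless entry && entry['error_code'].to_s.include?(error_code)

  entry['correlation_id']
end.uniq

%w[llm.log sidekiq.log].each do |file|
  File.foreach(file) do |line|
    puts "#{file}: #{line}" if correlation_ids.any? { |id| line.include?(id) }
  end
end
```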
Rate Limits
Duo Chat has the following rate limits:
- AI Action Rate Limit: 160 calls per 8 hours per authenticated user
  - This limit applies to GraphQL aiAction mutations
  - When exceeded, returns error code A1001 with the message "This endpoint has been requested too many times. Try again later"
  - Configured via application_settings.ai_action_api_rate_limit
  - Can be monitored in the GitLab Rails error rates dashboard
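If the limit needs to be checked or adjusted globally, a minimal Rails console sketch might look like the following (this assumes ai_action_api_rate_limit is exposed as a standard application setting attribute; verify before changing anything on a production instance):
```ruby
# Rails console sketch; attribute name taken from the setting referenced above
Gitlab::CurrentSettings.ai_action_api_rate_limit                  # inspect the current limit
ApplicationSetting.current.update(ai_action_api_rate_limit: 160)  # adjust if needed (assumed attribute)
```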
When users hit rate limits:
- Check current rate limit usage:
  # In Rails console
  user = User.find_by_username('username')
  Gitlab::ApplicationRateLimiter.throttled?(:ai_action, scope: [user], peek: true)
- For temporary relief, rate limits can be reset:
  # In Rails console
  Gitlab::RateLimitHelpers.new.reset_rate_limits(:ai_action, user)
- For persistent issues, consider:
  - Reviewing usage patterns to identify potential abuse
  - Adjusting the global rate limit via application settings if needed
  - Adding the user to an allowlist if this is a legitimate high-usage case