DuoWorkflowSvcServiceLlmErrorSLOViolation
Overview
- This alert fires when the error rate of LLM inference requests exceeds the SLO threshold.
- LLM failures indicate that requests to the LLM provider are failing.
- This alert indicates that we’re experiencing higher-than-acceptable request failure rates.
- The root cause may be invalid input, an oversized payload, or an issue with the LLM provider (overload, outage, etc.).
- Possible user impacts:
- Users will see errors like ‘There was an error connecting to the chosen LLM provider, please try again or contact support if the issue persists.’ or ‘There was an error processing your request in the Duo Agent Platform, please try again or contact support if the issue persists.’
- These errors may be isolated to a single LLM provider or a specific model.
Services
- Duo Workflow Service overview
- Team that owns the service: Agent Foundations
Metrics
- The metrics used are gitlab_component_errors:confidence:ratio_1h and gitlab_component_errors:confidence:ratio_6h for the llm component of duo-workflow-svc.
- These metrics measure the error rate of LLM inference requests, expressed as a percentage (0-100%).
- The SLO threshold is a 1% error rate: the alert fires when the error rate exceeds this threshold (a query sketch follows this list).
- Link to metric catalogue
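As a rough sketch (not part of the alert definition), the recording rules above can be queried directly through the Prometheus HTTP API. The base URL and the component/type label selector below are assumptions for illustration; take the exact expression from the Prometheus link in the Verification section.

```python
# Sketch: fetch the 1h LLM error ratio from the Prometheus HTTP API and compare
# it to the 1% SLO. The base URL and the component/type labels are assumptions;
# use the selector from the actual alert expression instead.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "https://prometheus.example.gitlab.net"  # placeholder URL
QUERY = 'gitlab_component_errors:confidence:ratio_1h{component="llm", type="duo-workflow-svc"}'
SLO_THRESHOLD_PERCENT = 1.0  # the ratio is expressed as a percentage (0-100%)

url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url, timeout=10) as resp:
    series = json.load(resp)["data"]["result"]

for s in series:
    value = float(s["value"][1])  # instant query result: [timestamp, "value"]
    state = "VIOLATING" if value > SLO_THRESHOLD_PERCENT else "within"
    print(f"{s['metric']}: {value:.3f}% ({state} the 1% SLO)")
```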
Alert Behavior
- To silence the alert, please visit Alert Manager Dashboard
- This alert is expected to be rare under normal conditions. High frequency indicates a problem with the payload we send to the LLM provider or issues with the LLM provider itself.
- Any of the following can cause this alert:
  - The LLM provider is experiencing issues or an outage.
  - The request is invalid (e.g. a tool call is not followed by a tool result; see the sketch after this list).
  - Requests are failing due to authentication errors.
  - The payload is over the token limit of the LLM provider.
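As a concrete, hypothetical illustration of the invalid-request cause above, the sketch below sends the Anthropic API a conversation where an assistant tool_use block is never answered with a matching tool_result, which the API rejects with an HTTP 400. The model name, tool definition, and API-key handling are assumptions, not values taken from the Duo Workflow Service.

```python
# Hypothetical reproduction of the "tool call not followed by a tool result"
# cause: the assistant turn contains a tool_use block, but the next user turn
# has no matching tool_result, so the Anthropic API returns HTTP 400.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

try:
    client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=128,
        tools=[{
            "name": "get_weather",
            "description": "Look up the weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }],
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"},
            # Assistant turn ends with a tool call ...
            {"role": "assistant", "content": [
                {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
                 "input": {"city": "Paris"}},
            ]},
            # ... but the next user turn is missing the matching tool_result.
            {"role": "user", "content": "Please continue."},
        ],
    )
except anthropic.BadRequestError as err:
    # This is the class of error that surfaces as an LLM failure in the SLI.
    print("invalid request:", err)
```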
Severities
- This alert creates S2 incidents (High severity, pages on-call).
- All GitLab.com, self-managed, and Dedicated customers using Duo Workflow features (other than those using self-hosted DAP) are potentially impacted.
- Review the Incident Severity Handbook page to identify the required Severity Level.
Verification
- Prometheus link to query that triggered the alert
- Duo Workflow Service Overview Dashboard
- See “SLI Detail: llm” section in the Duo Workflow Service Overview Dashboard for further information.
Recent changes
Troubleshooting
- Identify the source of LLM errors:
  - Check if there are bad request errors in DWS logs.
  - If so, check which LLM providers are affected (e.g. Vertex / Anthropic).
- Check LLM provider status (a small polling sketch follows this list):
  - Anthropic API Status
  - GCP Vertex AI Status
  - Check if the LLM provider is experiencing issues.
- Check for recent changes:
  - Review recent changes mentioned under the Recent changes section.
  - If a recent change caused the issue, consider rolling back.
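If it helps during triage, the two provider status pages can also be polled programmatically. The endpoints below are assumptions based on common status-page conventions (an Atlassian Statuspage summary feed for Anthropic and the public Google Cloud incidents feed); the status pages linked above remain the source of truth.

```python
# Triage helper (sketch): poll public status feeds for the providers listed
# above. Both endpoint URLs are assumptions based on common conventions.
import json
import urllib.request

def fetch_json(url: str):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

# Anthropic status page (assumed to expose a Statuspage-style summary).
anthropic_status = fetch_json("https://status.anthropic.com/api/v2/status.json")
print("Anthropic:", anthropic_status["status"]["description"])

# Google Cloud incidents feed (assumed endpoint); keep incidents that mention
# Vertex AI and have no "end" timestamp, i.e. are still open.
incidents = fetch_json("https://status.cloud.google.com/incidents.json")
open_vertex = [i for i in incidents
               if "Vertex AI" in json.dumps(i) and not i.get("end")]
print(f"GCP Vertex AI: {len(open_vertex)} open incident(s)")
```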
Possible Resolutions
- N.A. We don’t have historical data on this alert’s resolutions.
Dependencies
- Anthropic API
- GCP Vertex AI
- AI Gateway / Duo Workflow Service
Escalation
- For investigation and resolution assistance, reach out to #g_agent_foundations on Slack.