DuoWorkflowSvcServiceLlmErrorSLOViolation

  • This alert fires when the error rate of LLM inference requests exceeds the SLO threshold.
  • LLM failures indicate that requests to the LLM provider are failing.
  • This alert indicates that we’re experiencing higher-than-acceptable request failure rates.
  • The root cause can be invalid input, a payload that exceeds the provider’s limits, or an issue with the LLM provider itself (overload, an outage, etc.).
  • Possible user impacts
    • Users will see errors like ‘There was an error connecting to the chosen LLM provider, please try again or contact support if the issue persists.’ or ‘There was an error processing your request in the Duo Agent Platform, please try again or contact support if the issue persists.’
    • It’s possible that these errors are isolated to a single LLM provider or a specific model.
  • The metric used is gitlab_component_errors:confidence:ratio_1h and gitlab_component_errors:confidence:ratio_6h for the llm component of duo-workflow-svc.
  • This metric measures the error rate of LLM inference requests, expressed as a percentage (0-100%).
  • The SLO threshold is a 1% error rate, meaning the alert fires when the error ratio exceeds this threshold (see the query sketch after this list).
  • Link to metric catalogue
  • To silence the alert, please visit Alert Manager Dashboard
  • This alert is expected to be rare under normal conditions. High frequency indicates a problem with the payload we send to the LLM provider or issues with the LLM provider itself.
  • The following are possible causes of this alert:
    • The LLM provider is experiencing issues or an outage
    • The request is invalid (e.g., a tool call is not followed by a tool result)
    • Requests are failing due to authentication errors
    • The payload exceeds the token limit of the LLM provider
  • This alert creates S2 incidents (High severity, pages on-call).
  • All GitLab.com, self-managed, and Dedicated customers using Duo Workflow features are potentially impacted (other than those using a self-hosted Duo Agent Platform (DAP)).
  • Review Incident Severity Handbook page to identify the required Severity Level.
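  The sketch below is one way to spot-check the underlying metric: it queries the Prometheus HTTP API for the 1h error ratio of the llm component and compares it with the SLO threshold. The Prometheus base URL and the label selector are assumptions; check the metric catalogue entry for the exact labels used by the recording rule.

  ```python
  import requests

  # A minimal sketch, assuming a reachable Prometheus instance.
  # The base URL and label selector are placeholders.
  PROMETHEUS_URL = "https://prometheus.example.com"  # placeholder

  # The runbook describes the metric as a percentage (0-100%) with a 1% SLO.
  # If the recording rule in your environment emits a 0-1 ratio instead, use 0.01.
  SLO_THRESHOLD = 1.0

  QUERY = (
      'gitlab_component_errors:confidence:ratio_1h'
      '{type="duo-workflow-svc", component="llm"}'
  )

  resp = requests.get(
      f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
  )
  resp.raise_for_status()

  # An instant query returns one sample per matching series;
  # flag any series whose error ratio is above the SLO threshold.
  for result in resp.json()["data"]["result"]:
      ratio = float(result["value"][1])
      status = "SLO VIOLATION" if ratio > SLO_THRESHOLD else "ok"
      print(f'{result["metric"]}: error ratio {ratio:.3f}% ({status})')
  ```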
  1. Identify the source of LLM errors:

    • Check if there are bad request errors in the DWS logs
    • If so, check which LLM providers are affected (e.g., Vertex AI or Anthropic); the log-summary sketch after these steps can help group errors by provider
  2. Check LLM provider status:

    • Check the status pages of the affected LLM providers (e.g., Anthropic and Google Cloud for Vertex AI)
  3. Check for recent changes:

    • Review the changes listed in the Recent changes section.
    • If a recent change caused the issue, consider rolling back.
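  To support step 1, here is a minimal sketch for summarising error entries from an export of the DWS logs by provider, model, and error class, which helps confirm whether failures are isolated to one provider or model. The file path and field names (severity, llm_provider, model, error_class) are assumptions; map them to the actual DWS log schema.

  ```python
  import json
  from collections import Counter

  # Minimal sketch: group error entries from an exported JSON-lines log file
  # by LLM provider, model, and error class. Field names are assumptions and
  # should be adjusted to the real DWS log schema.
  def summarize_llm_errors(log_path: str) -> Counter:
      counts: Counter = Counter()
      with open(log_path) as fh:
          for line in fh:
              try:
                  entry = json.loads(line)
              except json.JSONDecodeError:
                  continue  # skip non-JSON lines
              if entry.get("severity") != "ERROR":
                  continue
              key = (
                  entry.get("llm_provider", "unknown"),
                  entry.get("model", "unknown"),
                  entry.get("error_class", "unknown"),
              )
              counts[key] += 1
      return counts

  if __name__ == "__main__":
      # "dws-logs.json" is a placeholder for an exported log file.
      for (provider, model, error), n in summarize_llm_errors("dws-logs.json").most_common():
          print(f"{provider} / {model} / {error}: {n}")
  ```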
  • N/A. We don’t have historical data on how this alert has been resolved.
  • For investigation and resolution assistance, reach out to #g_agent_foundations on Slack.