DuoWorkflowSvcServiceToolUseErrorSLOViolation
Overview
- This alert fires when the error rate of tool calls within agent platform sessions exceeds the SLO threshold.
- Tool failures are non-fatal events that don’t immediately terminate sessions since agents can retry or use alternative approaches.
- However, increasing failure rates serve as early warning indicators of potential system issues.
- This alert indicates that tools are failing more frequently than expected. These tools can be invoked within Duo Workflow Service or on the client (e.g. the user's machine, through an editor extension).
- Possible user impacts
- Some tools are not usable (e.g. GitLab search, run command).
- Overall platform quality degrades because some tools cannot be used.
Services
- Duo Workflow Service overview
- Team that owns the service: Agent Foundations
Metrics
- The metrics used are gitlab_component_errors:confidence:ratio_1h and gitlab_component_errors:confidence:ratio_6h for the tool_use component of duo-workflow-svc.
- These metrics measure the error rate of tool calls, expressed as a percentage (0-100%).
- The SLO threshold is a 5% error rate, meaning the alert fires when the error rate exceeds this threshold.
- Link to metric catalogue
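The threshold check described above can be sketched in a few lines of Python. This is only an illustration: the function names and sample counts are hypothetical, and the rule that both the 1h and 6h windows must exceed the threshold is an assumption, not something this runbook states. The real alert is evaluated in Prometheus from the recorded metrics.

```python
# From the runbook: the alert fires above a 5% tool-call error rate.
SLO_ERROR_THRESHOLD_PERCENT = 5.0


def error_rate_percent(failed_calls: int, total_calls: int) -> float:
    """Return the tool-call error rate as a percentage (0-100)."""
    if total_calls == 0:
        return 0.0  # no traffic, so no measurable errors
    return 100.0 * failed_calls / total_calls


def violates_slo(rate_1h: float, rate_6h: float,
                 threshold: float = SLO_ERROR_THRESHOLD_PERCENT) -> bool:
    """Assumption: both windows must exceed the threshold for a violation."""
    return rate_1h > threshold and rate_6h > threshold


# Hypothetical counts: 80 failures out of 1000 calls in the last hour,
# 330 failures out of 6000 calls in the last six hours.
rate_1h = error_rate_percent(80, 1000)
rate_6h = error_rate_percent(330, 6000)
print(rate_1h, rate_6h, violates_slo(rate_1h, rate_6h))
```

Requiring both windows is one way to avoid reacting to a short spike that has already subsided. If the actual rule differs, only `violates_slo` needs to change.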
Alert Behavior
- To silence the alert, visit the Alert Manager Dashboard.
- This alert is expected to be rare under normal conditions. High frequency indicates tool integration issues.
Severities
- This alert creates S2 incidents (High severity, pages on-call).
- All GitLab.com users using Duo Workflow features with tool integrations are potentially impacted.
- Review the Incident Severity Handbook page to identify the required Severity Level.
Verification
- Prometheus link to query that triggered the alert
- Duo Workflow Service Overview Dashboard
- See “SLI Detail: tool_use” section in the Duo Workflow Service Overview Dashboard for further information.
Recent changes
Troubleshooting
- Check which tools are failing:
  - See tool errors graph for a breakdown of errors per tool.
  - Check whether the issue is limited to a specific tool.
- Check Duo Workflow Service logs.
- Check for recent changes:
  - Review recent changes mentioned under the Recent changes section.
  - Check if a recent deployment affected tool errors.
  - If a recent change caused the issue, consider rolling back.
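When the dashboard breakdown is not conclusive, the "is it limited to a specific tool" check can also be done by hand on exported data. The sketch below is a minimal illustration, assuming tool-call records are available as (tool name, succeeded) pairs. The record shape, the tool names, and the 75% cutoff are all hypothetical and do not come from the service.

```python
from collections import Counter


def errors_per_tool(records):
    """Count failed calls per tool name from (tool_name, succeeded) pairs."""
    return Counter(tool for tool, succeeded in records if not succeeded)


def dominant_tool(error_counts, share=0.75):
    """Return the tool causing at least `share` of all errors, else None."""
    total = sum(error_counts.values())
    if total == 0:
        return None
    tool, count = error_counts.most_common(1)[0]
    return tool if count / total >= share else None


# Hypothetical sample; real data would come from the dashboard or logs.
sample = [
    ("run_command", False),
    ("run_command", False),
    ("run_command", False),
    ("run_command", False),
    ("gitlab_search", False),
    ("gitlab_search", True),
]
counts = errors_per_tool(sample)
print(counts, dominant_tool(counts))
```

A single dominant tool points toward an issue with that tool or its backing integration, while errors spread evenly across tools suggest a broader problem such as a recent deployment.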
Possible Resolutions
- N/A. We don’t have historical data on this alert’s resolutions.
Dependencies
- AI Gateway / Duo Workflow Service
Escalation
- For investigation and resolution assistance, reach out to #g_agent_foundations on Slack.