DuoWorkflowSvcServiceServerErrorSLOViolation
Overview
- This alert fires when the error rate of gRPC requests to the Duo Workflow Service server component exceeds the SLO threshold.
- The alert indicates that the service is experiencing a higher-than-acceptable error rate, which impacts users relying on the Duo Workflow Service.
- This is a user-impacting alert that requires immediate investigation and remediation.
- Possible user impacts:
  - Users will see Duo Agent Platform sessions failing.
Services
- Duo Workflow Service overview
- Team that owns the service: Agent Foundations
Metrics
- The metrics used are `gitlab_component_errors:confidence:ratio_1h` and `gitlab_component_errors:confidence:ratio_6h` for the `server` component of `duo-workflow-svc`.
- These metrics measure the error rate of gRPC requests, expressed as a percentage (0-100%).
- The SLO threshold is a 5% error rate; the alert fires when the error rate exceeds this threshold.
- Link to metric catalogue
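The threshold check described above can be sketched in a few lines of Python. This is an illustration only, not the actual alerting rule: the function names are hypothetical, and the way the 1h and 6h windows are combined (both must be above the threshold) is an assumption that this page does not confirm.

```python
SLO_THRESHOLD_PERCENT = 5.0  # from this runbook: 5% error rate


def exceeds_slo(error_rate_percent: float,
                threshold_percent: float = SLO_THRESHOLD_PERCENT) -> bool:
    """Return True when an error rate (0-100%) is strictly above the threshold."""
    return error_rate_percent > threshold_percent


def alert_would_fire(rate_1h: float, rate_6h: float,
                     threshold_percent: float = SLO_THRESHOLD_PERCENT) -> bool:
    # Assumption: the alert requires BOTH windows to be above the threshold.
    # The real rule may combine the windows differently.
    return (exceeds_slo(rate_1h, threshold_percent)
            and exceeds_slo(rate_6h, threshold_percent))
```

Under that assumption, a short spike that raises only the 1h ratio would not fire the alert, while a sustained error rate above 5% in both windows would.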
Alert Behavior
- To silence the alert, visit the Alert Manager Dashboard.
- This alert is expected to be rare under normal conditions. High frequency indicates systemic issues.
Severities
- This alert creates S2 incidents (High severity, pages on-call).
- All GitLab.com, Self-Managed, and Dedicated customers using Duo Workflow features are potentially impacted, except those using self-hosted DAP.
- Review the Incident Severity Handbook page to identify the required Severity Level.
Verification
- Prometheus link to query that triggered the alert
- Duo Workflow Service Overview Dashboard
- Duo Workflow Service Error Breakdown Dashboard
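If the Prometheus link is unavailable, the same ratio can be looked up manually. The sketch below only builds a query URL for the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL is a placeholder, and the label names `component` and `type` are assumptions about how the recording rule is labelled; check them against the metric catalogue before relying on the query.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; substitute the real Prometheus or Mimir endpoint.
PROMETHEUS_BASE_URL = "https://prometheus.example.com"


def promql_for(window: str) -> str:
    """Build the PromQL selector for the 1h or 6h error ratio."""
    if window not in ("1h", "6h"):
        raise ValueError(f"unsupported window: {window!r}")
    # Assumption: the series is labelled with component and type.
    return (
        f"gitlab_component_errors:confidence:ratio_{window}"
        '{component="server",type="duo-workflow-svc"}'
    )


def build_query_url(window: str, base_url: str = PROMETHEUS_BASE_URL) -> str:
    """Return an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql_for(window)})}"
```

Calling `build_query_url("1h")` and `build_query_url("6h")` gives two URLs that can be opened or fetched to compare the current values against the 5% threshold.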
Recent changes
Troubleshooting
- Identify the scope of errors:
  - Visit the Error Breakdown Dashboard.
  - Use the `flow_type`, `gitlab_version`, `client_type`, and `gitlab_realm` variable dropdowns on the dashboard to narrow down the problem.
- Check service health:
- Check for recent changes:
  - Review the recent changes listed under the Recent changes section.
  - If a recent change caused the issue, consider rolling it back.
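The scoping step above amounts to asking which value of each dimension accounts for most of the errors. The Python sketch below shows that idea on exported error samples. The function name and the sample record shape (one dict per failed request) are hypothetical; only the four dimension names come from this runbook.

```python
from collections import Counter

# Dimension names taken from the dashboard dropdowns in this runbook.
DIMENSIONS = ("flow_type", "gitlab_version", "client_type", "gitlab_realm")


def dominant_values(error_samples, dimensions=DIMENSIONS):
    """For each dimension, return (most common value, its share of all errors).

    error_samples is a list of dicts, one per failed request (hypothetical shape).
    Missing keys are counted under "unknown".
    """
    total = len(error_samples)
    result = {}
    for dim in dimensions:
        counts = Counter(sample.get(dim, "unknown") for sample in error_samples)
        if counts:
            value, count = counts.most_common(1)[0]
            result[dim] = (value, count / total)
    return result
```

A dimension where one value holds a share close to 1.0 while the others are evenly spread is a good first suspect, for example a single `gitlab_version` producing nearly all failures.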
Possible Resolutions
- N/A. There is no historical data yet on how this alert has been resolved.
Dependencies
- GitLab Rails + Postgres DB
- Workhorse
- AI Gateway / Duo Workflow Service
Escalation
- For investigation and resolution assistance, reach out to `#g_agent_foundations` on Slack.