DuoWorkflowSvcServiceCheckpointErrorsErrorSLOViolation
Overview
- This alert fires when the error rate of checkpoint operations exceeds the SLO threshold.
- A checkpoint is the state of a LangGraph graph at a particular point in time. Checkpoints are critical for workflow persistence and recovery.
- Duo Workflow Service makes HTTP requests to the Rails API (/api/v4/ai/duo_workflows/workflows/:id/checkpoints) to fetch and save checkpoints, which are stored in the Postgres DB.
- This alert indicates that the checkpoint system is experiencing higher-than-acceptable failure rates.
- Possible user impacts
- Agentic chat loses context, e.g. it won’t remember previous messages.
- Users cannot resume software development sessions after a pause event such as tool call approval, user input, etc.
- Users cannot resume older sessions.
Services
- Duo Workflow Service overview
- Rails
- Team that owns the service: Agent Foundations
Metrics
- The metrics used are gitlab_component_errors:confidence:ratio_1h and gitlab_component_errors:confidence:ratio_6h for the checkpoint_errors component of duo-workflow-svc.
- This metric measures the error rate of checkpoint operations (4xx and 5xx errors), expressed as a percentage (0-100%).
- The SLO threshold is 5% error rate, meaning the alert fires when errors exceed this threshold.
- Link to metric catalogue
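The arithmetic behind the threshold can be illustrated with a minimal Python sketch. It is not the recording rule behind the metrics above (those rules are not reproduced in this runbook); it only shows how a list of HTTP status codes maps to an error percentage and a comparison against the 5% SLO. The function names `checkpoint_error_rate` and `violates_slo` are hypothetical.

```python
SLO_THRESHOLD_PERCENT = 5.0  # SLO threshold stated in this runbook


def checkpoint_error_rate(status_codes):
    """Return the share of 4xx/5xx responses as a percentage (0-100)."""
    if not status_codes:
        return 0.0  # no traffic: treat as no errors (an assumption)
    errors = sum(1 for code in status_codes if 400 <= code <= 599)
    return 100.0 * errors / len(status_codes)


def violates_slo(status_codes, threshold=SLO_THRESHOLD_PERCENT):
    """True when the error rate strictly exceeds the threshold."""
    return checkpoint_error_rate(status_codes) > threshold
```

For example, 5 failures out of 100 checkpoint requests sit exactly at the threshold and do not exceed it, while 10 failures out of 100 do.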
Alert Behavior
- To silence the alert, please visit Alert Manager Dashboard
- This alert is expected to be rare under normal conditions. High frequency indicates checkpoint storage or persistence issues.
Severities
- This alert creates S2 incidents (High severity, pages on-call).
- All gitlab.com, self-managed and dedicated customers (other than those using self-hosted DAP) using Duo Workflow features are potentially impacted, especially long-running workflows.
- Review the Incident Severity Handbook page to identify the required Severity Level.
Verification
- Prometheus link to query that triggered the alert
- Duo Workflow Service Overview Dashboard
- See “SLI Detail: checkpoint_errors” section in the Duo Workflow Service Overview Dashboard for further information.
- See Rails logs for the checkpoints endpoints (/api/v4/ai/duo_workflows/workflows/:id/checkpoints)
Recent changes
- Recent Duo Workflow Service Production Changes
- Since the error could originate from the GitLab REST API, also see recent changes in the GitLab repository, specifically to this endpoint.
Troubleshooting
- See checkpoint endpoint logs:
  - Visit https://log.gprd.gitlab.net/app/r/s/VCMrK or filter by json.path : *checkpoints in Rails logs in Kibana.
  - Check the json.status field for the HTTP status code.
- Check Duo Workflow Service logs:
  - Get the session IDs from the checkpoint endpoint logs above. The ID appears in json.path.
  - Go to the Runway logs for Duo Workflow Service and filter by jsonPayload.workflow_id.
- Check for recent changes:
  - Review the recent changes listed under the Recent changes section.
  - Check if a recent deployment affected checkpoint handling.
  - If a recent change caused the issue, consider rolling back.
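When many requests are failing, the log triage above can be scripted. The following Python sketch assumes the Rails log entries have been exported as one JSON object per line with top-level `path` and `status` keys; that export shape is an assumption, since Kibana shows these fields as json.path and json.status. The helper name `summarize_checkpoint_logs` is hypothetical. It counts status codes for checkpoint requests and collects the workflow IDs of failing requests, which can then be used to filter the Duo Workflow Service logs.

```python
import json
import re
from collections import Counter

# Matches the checkpoint endpoint and captures the workflow (session) ID.
CHECKPOINT_PATH = re.compile(
    r"/api/v4/ai/duo_workflows/workflows/([^/]+)/checkpoints"
)


def summarize_checkpoint_logs(lines):
    """Return (status_counts, failing_workflow_ids) for checkpoint requests."""
    status_counts = Counter()
    failing_ids = set()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        match = CHECKPOINT_PATH.search(str(entry.get("path", "")))
        if match is None:
            continue  # not a checkpoint request
        status = entry.get("status")
        status_counts[status] += 1
        if isinstance(status, int) and status >= 400:
            failing_ids.add(match.group(1))
    return status_counts, sorted(failing_ids)
```

The IDs are returned as strings in lexicographic order, so they can be pasted directly into a jsonPayload.workflow_id filter.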
Possible Resolutions
- N.A. We don’t have historical data on this alert’s resolutions.
Dependencies
- GitLab Rails + Postgres DB
- Workhorse
- AI Gateway / Duo Workflow Service
Escalation
- For investigation and resolution assistance, reach out to #g_agent_foundations on Slack.