DuoWorkflowSvcServiceCheckpointErrorsErrorSLOViolation

  • This alert fires when the error rate of checkpoint operations exceeds the SLO threshold.
  • A checkpoint is the state of a LangGraph graph at a particular point in time. Checkpoints are critical for workflow persistence and recovery.
  • Duo Workflow Service makes HTTP requests to the Rails API (/api/v4/ai/duo_workflows/workflows/:id/checkpoints) to fetch and save checkpoints, which are stored in the Postgres DB.
  • This alert indicates that the checkpoint system is experiencing higher-than-acceptable failure rates.
  • Possible user impacts
    • Agentic chat loses context and does not remember previous messages in the conversation.
    • Users cannot resume software development sessions after a pause event such as a tool call approval or a request for user input.
    • Users cannot resume older sessions.
  • The metrics used are gitlab_component_errors:confidence:ratio_1h and gitlab_component_errors:confidence:ratio_6h for the checkpoint_errors component of duo-workflow-svc.
  • These metrics measure the error rate of checkpoint operations (4xx and 5xx errors), expressed as a percentage (0-100%).
  • The SLO threshold is a 5% error rate; the alert fires when the error rate exceeds it.
  • Link to metric catalogue
  • To silence the alert, visit the Alert Manager Dashboard.
  • This alert is expected to be rare under normal conditions. High frequency indicates checkpoint storage or persistence issues.
  • This alert creates S2 incidents (High severity, pages on-call).
  • All GitLab.com, Self-Managed, and Dedicated customers using Duo Workflow features (other than those using self-hosted DAP) are potentially impacted, especially those running long-running workflows.
  • Review the Incident Severity Handbook page to identify the required Severity Level.
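
The error-rate definition above (4xx and 5xx responses counted as errors, compared against a 5% threshold) can be sketched in a few lines of Python. This is only an illustration of the arithmetic. It is not how the alert is actually evaluated: the real alert is driven by the gitlab_component_errors:confidence:ratio_1h and ratio_6h recording rules, whose confidence adjustment and window logic are not reproduced here. The function names and the sample status codes are hypothetical.

```python
from typing import Iterable

# From the runbook text: the SLO threshold is a 5% error rate.
SLO_ERROR_THRESHOLD_PERCENT = 5.0


def is_checkpoint_error(status_code: int) -> bool:
    """Treat 4xx and 5xx responses as errors, as the metric description states."""
    return 400 <= status_code <= 599


def error_rate_percent(status_codes: Iterable[int]) -> float:
    """Return the share of error responses as a percentage (0-100)."""
    codes = list(status_codes)
    if not codes:
        return 0.0  # assumption: no traffic is treated as no errors
    errors = sum(1 for code in codes if is_checkpoint_error(code))
    return 100.0 * errors / len(codes)


def violates_slo(
    status_codes: Iterable[int],
    threshold_percent: float = SLO_ERROR_THRESHOLD_PERCENT,
) -> bool:
    """True when the error rate strictly exceeds the threshold."""
    return error_rate_percent(status_codes) > threshold_percent


if __name__ == "__main__":
    # Hypothetical sample: 94 successful checkpoint requests and 6 failures.
    sample = [200] * 94 + [500] * 4 + [404] * 2
    print(error_rate_percent(sample), violates_slo(sample))
```

Because 4xx responses count as errors, a burst of client-side failures (for example, requests for checkpoints of workflows that do not exist) can push the rate over the threshold even when the Rails API and the database are healthy, so it is worth checking the status code breakdown in the logs before assuming a storage problem.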
  1. See checkpoint endpoint logs:

  2. Check Duo Workflow Service logs:

  3. Check for recent changes:

    • Review the changes listed under the Recent changes section.
    • Check if a recent deployment affected checkpoint handling.
    • If a recent change caused the issue, consider rolling back.
  • N/A. We don’t have historical data on how this alert has been resolved.
  • GitLab Rails + Postgres DB
  • Workhorse
  • AI Gateway / Duo Workflow Service
  • For investigation and resolution assistance, reach out to #g_agent_foundations on Slack.