Duo Workflow Service
- Service Overview
- Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22duo-workflow-svc%22%2C%20tier%3D%22sv%22%7D
- Label: gitlab-com/gl-infra/production~“Service::DuoWorkflow”
Logging
Log Filtering Tips
Service Logs
- In gitlab-runway-production or gitlab-runway-staging projects
- Use filter: resource.labels.service_name="duo-workflow-svc"
Billing Events (Staging)
- Successful workflows in staging send billing events with token consumption data to:
  - Endpoint: https://billing.stgsub.gitlab.net
- To check billing-related logs:
  - Filter: jsonPayload.logger="workflow_checkpointer"
  - Staging logs URL: Workflow Checkpointer Logs
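The filters above can be combined and narrowed to the time of a report before pasting them into the Logs Explorer. The following is a minimal sketch, not part of the service: `build_log_filter` and its parameters are hypothetical names, and combining the service-name clause with the logger clause is an assumption rather than something the runbook prescribes. It relies only on standard Cloud Logging filter syntax (`AND`, quoted values, RFC 3339 `timestamp` comparisons).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_log_filter(
    service_name: str = "duo-workflow-svc",
    logger: Optional[str] = None,
    around: Optional[datetime] = None,
    window_minutes: int = 30,
) -> str:
    """Return a Cloud Logging filter string (hypothetical helper)."""
    clauses = [f'resource.labels.service_name="{service_name}"']
    if logger is not None:
        # e.g. "workflow_checkpointer" for billing-related logs
        clauses.append(f'jsonPayload.logger="{logger}"')
    if around is not None:
        # Pass a timezone-aware datetime; it is converted to UTC and
        # bracketed by window_minutes on both sides.
        around = around.astimezone(timezone.utc)
        delta = timedelta(minutes=window_minutes)
        start = (around - delta).strftime("%Y-%m-%dT%H:%M:%SZ")
        end = (around + delta).strftime("%Y-%m-%dT%H:%M:%SZ")
        clauses.append(f'timestamp>="{start}"')
        clauses.append(f'timestamp<="{end}"')
    return " AND ".join(clauses)


if __name__ == "__main__":
    reported = datetime(2024, 9, 16, 1, 0, tzinfo=timezone.utc)
    print(build_log_filter(logger="workflow_checkpointer", around=reported))
```

The printed string can be pasted into the query box of the relevant project; drop the `logger` argument to search all service logs.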
Quick Links
- Logs
- Langsmith traces
- Sentry error tracking
- Grafana Service Overview
- Grafana Error Breakdown Dashboard
- Log based dashboard
Before starting the investigation
NOTE: Do NOT expose a customer’s RED data in public issues. Redact it or create a confidential issue if you’re unsure.
Before starting the investigation, please collect the following information:
- GitLab username for the user that encountered the bug (e.g. @johndoe)
- What happened (e.g. User asked a question in the Flows tab in the VS Code extension and the agent platform did not respond)
- When it happened (e.g. Around 2024/09/16 01:00 UTC)
- Is it happening on .com, self-managed, or dedicated instances? If self-managed or dedicated, which GitLab version are they using?
- GitLab Workflow VS Code extension version (e.g. v.6.26.1) if applicable.
- If they use the VS Code extension, ask whether it communicates over a gRPC connection or webhooks.
- Are there executor logs?
- For VSCode -> Command + P -> Show Extension Logs -> Choose GitLab Language Server from dropdown
- For flows running in CI -> CI job logs
- What is the flow type? (chat for agentic chat, software_development if it’s the Flows tab, issue to MR, a custom flow, etc.)
- How often it happens (e.g. It happens every time)
- Steps to reproduce (e.g. 1. Ask a question “xxx” 2. Click …)
- AI Gateway or self-managed AI Gateway (If they use custom models, it’s likely the latter.)
- A link to a Slack discussion, if any.
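To see at a glance which of the items above are still missing from a report, the checklist can be modelled as a small record. This is an illustrative sketch only: `IntakeReport`, its field names, and the rule that the GitLab version is not required for .com are assumptions made for the example, not an existing GitLab tool or schema, and only a subset of the checklist is covered.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IntakeReport:
    """Hypothetical container for part of the intake checklist."""
    username: Optional[str] = None            # e.g. "@johndoe"
    what_happened: Optional[str] = None
    when_utc: Optional[str] = None            # e.g. "2024/09/16 01:00 UTC"
    deployment: Optional[str] = None          # ".com", "self-managed" or "dedicated"
    gitlab_version: Optional[str] = None      # asked only for self-managed/dedicated
    flow_type: Optional[str] = None           # e.g. "chat", "software_development"
    frequency: Optional[str] = None
    steps_to_reproduce: Optional[str] = None

    def missing(self) -> List[str]:
        """Return the names of fields that still need to be collected."""
        out = []
        for f in fields(self):
            if getattr(self, f.name):
                continue
            # Assumption: the version question only applies off .com.
            if f.name == "gitlab_version" and self.deployment == ".com":
                continue
            out.append(f.name)
        return out


if __name__ == "__main__":
    report = IntakeReport(username="@johndoe", deployment=".com", flow_type="chat")
    print(report.missing())
```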
Workflows fail to start (startup connection failures)
WARNING: Quiet dashboards do not rule this out. Startup failures can happen on the runner/executor side, before requests are visible in the normal Duo Workflow Service dashboards, so those dashboards may look normal even during widespread startup failures. If there are multiple “fails to start” reports, check logs before trusting the dashboards.
Where to look:
- CI job logs — primary source; open them directly from a failing session/pipeline link.
- Executor logs (local IDE flows) — the EOC can’t pull these; ask the reporter to capture them (see the intake checklist above).
Common fingerprints: WebSocketWorkflowClient, 1006, Failed to connect, WebSocket error.
These are examples, not an exhaustive list.
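When several CI job logs need checking, the fingerprint search can be scripted. The sketch below is a plain substring scan under stated assumptions: `find_fingerprints` is a hypothetical helper, the fingerprint tuple is copied from the list above and is just as non-exhaustive, and the sample log line is made up for the demonstration.

```python
from typing import Dict, List

# Copied from the runbook's examples; extend as new fingerprints are found.
FINGERPRINTS = (
    "WebSocketWorkflowClient",
    "1006",
    "Failed to connect",
    "WebSocket error",
)


def find_fingerprints(log_text: str) -> Dict[str, List[int]]:
    """Map each fingerprint to the 1-based log line numbers containing it."""
    hits: Dict[str, List[int]] = {}
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for fingerprint in FINGERPRINTS:
            if fingerprint in line:
                hits.setdefault(fingerprint, []).append(lineno)
    return hits


if __name__ == "__main__":
    # Made-up log excerpt, for illustration only.
    sample = "starting job\nWebSocketWorkflowClient: WebSocket error, code 1006\ndone"
    print(find_fingerprints(sample))
```

Because `1006` is matched as a bare substring, it can also hit unrelated numbers such as IDs or timestamps, so treat a lone `1006` match as a hint and read the surrounding lines.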
Default vs. custom config: if only default setups are affected while projects with custom
agent-config.yml or custom images still work, suspect the startup/connection path rather than a
broad service failure.
Escalate to the Agent Foundations team (#g_agent_foundations on Slack) with failing session
links and CI/executor logs.
Summary
The Duo Workflow Service is a Python service that manages and executes Duo Agent Platform sessions using LangGraph. Within AI-Gateway, it handles communication between the user interface, the LLM provider, and the executors, while maintaining workflow state through periodic checkpoints saved to GitLab. This service provides the intelligence layer that interprets user goals, plans execution steps, processes LLM responses, and orchestrates the necessary commands to complete tasks, all while maintaining a secure boundary between untrusted code execution and the core GitLab infrastructure.
Architecture
See the design document at https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/duo_workflow/
Scalability
Duo Workflow Service autoscales with traffic. To scale manually, update runway-production.yml based on the documentation.
It is also possible to directly edit the tunables for the duo-workflow-svc service via the Cloud Run console’s Edit YAML interface. This takes effect faster, but be sure to make the equivalent updates to the runway-production.yml as described above; otherwise the next deploy will revert your manual changes to the service YAML.
Monitoring/Alerting
Duo Workflow Service uses both custom metrics scraped from the application and default metrics provided by Runway. Alerts based on these metrics are routed to g_duo_agent_platform_prometheus_alerts in Slack. To route them to a different channel, refer to the documentation.
Currently, error logs from Sentry also trigger alerts. These alerts are directed to g_duo_workflow_alerts in Slack.