CiRunnersServiceQueuingQueriesDurationApdexSLOViolation
Overview
This alert indicates that the CI Runners service is experiencing slower-than-expected queuing query response times, violating the defined Service Level Objective (SLO) for job scheduling performance.
Services
Quick Links
Contributing Factors
- High volume of concurrent CI job requests
- Database performance issues
- Runner manager capacity constraints
- Resource exhaustion in the runner fleet
- Runner manager unable to spin up ephemeral VMs
Affected Components
- CI Runner job scheduling system
- Runner managers
- Database queries related to job queuing
- CI/CD pipeline execution times
Expected Action
Investigate the cause of increased queuing duration and take appropriate action to restore normal service performance.
Metrics
- Metric: Duration of queuing-related queries for CI runners
- Unit: Milliseconds
- Normal Behavior: Query duration should remain below the Apdex threshold
- Threshold Reasoning: Based on historical performance data and user experience requirements
Alert Behavior
- Silencing: Can be silenced temporarily during planned maintenance
- Expected Frequency: Medium - may trigger during peak usage periods
- Historical Trends: Check CI Runner alerts dashboard
Severities
- The incident severity can range from Sev3 to Sev1 depending on the specific shard affected.
Impact Assessment
- Affects all GitLab.com users trying to run CI jobs.
- May cause delays in CI/CD pipeline execution.
- Could affect both public and private projects.
Severity Checks
- Check the number of affected jobs in the queue (see the query sketch after this list).
- Verify impact on pipeline completion times.
- Monitor error rates in job scheduling.
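If you have read access to the production database, a rough count of queued jobs can be taken directly. The sketch below assumes the pending-build queue is backed by a ci_pending_builds table with a created_at column; table and column names may differ between GitLab schema versions.
-- Approximate size and age of the CI job queue (assumes a ci_pending_builds table).
SELECT count(*)                AS pending_jobs,
       min(created_at)         AS oldest_entry,
       now() - min(created_at) AS max_queue_age
FROM ci_pending_builds;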
Verification
- Check the GitLab.com hosted runners logs.
- Review runner manager metrics.
- Monitor database performance metrics.
Troubleshooting
Basic Steps
- Check for a recent surge in CI job creation (see the query sketch after this list).
- Verify runner manager health.
- Review Patroni performance metrics.
- Check if GitLab.com usage has outgrown its surge capacity.
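To quantify a surge in job creation, a per-minute count of newly created builds over the last hour is a useful starting point. This is a sketch only; it assumes builds live in a ci_builds table with a created_at column (newer schemas partition the builds tables, so the name may differ).
-- Builds created per minute over the last hour (assumes a ci_builds table; adjust the name to your schema).
SELECT date_trunc('minute', created_at) AS minute,
       count(*)                         AS builds_created
FROM ci_builds
WHERE created_at > now() - interval '1 hour'
GROUP BY 1
ORDER BY 1 DESC;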
Additional Checks
- Review scheduled pipeline timing conflicts.
- Verify runner pool capacity.
- Check for stuck jobs (see the query sketch after this list).
- Check for dead-tuple-related issues (see the dedicated section below).
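One quick way to spot stuck jobs is to look for builds that have been pending far longer than the normal queue time. A minimal sketch, again assuming a ci_builds table with a status column; adjust names to your schema.
-- Builds still pending after 30 minutes (assumes a ci_builds table with a status column).
SELECT id, project_id, status, created_at, now() - created_at AS pending_for
FROM ci_builds
WHERE status = 'pending'
  AND created_at < now() - interval '30 minutes'
ORDER BY created_at
LIMIT 20;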
Possible Resolutions
- Scale up runner manager capacity.
- Optimize database queries.
- Block abusive users/projects (see the query sketch after this list for identifying the heaviest projects).
- Adjust job scheduling algorithms.
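Before blocking anyone, identify which projects dominate the queue. A hedged sketch, assuming the same ci_pending_builds table as in the earlier queue-size query.
-- Projects with the most entries in the pending-build queue (assumes ci_pending_builds).
SELECT project_id, count(*) AS queued_builds
FROM ci_pending_builds
GROUP BY project_id
ORDER BY queued_builds DESC
LIMIT 10;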
Check for dead-tuple-related performance issues
During reindexing operations, dead tuples can accumulate and degrade query performance, because the long-running reindex transaction holds back the vacuum horizon and prevents autovacuum from cleaning them up.
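To confirm that dead tuples are actually accumulating on the hot queuing tables, query the standard pg_stat_user_tables statistics view; a minimal sketch:
-- Tables with the highest dead-tuple counts and their most recent autovacuum runs.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;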
How to Check Ongoing Reindexing Operations
Use the following SQL query to identify reindexing operations causing long query durations:
SELECT now(),
       now() - query_start AS query_age,
       now() - xact_start AS xact_age,
       pid, backend_type, state, client_addr,
       wait_event_type, wait_event,
       xact_start, query_start, state_change, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND backend_type != 'autovacuum worker'
  AND xact_start < now() - '60 seconds'::interval
ORDER BY xact_age DESC NULLS LAST;
How to Cancel Reindexing and Resume Deadtuple Cleanup
Use the pg_cancel_backend() function to cancel the ongoing reindexing operation, using the pid from the query above.
SELECT pg_cancel_backend(1641690);
Once canceled, you should see immediate relief in the gitlab_ci_queue_retrieval_duration_seconds_bucket metric, and the SLI should recover.
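To confirm the cancellation took effect, re-check pg_stat_activity for the same example pid used above. Note that pg_cancel_backend() cancels the running query but leaves the session connected, so the backend may still appear, now idle.
-- The session may remain connected after pg_cancel_backend(); it should now be idle
-- instead of running the long reindex query (1641690 is the example pid from above).
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE pid = 1641690;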
Recent changes
- Recent CI runners Production Change/Incident Issues
- Recent chef-repo Changes
- Recent k8s-workloads Changes
Recent incidents
- CiRunnersServiceQueuingQueriesDurationApdexSLOViolation
- The queuing_queries_duration SLI of the ci-runners service (cny stage) has an apdex violating SLO
- CiRunnersServiceQueuingQueriesDurationApdexSLOViolation
Dependencies
- PostgreSQL database
- Runner manager VMs
- Internal load balancers
- GCP infrastructure
Escalation
When to Escalate
- Alert persists for >30 minutes.
- Multiple runner shards affected.
- Significant impact on pipeline completion times.
Support Channels
- #production Slack channel
- #g_hosted_runners Slack channel
- #g_runner Slack channel
- #f_hosted_runners_on_linux Slack channel
Definitions
- Alert Definition
- Tuning Considerations: Adjust thresholds based on peak usage patterns and user feedback.