
CiRunnersServiceQueuingQueriesDurationApdexSLOViolation

This alert indicates that the CI Runners service is experiencing slower-than-expected queuing query response times, violating the defined Service Level Objective (SLO) for job scheduling performance.




Possible causes:

  • High volume of concurrent CI job requests
  • Database performance issues
  • Runner manager capacity constraints
  • Resource exhaustion in the runner fleet
  • Runner manager unable to spin up ephemeral VMs

Affected components:

  • CI Runner job scheduling system
  • Runner managers
  • Database queries related to job queuing
  • CI/CD pipeline execution times

Investigate the cause of increased queuing duration and take appropriate action to restore normal service performance.

Metrics Catalog

  • Metric: Duration of queuing-related queries for CI runners
  • Unit: Milliseconds
  • Normal Behavior: Query duration should remain below the Apdex threshold
  • Threshold Reasoning: Based on historical performance data and user experience requirements

  • Silencing: Can be silenced temporarily during planned maintenance
  • Expected Frequency: Medium - may trigger during peak usage periods
  • Historical Trends: Check CI Runner alerts dashboard

  • The incident severity can range from Sev3 to Sev1 depending on the specific shard affected.
  • Affects all GitLab.com users trying to run CI jobs.
  • May cause delays in CI/CD pipeline execution.
  • Could affect both public and private projects.
To verify the impact:

  1. Check the number of affected jobs in the queue (see the query sketch after this list).
  2. Verify impact on pipeline completion times.
  3. Monitor error rates in job scheduling.
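
As a rough first check for step 1 from a database console, here is a minimal sketch; it assumes the ci_builds table with status and queued_at columns (names may differ between GitLab versions) and should be run against a read-only replica:

-- Count jobs currently waiting to be picked up and the oldest queue entry
SELECT
  count(*) AS pending_jobs,
  min(queued_at) AS oldest_queued_at
FROM ci_builds
WHERE status = 'pending';

A rapidly growing pending_jobs count combined with an old oldest_queued_at suggests jobs are arriving faster than runners can pick them up.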


Troubleshooting steps:

  1. Check for a recent surge in CI job creation (see the query sketch after this list).
  2. Verify runner manager health.
  3. Review Patroni performance metrics.
  4. Check if GitLab.com usage has outgrown its surge capacity.

  • Review scheduled pipeline timing conflicts.
  • Verify runner pool capacity.
  • Check for stuck jobs.
  • Check for dead tuple-related issues (see the section below).
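
To check for a job-creation surge (step 1), a hedged sketch that counts builds created per minute over the last 30 minutes; it assumes the ci_builds.created_at column and should be run against a read-only replica:

-- Jobs created per minute over the last 30 minutes
SELECT
  date_trunc('minute', created_at) AS minute,
  count(*) AS jobs_created
FROM ci_builds
WHERE created_at > now() - interval '30 minutes'
GROUP BY 1
ORDER BY 1;

A sharp jump in jobs_created relative to the preceding minutes points to a surge (for example, a large scheduled pipeline fan-out or abusive usage) rather than a database-side slowdown.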

Possible resolutions:

  1. Scale up runner manager capacity.
  2. Optimize database queries.
  3. Block abusive users/projects (see the query sketch below for identifying the heaviest projects).
  4. Adjust job scheduling algorithms.
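
If blocking abusive users or projects is being considered (resolution 3), first identify the heaviest contributors to the queue. This is a sketch assuming the ci_pending_builds queue table with a project_id column; verify the table and the projects' ownership before taking any action:

-- Projects with the most builds currently sitting in the pending queue
SELECT
  project_id,
  count(*) AS pending_builds
FROM ci_pending_builds
GROUP BY project_id
ORDER BY pending_builds DESC
LIMIT 10;
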
Check for dead tuple-related performance issues

During reindexing operations, dead tuples may accumulate and degrade query performance.
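
Before inspecting individual backends, it can help to confirm that dead tuples are actually accumulating. A minimal sketch using the standard pg_stat_user_tables view (no GitLab-specific assumptions beyond table naming):

-- Tables with the most dead tuples and when they were last vacuumed
SELECT
  relname,
  n_dead_tup,
  n_live_tup,
  last_autovacuum,
  last_vacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;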


How to Check Ongoing Reindexing Operations


Use the following SQL query to identify reindexing operations causing long query durations:

-- Long-running (> 60 s) non-idle backends, excluding autovacuum workers;
-- a reindexing session will appear here with its query text and pid.
SELECT
  now(),
  now() - query_start AS query_age,
  now() - xact_start AS xact_age,
  pid,
  backend_type,
  state,
  client_addr,
  wait_event_type,
  wait_event,
  xact_start,
  query_start,
  state_change,
  query
FROM pg_stat_activity
WHERE
  state != 'idle'
  AND backend_type != 'autovacuum worker'
  AND xact_start < now() - '60 seconds'::interval
ORDER BY xact_age DESC NULLS LAST;

How to Cancel Reindexing and Resume Dead Tuple Cleanup


Cancel the ongoing reindexing operation with the pg_cancel_backend() function, passing the pid identified by the query above.

SELECT pg_cancel_backend(1641690);
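
To confirm the backend is actually gone, re-check pg_stat_activity for the same pid (the pid below is the example value from above; substitute the real one):

-- Should return zero rows once the reindexing backend has exited
SELECT pid, state, wait_event, query
FROM pg_stat_activity
WHERE pid = 1641690;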

Once canceled, you should see immediate relief in the gitlab_ci_queue_retrieval_duration_seconds_bucket metric.


The SLI should then recover as well.



Dependencies:

  • PostgreSQL database
  • Runner manager VMs
  • Internal load balancers
  • GCP infrastructure

Escalate if:

  • Alert persists for >30 minutes.
  • Multiple runner shards are affected.
  • There is significant impact on pipeline completion times.

Escalation channels:

  • #production Slack channel
  • #g_hosted_runners Slack channel
  • #g_runner Slack channel
  • #f_hosted_runners_on_linux Slack channel

  • Alert Definition
  • Tuning Considerations: Adjust thresholds based on peak usage patterns and user feedback.