
`sidekiq_queueing` apdex violation

This alert is triggered when jobs are picked up by a Sidekiq worker later than the target set by the worker's urgency.

  1. high-urgency workloads need to start execution within 5s of being scheduled
  2. low-urgency workloads need to start execution within 5m of being scheduled

An alert will fire if more than 0.1% of jobs don’t start within their set target.

  1. Check inflight workers for a specific shard: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq3a-shard-detail?orgId=1&viewPanel=11
    • A specific worker might be running a large number of jobs.
  2. Check started jobs for a specific queue: https://log.gprd.gitlab.net/app/r/s/v28cQ
    • A specific worker might be enqueueing a lot of jobs.
  3. Check job duration (execution latency): https://log.gprd.gitlab.net/app/r/s/oZnYz
    • We might be finishing jobs more slowly, causing the queue to build up.
  4. Throughput: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq3a-shard-detail?orgId=1&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-stage=main&var-shard=catchall&viewPanel=panel-17&from=now-6h/m&to=now/m&timezone=utc
    • If there is a sharp drop for a specific worker, it might have slowed down.
    • If there is a sharp increase for a specific worker, it is saturating the queue.
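
In addition to the dashboards above, you can inspect queue sizes and queueing latency directly from a Rails console node. This is a minimal sketch using Sidekiq's standard `Sidekiq::Queue` API; the output format is illustrative:

```sh
# On a Rails console node: print the size and queueing latency (seconds since the
# oldest job was enqueued) of every Sidekiq queue, largest first.
sudo gitlab-rails runner '
  Sidekiq::Queue.all.sort_by(&:size).reverse.each do |q|
    puts format("%-40s size=%-8d latency=%.1fs", q.name, q.size, q.latency)
  end
'
```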

You can increase the maxReplicas for the specific shard.
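
Before doing so, it can help to check the current HPA limits for the shard. This is a sketch only; the namespace and HPA names are assumptions and may differ per cluster, and the actual maxReplicas change is normally made through the Kubernetes workloads configuration rather than with kubectl:

```sh
# List the Sidekiq HPAs with their current/min/max replica counts.
# The namespace and grep pattern are assumptions; adjust for the cluster you are on.
kubectl --namespace gitlab get hpa | grep sidekiq

# Inspect one shard's HPA in detail (the HPA name here is illustrative).
kubectl --namespace gitlab describe hpa gitlab-sidekiq-catchall-v2
```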

Things to keep in mind:

  1. Running more concurrent jobs might add more pressure to downstream services (Database, Gitaly, Redis).
  2. Check whether it makes sense to increase capacity; the bottleneck could be elsewhere, most likely a connection pool being saturated.
  3. Check if this was a sudden spike or if it’s sustained load.

It could be that this is a new worker that recently started running, ideally behind a feature flag that we can turn off.
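
If the worker is behind a feature flag, the flag can be disabled via ChatOps or from a Rails console. A sketch, with a placeholder flag name:

```sh
# Via ChatOps; the flag name is a placeholder for the worker's actual flag.
/chatops run feature set some_new_worker_flag false

# Or from a Rails console node.
sudo gitlab-rails runner 'Feature.disable(:some_new_worker_flag)'
```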

Drop all jobs only if you are sure that dropping them is safe and won't leave the application in a weird state.
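
If you have confirmed it is safe, a whole queue can be cleared from a Rails console node with Sidekiq's standard API. The queue name below is illustrative; double-check you are targeting the right one:

```sh
# Drops every job currently sitting in the named queue; these jobs are lost, not retried.
sudo gitlab-rails runner 'Sidekiq::Queue.new("mailers").clear'
```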

If the queue is mostly mailers and is in the many tens to hundreds of thousands of jobs, it is possible we have a spam/junk problem. If so, refer to the abuse team for assistance, and see https://gitlab.com/gitlab-com/runbooks/snippets/1923045 for some spam-fighting techniques we have used in the past to clean up. This is kept in a private snippet so as not to tip our hand to the miscreants. The spam often shows up in our public GitLab projects but could plausibly appear in any other project as well.

`sq` is a command-line tool that you can run to view the state of Sidekiq and kill certain workers. To use it, first download a copy:

```sh
curl -o /tmp/sq.rb https://gitlab.com/gitlab-com/runbooks/raw/master/scripts/sidekiq/sq.rb
```

To display a breakdown of all the workers, run:

```sh
sudo gitlab-rails runner /tmp/sq.rb
```

Remove jobs with certain metadata from a queue (e.g. all jobs from a certain user)


We currently track metadata in Sidekiq jobs, which allows us to remove jobs based on that metadata.

Useful attributes for removing jobs from a queue are `root_namespace`, `project`, and `user`. The admin Sidekiq queues API can be used to remove jobs from queues based on these metadata values.

For instance:

```sh
curl --request DELETE --header "Private-Token: $GITLAB_API_TOKEN_ADMIN" "https://gitlab.com/api/v4/admin/sidekiq/queues/post_receive?user=reprazent&project=gitlab-org/gitlab"
```

This will delete all jobs from `post_receive` triggered by the user `reprazent` for the project `gitlab-org/gitlab`.

Check the output of each call:

  1. It will report how many jobs were deleted. A count of 0 may mean your conditions (queue, user, project, etc.) do not match anything.
  2. This API endpoint is bound by the HTTP request time limit, so it will delete as many jobs as it can before terminating. If the `completed` key in the response is false, the whole queue was not processed, so you can run the same command again to remove further jobs (see the loop sketched below).
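
Because a single call can stop before the queue is fully processed, it can be convenient to loop until the response reports completion. A sketch assuming `jq` is installed, reusing the illustrative query parameters from the example above:

```sh
# Repeat the delete request until the API reports the whole queue was processed.
# Assumes GITLAB_API_TOKEN_ADMIN is exported; the user/project values are the example ones above.
completed="false"
while [ "$completed" != "true" ]; do
  response=$(curl --silent --request DELETE \
    --header "Private-Token: $GITLAB_API_TOKEN_ADMIN" \
    "https://gitlab.com/api/v4/admin/sidekiq/queues/post_receive?user=reprazent&project=gitlab-org/gitlab")
  echo "$response"
  completed=$(echo "$response" | jq -r '.completed')
  sleep 1
done
```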