`sidekiq_queueing` apdex violation
Summary
This alert is triggered when jobs are picked up by a Sidekiq worker later than the target set based on the urgency of the worker:
- `high`-urgency workloads need to start execution within 5s of scheduling
- `low`-urgency workloads need to start execution within 5m of scheduling
An alert will fire if more than 0.1% of jobs don’t start within their set target.
Debugging
- Check inflight workers for a specific shard: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq3a-shard-detail?orgId=1&viewPanel=11
  - A specific worker might be running a large number of jobs.
- Check started jobs for a specific queue: https://log.gprd.gitlab.net/app/r/s/v28cQ
  - A specific worker might be enqueueing a lot of jobs.
- Latency of job duration: https://log.gprd.gitlab.net/app/r/s/oZnYz
  - We might be finishing jobs more slowly, causing the queue to build up.
- Throughput: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq3a-shard-detail?orgId=1&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-stage=main&var-shard=catchall&viewPanel=panel-17&from=now-6h/m&to=now/m&timezone=utc
  - If there is a sharp drop for a specific worker, it might have slowed down.
  - If there is a sharp increase for a specific worker, it is saturating the queue.
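If the dashboards are inconclusive, you can also take a quick snapshot of queue sizes and latencies from a Rails console node. This is a minimal sketch using Sidekiq's standard Ruby queue API; the output format is only illustrative:
sudo gitlab-rails runner '
  # Print every queue with its current size and latency (age of the oldest job, in seconds).
  Sidekiq::Queue.all.each do |q|
    puts format("%-30s size=%-8d latency=%.1fs", q.name, q.size, q.latency)
  end
'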
Resolution
Increase Capacity
You can increase the `maxReplicas` for the specific shard.
Things to keep in mind:
- If we run more concurrent jobs, it might add more pressure to downstream services (Database, Gitaly, Redis).
- Check whether it makes sense to increase capacity; the bottleneck could be elsewhere, most likely a saturated connection pool.
- Check if this was a sudden spike or if it’s sustained load.
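Before bumping maxReplicas, it can help to confirm whether the shard is already pinned at its current maximum. A hedged sketch, assuming you have kubectl access to the relevant cluster and that the HPA name contains the shard name (names vary by environment):
# List the Sidekiq HPAs and compare REPLICAS against MAXPODS; if they are equal,
# the shard is already running at maxReplicas.
kubectl get hpa --all-namespaces | grep -i sidekiq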
New Worker
It could be that this is a new worker that has just started running, hopefully behind a feature flag that we can turn off.
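If you can identify such a flag, it can be disabled from a Rails console node. A minimal sketch; the flag name below is a placeholder, not a real feature flag:
sudo gitlab-rails runner '
  # Replace :some_new_worker_flag with the feature flag that guards the new worker.
  Feature.disable(:some_new_worker_flag)
'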
Drop worker jobs
Drop all jobs for the worker. Be sure that dropping the jobs is safe and won't leave the application in an inconsistent state.
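A hedged sketch of dropping only the queued jobs of a single worker class, using Sidekiq's standard Ruby API; the queue and worker class names below are placeholders:
sudo gitlab-rails runner '
  # Delete queued jobs belonging to one worker class.
  # Sidekiq::Queue.new("default").clear would instead drop the whole queue.
  queue = Sidekiq::Queue.new("default")
  queue.each { |job| job.delete if job.klass == "SomeExpensiveWorker" }
'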
Mail queue
If the queue is all in `mailers` and is in the many tens to hundreds of thousands, it is possible we have a spam/junk issue. If so, refer to the abuse team for assistance, and see https://gitlab.com/gitlab-com/runbooks/snippets/1923045 for some spam-fighting techniques we have used in the past to clean up. This is kept in a private snippet so as not to tip our hand to the miscreants. This often shows up in our public GitLab projects, but could plausibly affect any other project as well.
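To confirm the size of the mail backlog before escalating, you can check the queue directly from a Rails console node (a minimal sketch using Sidekiq's standard API):
sudo gitlab-rails runner 'puts Sidekiq::Queue.new("mailers").size'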
Get queues using sq.rb script
sq is a command-line tool that you can run to assist you in viewing the state of Sidekiq and killing certain workers. To use it, first download a copy:
curl -o /tmp/sq.rb https://gitlab.com/gitlab-com/runbooks/raw/master/scripts/sidekiq/sq.rb
To display a breakdown of all the workers, run:
sudo gitlab-rails runner /tmp/sq.rb
Remove jobs with certain metadata from a queue (e.g. all jobs from a certain user)
We currently track metadata in Sidekiq jobs, which allows us to remove Sidekiq jobs based on that metadata.
Interesting attributes for removing jobs from a queue are `root_namespace`, `project`, and `user`. The admin Sidekiq queues API can be used to remove jobs from queues based on these metadata values.
For instance:
curl --request DELETE --header "Private-Token: $GITLAB_API_TOKEN_ADMIN" https://gitlab.com/api/v4/admin/sidekiq/queues/post_receive?user=reprazent&project=gitlab-org/gitlab
This will delete all jobs from `post_receive` triggered by a user with username `reprazent` for the project `gitlab-org/gitlab`.
Check the output of each call:
- It will report how many jobs were deleted. A count of 0 may mean your conditions (queue, user, project, etc.) do not match anything.
- This API endpoint is bound by the HTTP request time limit, so it will delete as many jobs as it can before terminating. If the `completed` key in the response is `false`, then the whole queue was not processed, so we can try again with the same command to remove further jobs.
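Because of the time limit, clearing a large queue may take several passes. A hedged sketch that repeats the same DELETE until the response reports completed: true, reusing the example filter from above (assumes jq is installed):
until curl --silent --request DELETE --header "Private-Token: $GITLAB_API_TOKEN_ADMIN" \
  "https://gitlab.com/api/v4/admin/sidekiq/queues/post_receive?user=reprazent&project=gitlab-org/gitlab" \
  | jq --exit-status '.completed' > /dev/null; do
  sleep 1  # retry until the whole queue has been processed
done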