Title: SidekiqQueueTooLarge
Overview
- What does this alert mean? This alert indicates that the Sidekiq queue has exceeded a predefined size threshold. This signifies a backlog of jobs waiting to be processed.
- What factors can contribute?
- Sudden Traffic Spikes: Unexpected surges in workload can overwhelm Sidekiq’s processing capacity, causing a queue buildup.
- Slow Workers: Inefficient jobs or external dependencies causing slow processing can lead to task pile-up.
- Configuration Issues: Limited Sidekiq worker processes might not be able to keep up with incoming jobs.
- Database Interactions: Inefficient database queries or slow database performance can significantly impact task processing speed, leading to queue growth.
- What parts of the service are affected?
- Background Processing: All background jobs managed by Sidekiq will experience delays.
- Time-sensitive Jobs: These might be significantly impacted.
- Overall Application Performance: Delayed background jobs can indirectly affect the responsiveness of your application.
Services
- Team that owns the service: Core Platform:Gitaly Team
- Label: gitlab-com/gl-infra/production~"Service::Sidekiq"
Verification
Troubleshooting
Analyze recent application changes and traffic patterns, and identify slow-running jobs.
- Check inflight workers for a specific shard: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq3a-shard-detail?orgId=1&viewPanel=11
- A specific worker might be running a large number of jobs.
- Check started jobs for a specific queue: https://log.gprd.gitlab.net/app/r/s/v28cQ
- A specific worker might be enqueuing a lot of jobs.
- Latency of job duration: https://log.gprd.gitlab.net/app/r/s/oZnYz
- We might be finishing jobs more slowly, causing the queue to build up.
- Throughput: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq3a-shard-detail?orgId=1&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-stage=main&var-shard=catchall&viewPanel=panel-17&from=now-6h/m&to=now/m&timezone=utc
- If there is a sharp drop for a specific worker, it might have slowed down.
- If there is a sharp increase for a specific worker, it's saturating the queue.
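If the dashboards or Kibana links above are unavailable, a rough view of queue sizes and latencies can also be obtained from a Rails console. The following is a minimal sketch using Sidekiq's standard Ruby API (sidekiq/api); it assumes console access on a Rails node, as described in the sections below.
# From `sudo gitlab-rails console`: list every queue with its current size and
# latency (seconds since the oldest job was enqueued), largest queues first.
require 'sidekiq/api'
Sidekiq::Queue.all.sort_by { |q| -q.size }.each do |queue|
  puts format('%-40s size=%-8d latency=%.1fs', queue.name, queue.size, queue.latency)
end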
Resolution
Scale Workers
Increase the number of concurrent Sidekiq workers if processing speed is the bottleneck.
- You can increase the maxReplicas for the specific shard. Things to keep in mind:
- If we run more concurrent jobs, it might add more pressure to downstream services (Database, Gitaly, Redis).
- Check if this was a sudden spike or if it’s sustained load.
Check for new workers
It could be that a new worker has started running, hopefully behind a feature flag that we can turn off.
Mail queue
If the queue is all in mailers and is in the many tens to hundreds of thousands, it is possible we have a spam/junk problem. If so, refer to the abuse team for assistance, and also see https://gitlab.com/gitlab-com/runbooks/snippets/1923045 for some spam-fighting techniques we have used in the past to clean up. This is kept in a private snippet so as not to tip our hand to the miscreants. The problem often shows up in our public GitLab projects but could plausibly occur in any other project as well.
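To confirm what is actually sitting in the mailers queue before engaging the abuse team, you can sample a few jobs from a Rails console. This is a sketch using Sidekiq's standard Ruby API (not GitLab-specific helpers):
# From `sudo gitlab-rails console`: inspect the mailers queue.
require 'sidekiq/api'
queue = Sidekiq::Queue.new('mailers')
puts "mailers queue size: #{queue.size}"
# Peek at a few jobs to see which mailers are being enqueued and with what
# arguments (useful for spotting spam patterns).
queue.take(5).each { |job| puts [job.display_class, job.display_args].inspect }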
Get queues using sq.rb script
sq is a command-line tool that you can run to assist you in viewing the state of Sidekiq and killing certain workers. To use it, first download a copy:
curl -o /tmp/sq.rb https://gitlab.com/gitlab-com/runbooks/raw/master/scripts/sidekiq/sq.rb
To display a breakdown of all the workers, run:
sudo gitlab-rails runner /tmp/sq.rb
How to disable a worker
In case of an incident caused by a misbehaving Sidekiq worker, here are the immediate actions you should take.
- Identify which sidekiq job class (worker) is causing the incident.
The sidekiq: Worker Detail dashboard may be helpful in checking a worker’s enqueue rate, queue size, and summed execution time spent in shared dependencies like the DB.
- Defer execution of all jobs of that class, using either:
- ChatOps: This should be run in the #production channel:
/chatops run feature set run_sidekiq_jobs_Example::SlowWorker false --ignore-feature-flag-consistency-check --ignore-production-check
- On the production rails node run the following to start the rails console:
sudo gitlab-rails console
After it starts, run:
Feature.disable(:"run_sidekiq_jobs_Example::SlowWorker")
The action will cause Sidekiq workers to defer (rather than execute) all jobs of that class, including jobs currently waiting in the queue. This should provide some immediate relief. For more details, see Disabling a worker.
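To verify the flag's state from the console, or to resume processing once the incident is resolved, a sketch like the following should work (Example::SlowWorker is the placeholder used above; substitute the real worker class, and note that Feature.enabled? can behave differently for flags without a checked-in definition, so treat the check as indicative):
# Check whether jobs of this class are currently being deferred.
Feature.enabled?(:"run_sidekiq_jobs_Example::SlowWorker")
# Once the incident is resolved, resume processing:
Feature.enable(:"run_sidekiq_jobs_Example::SlowWorker")
# Or remove the flag entirely so the default behaviour (run jobs) applies again:
Feature.remove(:"run_sidekiq_jobs_Example::SlowWorker")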
Examples of disabling workers via ChatOps
When the feature flag is set to false, 100% of the jobs will be deferred. But we can also use a percentage-of-actors rollout (an actor being each execution of a job) to progressively let the jobs be processed. For example:
/chatops run feature set run_sidekiq_jobs_SlowRunningWorker false --ignore-feature-flag-consistency-check
/chatops run feature set run_sidekiq_jobs_SlowRunningWorker --actors 10 --ignore-feature-flag-consistency-check
/chatops run feature set run_sidekiq_jobs_SlowRunningWorker --actors 50 --ignore-feature-flag-consistency-check
/chatops run feature delete run_sidekiq_jobs_SlowRunningWorker --ignore-feature-flag-consistency-check
Note that --ignore-feature-flag-consistency-check is necessary as it bypasses the consistency check between staging and production. It is totally safe to pass this flag as we don't need to turn on the feature flag in staging during an incident.
To ensure we are not leaving any worker deferred forever, check all feature flags matching run_sidekiq_jobs:
/chatops run feature list --match run_sidekiq_jobs
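If ChatOps is unavailable, a roughly equivalent rollout can be done from a Rails console with GitLab's Feature API. This is a sketch mirroring the ChatOps commands above; enable_percentage_of_actors and remove are assumed here as the console counterparts of --actors and feature delete:
Feature.disable(:run_sidekiq_jobs_SlowRunningWorker)                          # defer 100% of jobs
Feature.enable_percentage_of_actors(:run_sidekiq_jobs_SlowRunningWorker, 10)  # let ~10% of jobs run
Feature.enable_percentage_of_actors(:run_sidekiq_jobs_SlowRunningWorker, 50)  # let ~50% of jobs run
Feature.remove(:run_sidekiq_jobs_SlowRunningWorker)                           # back to normal processing
# List any run_sidekiq_jobs_* flags that still have a value stored:
Feature.persisted_names.select { |name| name.include?('run_sidekiq_jobs') }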
Dropping jobs using feature flags via ChatOps
Similar to deferring the jobs, we could enable the drop_sidekiq_jobs_{WorkerName} FF (disabled by default) to drop the jobs entirely (removed from the queue).
Example:
/chatops run feature set drop_sidekiq_jobs_SlowRunningWorker true --ignore-feature-flag-consistency-check
/chatops run feature delete drop_sidekiq_jobs_SlowRunningWorker --ignore-feature-flag-consistency-check
Note that the drop_sidekiq_jobs FF has precedence over the run_sidekiq_jobs FF. This means that when the drop_sidekiq_jobs FF is enabled and the run_sidekiq_jobs FF is disabled, the drop_sidekiq_jobs FF takes priority, so the job is dropped. Once the drop_sidekiq_jobs FF is disabled again, jobs are then deferred because run_sidekiq_jobs is still disabled.
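From a Rails console, the same drop flag can be toggled with the Feature API (a sketch mirroring the ChatOps commands above):
Feature.enable(:drop_sidekiq_jobs_SlowRunningWorker)   # start dropping jobs of this class
Feature.remove(:drop_sidekiq_jobs_SlowRunningWorker)   # stop dropping; deferral still applies while run_sidekiq_jobs_* is disabled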
Disabling a Sidekiq queue
When the system is under strain due to job processing, it may be necessary to completely disable a queue so that jobs will queue up but not be processed. To disable a queue, it needs to be excluded from the routing rules.
- Identify which shard is associated with the queue. The ways to determine this are:
- Find the queue in the Shard Overview Dashboard
- Find the resource_boundary for the queue in app/workers/all_queues.yml or ee/app/workers/all_queues.yml and see which routing rules match it in values.yml
- If the queue is being processed by catchall on K8s, remove the queue from values.yml
- If the queue is being processed by one of the other shards in K8s, add a selector
routingRules: resource_boundary=memory&name!=<queue name>
How to transfer a queue from an existing shard to a new one
In situations where one of the workers is flooding the queue, you can create a new shard or use an existing one. Once that is done, transfer queue traffic to that shard by modifying the routing rules.
Example MRs:
- [[Gprd] Route elasticsearch shard's workers to elasticsearch queue](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/974)
- [Move Members::DestroyWorker to quarantine queue](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/3822)
- [Define `urgent-authorized-projects` sidekiq shard](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/1963)
- [[gprd] Re-route quarantine, gitaly_throttled + database_throttled jobs to a single queue per shard](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/1196)
Remove jobs with certain metadata from a queue (e.g. all jobs from a certain user)
We currently track metadata in Sidekiq jobs, which allows us to remove Sidekiq jobs based on that metadata.
Interesting attributes for removing jobs from a queue are root_namespace, project, and user. The admin Sidekiq queues API can be used to remove jobs from queues based on these metadata values.
For instance:
curl --request DELETE --header "Private-Token: $GITLAB_API_TOKEN_ADMIN" https://gitlab.com/api/v4/admin/sidekiq/queues/post_receive?user=reprazent&project=gitlab-org/gitlab
This will delete all jobs from post_receive triggered by a user with username reprazent for the project gitlab-org/gitlab.
Check the output of each call:
- It will report how many jobs were deleted. 0 may mean your conditions (queue, user, project etc) do not match anything.
- This API endpoint is bound by the HTTP request time limit, so it will delete as many jobs as it can before terminating. If the completed key in the response is false, then the whole queue was not processed, so we can try again with the same command to remove further jobs (a scripted version is sketched below).
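Because the endpoint can time out before the queue is fully processed, it can be convenient to repeat the call until completed is true. The following is a minimal Ruby sketch, assuming the same example queue and filters as the curl call above and an admin token in the GITLAB_API_TOKEN_ADMIN environment variable:
# Sketch: repeat the admin Sidekiq queues DELETE call until the response
# reports completed=true.
require 'net/http'
require 'json'
require 'uri'

uri = URI('https://gitlab.com/api/v4/admin/sidekiq/queues/post_receive')
uri.query = URI.encode_www_form(user: 'reprazent', project: 'gitlab-org/gitlab')

loop do
  request = Net::HTTP::Delete.new(uri)
  request['Private-Token'] = ENV.fetch('GITLAB_API_TOKEN_ADMIN')

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
  body = JSON.parse(response.body)
  puts body # shows how many jobs were deleted and whether the run completed

  break if body['completed']
  sleep 1 # brief pause before retrying on the remaining jobs
end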
Metrics
This alert is based on the maximum value of the sidekiq_queue_size metric across different environments and queue names, which helps identify the queue with the most jobs waiting. It compares the maximum queue size to a threshold of 50,000; if the maximum queue size exceeds 50,000, the alert triggers. The threshold was chosen based on historical data; under normal conditions the graph should show a consistent pattern.
Alert Behavior
- Information on silencing the alert (if applicable). When and how can silencing be used? Are there automated silencing rules?
- If the current threshold is too sensitive for typical traffic, adjust it to a more suitable level.
- Expected frequency of the alert. Is it a high-volume alert or expected to be rare?
- This is a rare alert and mainly happens when Sidekiq is overloaded.
Severities
- The severity of this alert is generally going to be a ~severity::3 or ~severity::4
- There might be customer/user impact depending on which queue is affected.
Recent changes
Possible Resolutions
- Slack channels where help is likely to be found: #g_scalability