Sidekiq Concurrency Limit
Throttling/Circuit Breaker based on database usage
To protect the primary database against misbehaving or inefficient workers, which can lead to incidents such as slow job processing or degraded web availability, we have developed a circuit-breaking mechanism within Sidekiq itself.
When the database usage of a worker violates an indicator, Sidekiq will throttle the worker by decreasing its concurrency limit at one-minute intervals. In the worst case, the worker's concurrency limit will be suppressed down to 1.
Once database usage has returned to a healthy level, the concurrency limit automatically recovers towards its default, but at a much slower rate than it was throttled. The throttling and recovery rates are defined here.
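The asymmetry between throttling and recovery can be pictured with a small sketch. This is illustrative only, not the production implementation; the class, method names, and factor values below are made up, and the real rates live in the source linked above.

```ruby
# Illustrative sketch only -- the actual throttling/recovery rates are defined
# in the GitLab source referenced above; these numbers are hypothetical.
class ConcurrencyLimitExample
  THROTTLE_FACTOR = 0.5   # hypothetical: halve the limit on every violating interval
  RECOVERY_FACTOR = 1.1   # hypothetical: recover by 10% per healthy interval
  MINIMUM_LIMIT   = 1     # worst case documented above

  def initialize(default_limit)
    @default_limit = default_limit
    @current_limit = default_limit
  end

  # Called once per minute while the worker violates a DB usage indicator.
  def throttle!
    @current_limit = [(@current_limit * THROTTLE_FACTOR).floor, MINIMUM_LIMIT].max
  end

  # Called once per minute while DB usage is healthy; recovery is slower than throttling.
  def recover!
    @current_limit = [(@current_limit * RECOVERY_FACTOR).ceil, @default_limit].min
  end

  attr_reader :current_limit
end
```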
Observability around the concurrency limit is available on the sidekiq: Worker Concurrency Detail dashboard.
Database usage indicators
There are 2 indicators on which the application will throttle a worker:
- DB duration usage (primary DBs only)

  Dashboard:

  By default, the per-minute DB duration should not exceed a limit of 20,000 DB seconds/minute for non-high-urgency workers and 100,000 DB seconds/minute for high-urgency workers (source).

  The limits above can also be overridden as described below. To check the current limits:

  ```
  glsh application_settings get resource_usage_limits -e gprd
  ```
- Number of non-idle DB connections

  Dashboard:

  - pgbouncer connection saturation
  - pgbouncer-ci connection saturation
  - pgbouncer-sec connection saturation
  Sidekiq periodically samples non-idle DB connections from pg_stat_activity to determine which worker classes are consuming the most connections. The system determines the predominant worker (the worker consuming the most connections) by:

  - Summing the number of connections used by a worker over the last 4 samples of pg_stat_activity (approximately 4 minutes of data)
  - Taking the worker with the most aggregated connections as the "predominant worker" (a sketch of this aggregation follows the list)
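For illustration, the aggregation described above might look roughly like the following. The sample format and method name are assumptions for this sketch, not the actual GitLab implementation.

```ruby
# Illustrative sketch: pick the "predominant worker" from the last 4 samples
# of per-worker non-idle connection counts taken from pg_stat_activity.
# The sample format and method name are assumptions, not GitLab's real code.
def predominant_worker(samples)
  # samples: array of hashes, e.g. [{ "Foo::BarWorker" => 12, ... }, ...]
  last_four = samples.last(4)

  totals = Hash.new(0)
  last_four.each do |sample|
    sample.each { |worker, connections| totals[worker] += connections }
  end

  # The worker with the most aggregated connections over ~4 minutes of data.
  totals.max_by { |_worker, total| total }&.first
end
```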
Throttling events
The table below illustrates what happens when each indicator is violated (❌ = indicator violated, ✅ = within limits):
Indicator 1 (DB duration) | Indicator 2 (DB connections) | Throttling Event |
---|---|---|
❌ | ✅ | Soft Throttle |
❌ | ❌ | Hard Throttle |
✅ | ❌ | No throttling, as some workers may momentarily hold many connections during normal workload |
✅ | ✅ | No throttling |
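Expressed as code, the decision table reduces to something like the following illustrative mapping. The method and return symbols are hypothetical; the boolean arguments correspond to ❌ in the table.

```ruby
# Illustrative mapping of the decision table above; not GitLab's real code.
# duration_violated / connections_violated correspond to ❌ in the table.
def throttling_event(duration_violated:, connections_violated:)
  if duration_violated && connections_violated
    :hard_throttle
  elsif duration_violated
    :soft_throttle
  else
    # Connection spikes alone are tolerated, since some workers briefly
    # hold many connections under normal workload.
    :none
  end
end
```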
Updating DB duration limits
The DB duration usage limits described above can only be updated by calling the application settings API; they cannot currently be set through the admin web UI.
- Prepare a JSON file. Here's an example to update a single worker, Chaos::DbSleepWorker, to have its own limit on the main DB:

  ```
  ❯ cat rules.json
  {
    "main_db_duration_limit_per_worker": {
      "resource_key": "db_main_duration_s",
      "metadata": { "db_config_name": "main" },
      "scopes": ["worker_name"],
      "rules": [
        { "selector": "worker_name=Chaos::DbSleepWorker", "threshold": 5, "interval": 60 },
        { "selector": "urgency=high", "threshold": 100000, "interval": 60 },
        { "selector": "*", "threshold": 20000, "interval": 60 }
      ]
    },
    "ci_db_duration_limit_per_worker": {
      "resource_key": "db_ci_duration_s",
      "metadata": { "db_config_name": "ci" },
      "scopes": ["worker_name"],
      "rules": [
        { "selector": "urgency=high", "threshold": 100000, "interval": 60 },
        { "selector": "*", "threshold": 20000, "interval": 60 }
      ]
    },
    "sec_db_duration_limit_per_worker": {
      "resource_key": "db_sec_duration_s",
      "metadata": { "db_config_name": "sec" },
      "scopes": ["worker_name"],
      "rules": [
        { "selector": "urgency=high", "threshold": 100000, "interval": 60 },
        { "selector": "*", "threshold": 20000, "interval": 60 }
      ]
    }
  }
  ```

  To prepare a file with the current configuration to edit, run:
  ```
  glsh application_settings get resource_usage_limits > rules.json
  ```
- Run the helper script glsh application_settings resource_usage_limits to update the limits with an admin PAT:

  ```
  glsh application_settings set resource_usage_limits -f rules.json -e gprd
  ```
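If glsh is unavailable, the same update can in principle be made directly against the application settings REST API mentioned above. The sketch below is a hedged example: the instance URL and token are placeholders, and passing the JSON file as the resource_usage_limits parameter is an assumption based on the setting name used by the glsh helper.

```ruby
# Hedged sketch: update the resource_usage_limits application setting via the
# REST API (PUT /api/v4/application/settings). URL and token are placeholders.
require "net/http"
require "uri"

uri   = URI("https://gitlab.example.com/api/v4/application/settings")
rules = File.read("rules.json")

request = Net::HTTP::Put.new(uri)
request["PRIVATE-TOKEN"] = ENV.fetch("ADMIN_PAT")       # admin personal access token
request.set_form_data("resource_usage_limits" => rules) # assumption: key matches the glsh setting name

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

puts response.code
```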
Disabling the Throttling/Circuit Breaker feature entirely
During the rollout of this feature, there are several feature flags that can be disabled if anything goes wrong with the throttling.
To disable throttling globally for all workers:
/chatops run feature set concurrency_limit_current_limit_from_redis false
/chatops run feature set sidekiq_throttling_middleware false
To disable throttling for a worker:
# replace Security::SecretDetection::GitlabTokenVerificationWorker with the worker you want to disable
/chatops run feature set `disable_sidekiq_throttling_middleware_Security::SecretDetection::GitlabTokenVerificationWorker` true
SidekiqConcurrencyLimitQueueBacklogged Alert
This alert fires when a Sidekiq worker has accumulated too many jobs in the Concurrency Limit queue (>100,000 jobs for more than 1 hour).
These jobs are queued in Redis Cluster SharedState, so a large backlog could saturate Redis Cluster SharedState memory if left untreated.
Option 1: Increase Worker Concurrency Limit (Preferred)
If the worker can safely handle more concurrent jobs:
- Locate the worker definition in the codebase
- Check the current concurrency limit setting from the dashboard or the worker class definition.
- Create an MR to increase the limit to an appropriate value (see the sketch after this list) based on:
  - Current processing rate
  - System resource constraints
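For the MR in the last step, the change is typically a small edit to the worker class. The sketch below assumes the worker declares its limit with the concurrency_limit worker attribute; the class name and values are placeholders, not a real worker.

```ruby
# Illustrative only: a hypothetical worker in the GitLab codebase raising its
# concurrency limit. Class name and limit values are placeholders.
class BacklogProneWorker
  include ApplicationWorker

  data_consistency :sticky
  feature_category :not_owned

  # Raised from a hypothetical previous value of 200 after confirming the
  # processing rate and DB usage can absorb the extra concurrency.
  concurrency_limit -> { 400 }

  def perform(*args)
    # ...
  end
end
```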
Option 2: Temporarily Disable Concurrency Limit
Section titled “Option 2: Temporarily Disable Concurrency Limit”As a last resort, sidekiq_concurrency_limit_middleware
feature flag can be disabled to help clear the backlogs faster
without waiting for deployment as in Option 1.
Note that this FF affects all workers globally and disabling on worker level is not supported.
- Check the Worker Concurrency Detail dashboard to ensure no other workers have a significant backlog
- Disable the feature flag using:
/chatops run feature set sidekiq_concurrency_limit_middleware false --ignore-feature-flag-consistency-check
- Monitor the queue size to confirm it’s draining
- Important: Re-enable the feature flag once the backlog is cleared:
/chatops run feature set sidekiq_concurrency_limit_middleware true --ignore-feature-flag-consistency-check
When the feature flag is disabled:
- Up to 5000 jobs will be immediately processed from the queue
- New jobs will execute immediately without throttling
Post-Incident Tasks
- Create an issue to properly address the root cause if Option 2 was used
- Update monitoring thresholds if needed
- Document any findings about the worker’s behavior under load