IMDS Throttling

GitLab Runner relies on IMDS to obtain short-lived AWS credentials for S3 cache access. When IMDS is throttled, those credential requests fail, so cache retrieval fails too. If the affected jobs are configured to fail when the cache is missing, this results in a high rate of job failures.

IMDS throttling might be suspected if:

  • The customer is reporting unexpected, transient cache-related job failures.
  • There is a sudden, unexpected increase in script failures, disproportionate to the increase in running jobs, as seen on the customer's Hosted Runners Overview Dashboard.

If IMDS throttling continues long enough and the influx of constantly retrying jobs exceeds what the system can handle, you might also get paged for HostedRunnersServiceCiRunnerJobsApdexSLOViolationSingleShard.

At its very worst, IMDS throttling could cause a Sev1 incident.

If you are experiencing cache-related IMDS throttling, the following query will return an excessive number of results in the OpenSearch logs for the customer: (json.msg: *no EC2 IMDS role found* OR json.err: *no EC2 IMDS role found*) AND fluent_d: *{shard}-manager-logs

It is also worth briefly looking over all fluent_d: *{shard}-manager-logs and json.level: error entries in the OpenSearch logs for the customer during the problem period, in case the message is formatted differently after recent observability improvements.
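
The two searches above differ only in the shard name, so they can be generated from a template. The following is a minimal sketch in Python, assuming the resulting strings are pasted into the OpenSearch search bar exactly as written in this runbook. The function names and the example shard name are hypothetical.

```python
IMDS_ERROR = "*no EC2 IMDS role found*"


def imds_error_query(shard: str) -> str:
    """Query for the known IMDS error message in a shard's manager logs."""
    return (
        f"(json.msg: {IMDS_ERROR} OR json.err: {IMDS_ERROR}) "
        f"AND fluent_d: *{shard}-manager-logs"
    )


def all_errors_query(shard: str) -> str:
    """Broader query: every error-level entry in a shard's manager logs."""
    return f"fluent_d: *{shard}-manager-logs AND json.level: error"


if __name__ == "__main__":
    shard = "example-shard"  # hypothetical shard name, replace with the real one
    print(imds_error_query(shard))
    print(all_errors_query(shard))
```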

Alternatively, you can run the monitoring commands that AWS shared, sudo ss -tnp dst 169.254.169.254 and sudo tcpdump -i any host 169.254.169.254 -nn -c 500, on the affected Runner Manager to track active connections and traffic patterns.

You can also investigate these metrics to get a deeper understanding of what is going on.

| Metric | Type | Description | GitLab Runner Binary Availability |
|---|---|---|---|
| gitlab_runner_cache_s3_assume_role_requests_in_flight | Gauge | Number of AssumeRole requests to AWS STS in progress. | v18.11.0 onwards |
| gitlab_runner_cache_s3_assume_role_wait_seconds | Histogram | Wait time to acquire a concurrency slot before issuing an AssumeRole request. | v18.11.0 onwards |
| gitlab_runner_cache_s3_assume_role_duration_seconds | Histogram | Duration of AssumeRole API calls to AWS STS. | v18.11.0 onwards |
| gitlab_runner_cache_s3_assume_role_cache_hits_total | Counter | Number of AssumeRole credential cache hits (STS call avoided). | v18.11.0 onwards |
| gitlab_runner_cache_s3_assume_role_cache_misses_total | Counter | Number of AssumeRole credential cache misses (this is also a count of the STS calls for cache credentials that were made). | v18.11.0 onwards |
| gitlab_runner_cache_s3_assume_role_cached_credentials | Gauge | Number of AssumeRole credentials held in the in-memory LRU cache. | v18.11.0 onwards |
| gitlab_runner_cache_s3_assume_role_failures_total | Counter | Number of AssumeRole requests which failed. | v19.0.0 onwards |

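
One way to read the two cache counters together: since every miss is also an STS call, the hit ratio over a time window shows how much credential traffic the in-memory cache is absorbing. The following is a minimal sketch, assuming you have already read the increase of both counters over the same window from your metrics backend. The function name is hypothetical and is not part of gitlab-runner.

```python
from typing import Optional


def assume_role_cache_hit_ratio(hits: float, misses: float) -> Optional[float]:
    """Fraction of AssumeRole credential lookups served from the cache.

    hits:   increase of gitlab_runner_cache_s3_assume_role_cache_hits_total
    misses: increase of gitlab_runner_cache_s3_assume_role_cache_misses_total
    Returns None when there were no lookups in the window.
    """
    total = hits + misses
    if total == 0:
        return None
    return hits / total


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    print(assume_role_cache_hit_ratio(hits=3, misses=1))
```

A ratio that drops during the problem period means more STS calls are being made for cache credentials, which is worth correlating with the failures counter.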

To mitigate IMDS throttling, switch the affected runner to static user-based authentication:

  1. Add the following to the Runner Stack's RUNNER_MODEL Overrides in Switchboard to switch that runner from role-based authentication (which relies on IMDS) to static user-based authentication:
     {
       ...
       "stack": {
         "cache": {
           "bucketAuthType": "user-based"
         },
         ...
       }
       ...
     }
  2. Run hosted_runner_deploy via Switchboard to redeploy.

The IMDS errors should disappear as soon as the provision job has completed.
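
The override from step 1 has to be merged into whatever overrides the stack already has, without dropping the other keys that the ... placeholders stand for. The following is a minimal sketch of that merge, assuming the existing overrides are available as a JSON object. The helper name and the example key are hypothetical, and Switchboard itself may handle the merge differently.

```python
import copy
import json


def with_user_based_cache_auth(overrides: dict) -> dict:
    """Return a copy of the overrides with stack.cache.bucketAuthType set."""
    merged = copy.deepcopy(overrides)  # leave the caller's dict untouched
    stack = merged.setdefault("stack", {})
    cache = stack.setdefault("cache", {})
    cache["bucketAuthType"] = "user-based"
    return merged


if __name__ == "__main__":
    # Hypothetical existing overrides; "exampleKey" is a placeholder only.
    existing = {"stack": {"cache": {"exampleKey": "kept"}}}
    print(json.dumps(with_user_based_cache_auth(existing), indent=2))
```

Working on a deep copy keeps the original overrides available for comparison, which makes it easier to revert to role-based authentication later.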

Changes in gitlab-runner v18.11 to how credentials are gathered for IAM roles should greatly reduce the likelihood of IMDS throttling on connections to an S3 cache.