
- HAProxy may not be able to keep up
- Gitaly is overloaded
- Gitaly rate-limiting issues
*2018-11-06: Up to 15 minute delays on clones from GitLab repositories, including www-gitlab-com, gitlab-ee, gitlab-ce
- A S2 level incident lasting 3 days, led to disruption to git clones, in particular for the
www-gitlab-com
, gitlab-ee
, gitlab-ce
although many others were affected to.
- Diagnosis went around in circles:
- Initially targeted abuse
- High CI activity rates
- Workhorse throughput
- Network issues
- Gitaly concurrency limits (which had contributed)
- Smoking gun: not only git clones which were slow, artifact downloads against S3 had also sky-rocket in latency
- Testing git clones and artifact downloads via https://gitlab.com, then against front-end load balancers, then again Workhorse helps us pinpoint the issue with the HAProxy front-end fleet.