Gitaly repository cgroups
Each repository is assigned to a cgroup (cgroup v1) that enforces memory and CPU limits. This use of cgroups prevents any one project from starving other projects of resources.

There are several ways to induce unbounded memory and CPU usage on Gitaly nodes. In general, an unbounded resource usage pattern can only be fully prevented by adding an upper bound. Efficiency improvements can help reduce incident frequency, but they cannot prevent incidents. For example, if we somehow made git object traversal 5x more memory efficient, that would help the general case, but a bad actor could still trigger the same saturation behavior by adding 5x more objects to their repo or running 5x more concurrent commands. Cgroups provide that missing bounding behavior.

In designing a resource isolation model, we chose the customer-oriented boundary of per-project limits. This boundary is easy for users to understand and work with, and it matches most of the saturation incidents observed in production, where a single project's git commands collectively saturated the host's CPU or memory.
The limits are calibrated to ensure that the normal workload on all Gitaly nodes would not approach the limits. All projects have a generous burst capacity, but that ceiling is now less than the machine’s full capacity. Enough capacity is reserved that all other projects on the Gitaly host can continue with their normal workload while any one project is bursting to its limit of CPU or memory usage.
The cgroup hierarchy:
/sys/fs/cgroup
|
|--memory
|    |--gitaly
|         |--gitaly-<pid>
|              |--memory.limit_in_bytes
|              |--repos-0
|              |    |--memory.limit_in_bytes
|              |--repos-1
|              |    |--memory.limit_in_bytes
|              |--repos-2
|              |    |--memory.limit_in_bytes
|              |--repos-3
|              |    |--memory.limit_in_bytes
|              |--repos-4
|              |    |--memory.limit_in_bytes
|              |--repos-5
|              |    |--memory.limit_in_bytes
|              |--repos-6
|              |    |--memory.limit_in_bytes
|              |--repos-7
|              |    |--memory.limit_in_bytes
|              |--repos-8
|              |    |--memory.limit_in_bytes
|              |--repos-9
|              |    |--memory.limit_in_bytes
|              |--repos-10
|                   |--memory.limit_in_bytes
|
|--cpu
|    |--gitaly
|         |--gitaly-<pid>
|              |--cpu.shares
|              |--repos-0
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-1
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-2
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-3
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-4
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-5
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-6
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-7
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-8
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-9
|              |    |--cpu.shares
|              |    |--cpu.cfs_quota_us
|              |--repos-10
|                   |--cpu.shares
|                   |--cpu.cfs_quota_us
/gitaly
: Known as the hierarchy root. We don't specify any limits or put any PIDs in this cgroup; it is mostly used for memory accounting.
/gitaly-<pid>
: Created when Gitaly starts. We set the upper CPU and memory limits here, and we don't put any PIDs in this cgroup.
/gitaly-<pid>/repos-0
: The actual git commands run inside this cgroup, which has CPU and memory limits for those git commands. The git commands in one cgroup can belong to one or more Git repositories, depending on the number of cgroups we have.
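One way to picture how repositories end up sharing a repos-N cgroup is to hash a stable repository identifier and take the result modulo the number of cgroups. The sketch below is illustrative only: the function name `repo_bucket`, the use of `cksum` as the hash, and the choice of the repository path as the key are assumptions made for this example, not a description of Gitaly's actual implementation.

```shell
#!/usr/bin/env bash
# Illustrative only: map a repository path to one of N cgroup buckets.
# cksum (POSIX CRC) stands in for whatever hash Gitaly really uses.
repo_bucket() {
  local repo_path="$1" bucket_count="$2"
  local checksum
  # cksum prints "<checksum> <byte count>"; keep only the checksum.
  checksum=$(printf '%s' "$repo_path" | cksum | cut -d' ' -f1)
  echo "repos-$(( checksum % bucket_count ))"
}

# Hypothetical repository path, with the 11 buckets shown in the hierarchy.
repo_bucket "@hashed/ab/cd/abcd.git" 11
```

Because the mapping is deterministic, the same repository always lands in the same bucket, and with fewer buckets than repositories several repositories share one bucket and therefore share its limits.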
Learning about cgroup in Linux
- Kernel documentation
- CFS Bandwidth Control
- Understanding and Working with the Cgroups Interface
- How to understand the linux control groups cgroups
Reference links for Gitaly cgroup
- Gitaly cgroup documentation
- Justification/Original Issue
- Production rollout
- Infrastructure cgroup epic
- Rejected RFC
- First experiment
Symptoms
- Increased latency due to CPU saturation.
- Higher error rate because we are killing git processes.
- High number of OOM kills on the node.
Monitoring
We monitor Gitaly cgroups using cAdvisor. We only scrape part of the information it exposes. The most interesting metrics are:
container_cpu_usage_seconds_total
: A counter that tracks the CPU usage of each cgroup. If it is a flat line, it might indicate that the cgroup is not being used or that it is being saturated.
container_cpu_cfs_throttled_seconds_total
: A counter that tracks how much we are throttling the cgroup. The higher the value, the more throttling we are doing.
container_memory_usage_bytes
: A gauge that tracks the memory usage of the cgroup.
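Because container_cpu_cfs_throttled_seconds_total is a counter, the useful signal is how fast it grows rather than its absolute value. A minimal sketch of that arithmetic follows, assuming two samples taken a known number of seconds apart. The function name `counter_rate` and the sample values are made up for illustration.

```shell
#!/usr/bin/env bash
# Per-second rate of change between two counter samples taken interval_s apart.
counter_rate() {
  local earlier="$1" later="$2" interval_s="$3"
  awk -v a="$earlier" -v b="$later" -v dt="$interval_s" \
    'BEGIN { printf "%.2f\n", (b - a) / dt }'
}

# 30 additional throttled seconds accumulated over a 60 second window.
counter_rate 120 150 60   # prints 0.50
```

In this made-up example the cgroup accumulated half a second of throttled time per second of wall-clock time over the window.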
Inside Gitaly: Host Detail you can find a cgroup panel that gives you information about cgroups.
When a cgroup reaches its memory limit, the kernel OOM kills a process in that cgroup, which we can see in the kernel logs.
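To pull the affected processes out of the kernel log, a small filter can help. This is a sketch under assumptions: it matches the "Memory cgroup out of memory" message that Linux prints for memory cgroup OOM kills, but the exact wording varies between kernel versions, and the function name `memcg_oom_victims` and the sample line are made up for illustration.

```shell
#!/usr/bin/env bash
# Print "<pid> <process name>" for each memory-cgroup OOM kill found on stdin.
# Assumes a message of the form:
#   Memory cgroup out of memory: Killed process <pid> (<name>) ...
# Older kernels print "Kill process" instead of "Killed process".
memcg_oom_victims() {
  sed -nE 's/.*Memory cgroup out of memory: Kill(ed)? process ([0-9]+) \(([^)]*)\).*/\2 \3/p'
}

# Illustrative sample line; on a node you would pipe in the kernel log,
# for example: sudo dmesg | memcg_oom_victims
printf '%s\n' \
  '[12345.678901] Memory cgroup out of memory: Killed process 4321 (git) total-vm:1048576kB' \
  | memcg_oom_victims   # prints "4321 git"
```

Host-wide OOM kills, which are logged without the "Memory cgroup" prefix, are deliberately ignored so that only kills caused by a cgroup limit show up.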
To find out which cgroup was used for the commands that ran for a specific RPC, look at the json.command.cgroup_path field.
Useful debugging commands
- Find out which cgroups are being used

    ps -o pid= --ppid $( pidof gitaly ) | xargs -I{} cat /proc/{}/cgroup 2> /dev/null | awk -F: '$2 ~ /cpu,cpuacct|memory/ { print $2, $3 }' | sort -V | uniq -c

- Total number of cgroups created

    sudo find /sys/fs/cgroup/{cpu,memory}/gitaly -mindepth 1 -type d | wc -l

- Get CPU shares

    ssh file-01-stor-gprd.c.gitlab-production.internal -- 'sudo cat /sys/fs/cgroup/cpu,cpuacct/gitaly/gitaly-$(pidof gitaly)/cpu.shares && sudo cat /sys/fs/cgroup/cpu,cpuacct/gitaly/gitaly-$(pidof gitaly)/repos-1/cpu.shares'

- Get CPU quota

    ssh file-01-stor-gprd.c.gitlab-production.internal -- 'sudo cat /sys/fs/cgroup/cpu,cpuacct/gitaly/gitaly-$(pidof gitaly)/cpu.cfs_quota_us && sudo cat /sys/fs/cgroup/cpu,cpuacct/gitaly/gitaly-$(pidof gitaly)/repos-1/cpu.cfs_quota_us'

- Get memory limits

    ssh file-01-stor-gprd.c.gitlab-production.internal -- 'sudo cat /sys/fs/cgroup/memory/gitaly/gitaly-$(pidof gitaly)/memory.limit_in_bytes && sudo cat /sys/fs/cgroup/memory/gitaly/gitaly-$(pidof gitaly)/repos-1/memory.limit_in_bytes'
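The raw values returned by these files are easier to reason about once converted. On cgroup v1, cpu.cfs_quota_us divided by cpu.cfs_period_us gives the number of CPU cores a cgroup may use, and a quota of -1 means no limit is set. The helpers below are a sketch of that conversion. The function names and the sample values are made up for illustration and are not the production settings.

```shell
#!/usr/bin/env bash
# Convert a CFS quota and period (both in microseconds) to a CPU core count.
# The period defaults to 100000 (100 ms), the kernel default for cpu.cfs_period_us.
cfs_quota_to_cpus() {
  local quota_us="$1" period_us="${2:-100000}"
  if [ "$quota_us" -lt 0 ]; then
    # A quota of -1 means no bandwidth limit on cgroup v1.
    echo "unlimited"
    return
  fi
  awk -v q="$quota_us" -v p="$period_us" 'BEGIN { printf "%.2f\n", q / p }'
}

# Convert a byte count, such as the content of memory.limit_in_bytes, to GiB.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

cfs_quota_to_cpus 400000 100000   # prints 4.00
bytes_to_gib 21474836480          # prints 20.00
```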