Gitaly
Overview
Gitaly is currently deployed in a way that makes each individual server a single point of failure (SPoF) for GitLab.com.
Because of this, the service requires substantial coordination when scheduling reboots.
Work on the Gitaly Raft implementation may make maintenance considerably easier by removing each server as a SPoF, and may enable a migration to Kubernetes.
Lead Time
Lead time for patching will likely be substantial. Patching must be coordinated with Customer Support teams, and possibly with account managers for larger customers, before scheduling an outage period for Git operations on GitLab.com.
Required Time For Execution
This depends largely on how many Gitaly nodes are upgraded at once. Upgrading a single instance at a time gives a maintenance window of a little more than 5 hours: ~160 nodes * 2 minutes per reboot cycle is about 320 minutes.
We could increase the number of nodes upgraded at once to reduce the overall time, accepting that any given user would be more likely to see impact during the shorter window.
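The trade-off above can be estimated with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that all nodes in a batch reboot in parallel and that a reboot cycle still takes about 2 minutes regardless of batch size, which may not hold in practice.

```python
import math


def maintenance_window_minutes(node_count: int, batch_size: int,
                               minutes_per_cycle: float = 2.0) -> float:
    """Estimate the total window, assuming each batch reboots in parallel."""
    if node_count < 0 or batch_size < 1:
        raise ValueError("node_count must be >= 0 and batch_size must be >= 1")
    # A partially filled final batch still costs one full reboot cycle.
    batches = math.ceil(node_count / batch_size)
    return batches * minutes_per_cycle


def format_window(minutes: float) -> str:
    """Render a duration in minutes as hours and minutes."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours}h {mins:02d}m"


if __name__ == "__main__":
    # ~160 nodes and 2 minutes per cycle are the figures quoted above.
    for batch_size in (1, 2, 4, 8):
        window = maintenance_window_minutes(160, batch_size)
        print(f"batch size {batch_size}: {format_window(window)}")
```

With a batch size of 1 this reproduces the estimate of a little more than 5 hours; larger batches shorten the window proportionally under the stated assumptions.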
System Identification
Knife query:
knife search node 'roles:gprd-base-stor'
Process
See Linux Patching Overview for generic processes applied to all Linux systems.
Run the steps below against instances in a rolling fashion, in groups of a predetermined size.
- Disable the weight assigner before starting.
- Set the weight for new projects on the storage(s) to zero
- Update packages with apt
- Reboot instance
- Validate that Gitaly is running again.
gitlab-ctl status
- Check the log (/var/log/gitlab/gitaly/current) to ensure that startup errors haven’t occurred.
- Restore the weight on the storage to the previous value.
- Repeat for each node
- Enable the weight assigner
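Since no automation exists for this process, the ordering of the steps above can be sketched as a dry-run plan generator. This is not existing tooling: the function names and node names are hypothetical, nothing is executed, and the choice to apply each step to every node in a group before moving to the next step is an assumption about how a group would be handled.

```python
from typing import Iterator, List, Sequence

# Per-node steps, in the order given in the process above.
PER_NODE_STEPS = (
    "set the weight for new projects on the storage(s) to zero",
    "update packages with apt",
    "reboot instance",
    "validate that Gitaly is running again (gitlab-ctl status)",
    "check /var/log/gitlab/gitaly/current for startup errors",
    "restore the storage weight to its previous value",
)


def chunk(nodes: Sequence[str], group_size: int) -> Iterator[List[str]]:
    """Split the node list into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    for start in range(0, len(nodes), group_size):
        yield list(nodes[start:start + group_size])


def build_plan(nodes: Sequence[str], group_size: int) -> List[str]:
    """Return the ordered steps as text; this only describes, it runs nothing."""
    plan = ["disable the weight assigner"]
    for index, group in enumerate(chunk(nodes, group_size), start=1):
        # Assumption: each step is applied to the whole group before the next.
        for step in PER_NODE_STEPS:
            for node in group:
                plan.append(f"group {index}: {node}: {step}")
    plan.append("enable the weight assigner")
    return plan


if __name__ == "__main__":
    # Hypothetical node names; real names would come from the knife query.
    for line in build_plan(["gitaly-01", "gitaly-02", "gitaly-03"], group_size=2):
        print(line)
```

The weight assigner is disabled once at the start and enabled once at the end, while the remaining steps repeat for every group, matching the list above.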
Additional Automation Tooling
None currently exists.