GitLab Storage Re-balancing
Moving project repositories between gitaly storage shards presently involves direct human intervention, which is less than ideal. To reduce the cognitive load of these procedures, the following instructional walk-throughs are documented here.
Summary
Moving a project git repository from the file system of one gitaly storage shard node to another is referred to as “migration”.
A migration consists of both a repository replication and an update of the `repository_storage` field of the given `Project` in the GitLab database. Only if both of these operations succeed is a migration considered to have taken place successfully.
To replicate a repository and update the database field which tracks the residence of the project repository, the project repository must be marked as read-only. Once the job has completed, the project must be marked as writable again by setting `project.repository_read_only = false`.
To orchestrate this, a `gitlab-rails` app feature is responsible for scheduling a sidekiq job called `ProjectUpdateRepositoryStorageWorker`. During invocation, activity for this job will appear in the kibana logs. For example: https://log.gprd.gitlab.net/goto/1a02b96a7066e7c2cbacbf55e3d5947d
You may find the implementation here: https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/projects/update_repository_storage_service.rb#L16
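The read-only flag is the main operational hazard here: if the sidekiq job dies mid-migration, the project can be left read-only. A minimal sketch of resetting the flag from a rails console node, assuming `gitlab-rails runner` is available there; the project ID is a hypothetical placeholder:

```
# Hedged sketch: set a project back to writable after a dead migration job.
# The project ID 12345678 is a hypothetical placeholder.
sudo gitlab-rails runner "
  project = Project.find(12345678)
  project.update!(repository_read_only: false) if project.repository_read_only
"
```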
Dashboards
The Gitaly Rebalancing dashboard is designed to assist with decision making around manual re-balancing of repositories. Consult this dashboard before triggering a manual re-balance, to get an idea of which servers are over-utilized and which are under-utilized.
How to migrate a project repository
Over time, a few methods have been developed for relocating a project repository from one gitaly storage shard node file system to another.
Single repo manual selection
- Log in to gitlab.com using an admin account.
- Navigate to https://gitlab.com/profile/personal_access_tokens and generate a private token.
- Enable the `api` scope.
- Set an expiration date three or four days from now.
- Take note of the project ID. You will need it to move the project via the API. You can find it on the project page, next to the project avatar and under the project name.
- Export your admin auth token as an environment variable in your shell session:

  ```
  export GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN=CHANGEME
  ```

- Trigger the move using the API (see the direct API example after this list). For example:

  ```
  ./scripts/storage_repository_move.sh 'path/to/project' 'nfs-fileYY'
  ```

- Note: The `nfs-fileYY` parameter is the name of the destination gitaly shard as configured in the `git_data_dirs` options of the `gitlab.rb` file.
- Note: The project will automatically be set to read-only and set back to read-write after the move.
- If needed, check logs for the sidekiq job in Kibana: https://log.gprd.gitlab.net/goto/35c31768d3be0137be06e562422ffba0
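The wrapper script drives the GitLab API. A roughly equivalent direct call, assuming the project repository storage moves endpoint is available on the instance; the project ID is a hypothetical placeholder:

```
# Hedged example: request the move directly via the API instead of the wrapper script.
curl --request POST \
  --header "PRIVATE-TOKEN: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{"destination_storage_name": "nfs-fileYY"}' \
  "https://gitlab.com/api/v4/projects/12345678/repository_storage_moves"
```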
Multiple repo automated selection
Through the Balancer project, you can trigger a pipeline to automatically balance the gitaly shards with the highest disk space usage. You can do this by triggering a pipeline with the following environment variables:

- `ENVIRONMENT`: Default is `staging`; set this to `production`.
- `DRY_RUN`: Defaults to `true`. Set this to `false` or `no`.
- `SLACK_NOTIFY`: Notify the #production channel. Set this to `true`.

The migration logs will be saved as an artifact that you can download in the `move_projects` job. An example of triggering such a pipeline via the API is shown below.
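Triggering the pipeline through the pipeline trigger API might look like the following sketch; the Balancer project ID and trigger token are hypothetical placeholders:

```
# Hedged example: trigger a production balancing run with the documented variables.
curl --request POST \
  --form "token=${TRIGGER_TOKEN}" \
  --form "ref=master" \
  --form "variables[ENVIRONMENT]=production" \
  --form "variables[DRY_RUN]=false" \
  --form "variables[SLACK_NOTIFY]=true" \
  "https://gitlab.com/api/v4/projects/<balancer-project-id>/trigger/pipeline"
```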
Slightly automated selection
A script exists in this repo: `scripts/storage_rebalance.rb`
The goal of this script is to safely and reliably replicate project git repositories from one gitaly shard to another.
This script will select the projects with the largest repositories on the given source gitaly shard and schedule them for replication to the destination gitaly shard. If a minimum number of gigabytes is given, the script will continue to replicate repositories to the destination shard until the total gigabytes replicated has reached the given amount.
How to use it
- Clone this repository:

  ```
  git clone git@gitlab.com:gitlab-com/runbooks.git
  ```

- Change directory into the cloned runbooks project repository:

  ```
  cd runbooks
  ```

- Install any necessary rubies and dependencies:

  ```
  rbenv install $(rbenv local)
  gem install bundler
  bundle install --path=vendor/bundle
  ```

- You will need a personal access token with the `api` scope enabled. Export the token as an environment variable in your shell session:

  ```
  export GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN=CHANGEME
  ```

- Invoke the script using the `--help` flag for usage details:

  ```
  bundle exec scripts/storage_rebalance.rb --help
  ```

- Create a new production change issue using the `storage_rebalancing` template and follow the instructions in the issue description.
- Invoke a dry run and record the output in the re-balancing issue:

  ```
  bundle exec scripts/storage_rebalance.rb nfs-fileXX nfs-fileYY --move-amount=1000 --dry-run=yes | tee scripts/logs/nfs-fileXX.migration.$(date --utc +%Y-%m-%d_%H:%M).log
  ```

- Invoke the same command except with the `--dry-run=no` argument.
Note: Repository replication errors are recorded, and their log artifacts may be reviewed:

```
find scripts/storage_migrations -name 'failed*.log' -exec cat {} \; | jq
```

The script will automatically skip such failed project repository replications in subsequent invocations. Additional projects may be skipped using the `--skip` command line argument, as shown below.
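For example, a dry run that excludes specific projects might look like the following; the exact `--skip` argument format is an assumption here, so consult `--help`:

```
# Hedged example: exclude a project (hypothetical ID) from the rebalance plan.
bundle exec scripts/storage_rebalance.rb nfs-fileXX nfs-fileYY \
  --move-amount=1000 --dry-run=yes --skip=12345678
```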
Failure modes
Plenty of progress has been made recently to reduce failure cases. There are still a handful of ways that a repository can fail to replicate onto the file system of another shard.

- Checksum validation failure: the collective refs of the replica do not match the collective refs of the original.
- Timeout: some process or `grpc` operation has taken too long, and did not complete within a pre-configured or programmatically set timeout.
In both of these situations, no roll-back is required, because the error is raised by gitaly and interrupts the worker process.
Reviewing replicated repositories
It is useful, but not required, to record details about both the original repository and the replica repository. Unfortunately, the existing projects API does not include the `disk_path` attribute of a particular project. This makes it a little complicated to carefully examine details of the project repository on the file system of a gitaly shard.
- Copy the `project_id` for a project which has completed or is undergoing replication.
- Open a rails console session:

  ```
  ssh <username>-rails@console-01-sv-gprd.c.gitlab-production.internal
  ```

- Run this command (or use the one-liner after this list):

  ```
  Project.find(<project_id>)&.disk_path
  ```

- Copy the disk path of the project repository from the `storage_rebalance.rb` script output of the “successful” migration.
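Since the API does not expose `disk_path`, a hedged one-liner from a console node can save a step, assuming `gitlab-rails runner` is available there; the project ID is a hypothetical placeholder:

```
# Hedged one-liner: print a project's disk path without an interactive console.
sudo gitlab-rails runner "puts Project.find(12345678)&.disk_path"
```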
Install the info helper script
- Secure shell to the source gitaly shard node system. For example:

  ```
  ssh file-33-stor-gprd.c.gitlab-production.internal
  ```

- Download this script to the source shard node file system:

  ```
  sudo mkdir -p /var/opt/gitlab/scripts
  sudo curl --silent https://gitlab.com/gitlab-com/runbooks/raw/master/scripts/storage_repository_info.sh --output /var/opt/gitlab/scripts/storage_repository_info.sh
  sudo chmod +x /var/opt/gitlab/scripts/storage_repository_info.sh
  ```

- Now exit the shell session to that shard node.
- Repeat these steps for the destination gitaly shard system.
How to use it
Using the pretend disk path `@hashed/4a/68/4a68b75506effac26bc7660ffb4ff46cbb11ba00ed4795c1c5f0125f256d7f6a`:

```
export disk_path='@hashed/4a/68/4a68b75506effac26bc7660ffb4ff46cbb11ba00ed4795c1c5f0125f256d7f6a'
ssh file-33-stor-gprd.c.gitlab-production.internal "sudo /var/opt/gitlab/scripts/storage_repository_info.sh '${disk_path}'"
```

Users of macOS can make their lives easier using `pbcopy`:

```
ssh file-33-stor-gprd.c.gitlab-production.internal "sudo /var/opt/gitlab/scripts/storage_repository_info.sh '${disk_path}'" | pbcopy; pbpaste
```

You should execute the `storage_repository_info.sh` script on both the source and target shard node systems:

```
ssh file-43-stor-gprd.c.gitlab-production.internal "sudo /var/opt/gitlab/scripts/storage_repository_info.sh '${disk_path}'" | pbcopy; pbpaste
```

Record the results of these commands in the issue for the re-balancing operations. This may be useful diagnostic information for other engineers.
Failed replica repository deletion
To undo the replica repository creation operation, use ../../scripts/storage_repository_delete.sh:

- Download this script to the target shard node file system:

  ```
  sudo mkdir -p /var/opt/gitlab/scripts
  cd /var/opt/gitlab/scripts
  sudo curl --silent https://gitlab.com/gitlab-com/runbooks/raw/master/scripts/storage_repository_delete.sh --output /var/opt/gitlab/scripts/storage_repository_delete.sh
  sudo chmod +x /var/opt/gitlab/scripts/storage_repository_delete.sh
  ```

- Invoke a dry-run with the following, where the second parameter is the disk path of the project repository:

  ```
  sudo /var/opt/gitlab/scripts/storage_repository_delete.sh --dry-run=yes '@hashed/XX/XX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
  ```

- Review the output.
- Invoke:

  ```
  sudo /var/opt/gitlab/scripts/storage_repository_delete.sh --dry-run=no '@hashed/XX/XX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
  ```
General clean-up
After each project repository has been completely replicated to its new storage node home, each original repository must be removed from its source storage node.
Manual method
- Create a list of moved repositories to delete on `file-XX`:

  ```
  find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 -name '*+moved*.git' > files_to_remove.txt
  < files_to_remove.txt xargs du -ch | tail -n1
  ```

- Have another SRE review the files to be removed to avoid loss of data.
- Create a GCP snapshot of the disk on `file-XX` and include a link to the production issue in the snapshot description.
- Record the current disk space usage:

  ```
  df -h /dev/sdb
  ```

- Remove the files:

  ```
  < files_to_remove.txt xargs -rn1 ionice -c 3 rm -fr
  ```

- Record the recovered disk space:

  ```
  df -h /dev/sdb
  ```
Somewhat automated method
A script exists in this repo: `scripts/storage_cleanup.rb`
The goal of this script is to conduct a `find` operation on a gitaly shard node in order to discover individual project repositories which are marked as `+moved+`. When run in dry-run mode, the script will estimate the disk space that would be freed; when run live, it will re-run the find command and `rm -rf` each found remnant git repository.
Clean-up script usage
- Copy the script to your local workstation. (The script must be run from your local workstation, because it needs secure shell access to the file storage nodes which contain the remaining project repositories.)
- Confirm that the script can be run:

  ```
  bundle exec scripts/storage_cleanup.rb --help
  ```

- Conduct a dry-run of the cleanup script:

  ```
  bundle exec scripts/storage_cleanup.rb file-XX-stor-gprd.c.gitlab-production.internal --verbose --scan --dry-run=yes
  ```

- For each unique storage node listed in the dry-run output, perform a GCP snapshot of its larger disk. This way any deleted repository can be recovered, if needed. For example:

  ```
  export disk_name='file-XX-stor-gprd-data'
  gcloud auth login
  gcloud config set project gitlab-production
  export zone=$(gcloud compute disks list --filter="name=('${disk_name}')" --format=json | jq -r '.[0]["zone"]' | cut -d'/' -f9)
  echo "${zone}"
  export snapshot_name=$(gcloud compute disks snapshot "${disk_name}" --zone="${zone}" --format=json | jq -r '.[0]["name"]')
  echo "${snapshot_name}"
  gcloud compute snapshots list --filter="name=('${snapshot_name}')" --format=json | jq -r '.[0]["status"]'
  ```

- Request a review from another SRE of the dry-run execution plan output of the cleanup script.
- Finally, execute the cleanup script:

  ```
  bundle exec scripts/storage_cleanup.rb file-XX-stor-gprd.c.gitlab-production.internal --verbose --scan --dry-run=no
  ```
Verify information
Via the rails console, we have a few easy lookups to see where a project lives, what its file path is, and whether it is writable. For example:

```
[ gstg ] production> project = Project.find(12345678)
=> #<Project id:12345678 foo/bar>
[ gstg ] production> project.repository_storage
=> "nfs-file05"
[ gstg ] production> project.disk_path
=> "@hashed/4a/68/4a68b75506effac26bc7660ffb4ff46cbb11ba00ed4795c1c5f0125f256d7f6a"
[ gstg ] production> project.repository_read_only
=> false
```
Potential outcomes
Success
Both the git repo and the wiki repo will have moved to the new server, and the old directories will have been renamed `<reponame>+moved.*`.
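To confirm the rename on the source shard, a check along these lines can help; the shard hostname is a placeholder:

```
# Hedged check: list the renamed (+moved+) repository directories on the source shard.
ssh file-XX-stor-gprd.c.gitlab-production.internal \
  "sudo find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 -name '*+moved*.git'"
```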
Failure
Typical failure modes involve a scenario wherein the original git repository remains intact on the source shard, but there may be an inconsistent replica repository left on the file system of the destination shard.
It is important to note that the end-user will not notice any problems with this, so when failures like this occur, there is no reason to take any immediate corrective action.
There is currently no mandate to delete or clean up the inconsistent replica repository which was the subject of a failed replication process.
To accomplish such a task, it would be necessary to install an audit script onto each gitaly shard, scan the `/var/opt/gitlab/git-data/repositories/@hashed` directory, and query each found disk_path in the database to check for an invalid residence. A sketch of such an audit follows.
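This is a minimal sketch of that audit, assuming `gitlab-rails runner` is available on the shard and that the `ProjectRepository` model maps `disk_path` to a project; the shard name is a placeholder, and spawning a runner per repository is illustrative rather than efficient:

```
#!/usr/bin/env bash
# Hedged audit sketch: flag repositories on this shard whose database record
# says they should live elsewhere. SHARD is a placeholder.
SHARD='nfs-fileXX'
find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 \
     -name '*.git' ! -name '*.wiki.git' ! -name '*+moved*.git' |
while read -r repo_path; do
  # Strip the storage prefix and the .git suffix to recover the disk_path.
  disk_path="${repo_path#/var/opt/gitlab/git-data/repositories/}"
  disk_path="${disk_path%.git}"
  sudo gitlab-rails runner "
    record = ProjectRepository.find_by(disk_path: '${disk_path}')
    if record.nil? || record.project.repository_storage != '${SHARD}'
      puts 'invalid residence: ${disk_path}'
    end
  "
done
```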
Improvements for this script/process
- Automatically send logs to elasticsearch instead of using `tee`.
- Automate the invocation of the `storage_rebalance.rb` script with Ansible or similar. Even just having an automated script that can migrate 500GB at a time from the most used to the least used gitaly node would help make this less of a chore.
- Ideally, the GitLab application itself could autonomously balance git repositories in the background.
Behind the scenes
The `gitlab-rails` Worker which is enqueued in sidekiq to run asynchronously invokes a grpc method in gitaly called `ReplicateRepository` after creating a repository directory on the destination shard file system. If the repository directory already exists, it invokes the grpc method in gitaly called `FetchInternalRemote`, which pulls the data from the original repository into the replica repository. Once this data replication has completed, the Worker then updates `Project.repository_storage` in the application database to specify the name of the new shard, i.e. `nfs-fileYY`. The original repository is renamed by the Worker to mark it as `+moved+`. The project is marked read-only in the database throughout this entire procedure.