
`gitalyctl`

gitalyctl implements the solution spec to drain git storages. This is achieved by moving all the git repositories from the configured Gitaly storage to different Gitaly storages using `woodhouse gitalyctl storage drain`, so that the drained storages can be decommissioned.

Deployment of gitalyctl is done through the gitalyctl helm release:

  • Create a new GitLab Admin user and a PAT for that user (example).
  • Create a new `gitalyctl-api-token` secret, which is used by gitalyctl, by following the steps in this runbook. gitalyctl uses a personal admin credential, exposed via the environment variable `GITALYCTL_API_TOKEN`, which allows it to send requests to the GitLab API (see the sketch after this list).
  • Add the release to the gitalyctl release `helmfile.yaml`
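
A minimal sketch of how that secret could be created from the PAT. The secret name comes from the step above, but the key name, namespace, and the use of a plain Kubernetes secret are assumptions here; the runbook linked above is authoritative:

```sh
# Assumption: gitalyctl reads GITALYCTL_API_TOKEN from a Kubernetes secret named
# gitalyctl-api-token. The key name (token) and namespace below are illustrative.
glsh kube use-cluster ops

kubectl create secret generic gitalyctl-api-token \
  --namespace gitalyctl \
  --from-literal=token="$ADMIN_PAT"   # PAT created for the admin user in the previous step
```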

Once `woodhouse gitalyctl drain storage` is executed, the following will happen:

  1. Drain Group Wikis using GitLab’s API, moving each repository to an available Gitaly storage.
  2. Drain Snippets using GitLab’s API, moving each repository to an available Gitaly storage.
  3. Drain Project Wikis/Repositories using GitLab’s API, moving each repository to an available Gitaly storage (see the API sketch after this list).
  4. `--dry-run`: only print the projects that would be migrated, along with their `statistics.repository_size` and `repository_storage`. This lets us check that we are picking the right projects, both in local development and in the production deployment.
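
Under the hood, steps 1–3 use GitLab's repository storage move APIs. A minimal sketch of the project-level call, assuming an admin token; the project ID and storage name are placeholders, and the group and snippet drains use the analogous `group_repository_storage_moves` and `snippet_repository_storage_moves` endpoints:

```sh
# Schedule a move of a single project's repository to a specific destination storage.
# The project ID (123) and storage name are illustrative placeholders.
curl --request POST \
  --header "PRIVATE-TOKEN: $GITALYCTL_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{"destination_storage_name": "gitaly-01-stor-gprd"}' \
  "https://gitlab.com/api/v4/projects/123/repository_storage_moves"

# Check the state of the project's moves (scheduled, started, finished, failed, ...).
curl --header "PRIVATE-TOKEN: $GITALYCTL_API_TOKEN" \
  "https://gitlab.com/api/v4/projects/123/repository_storage_moves"
```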

More details on how this happens can be found in the Gitaly Multi Project solution, starting from step 2.

The main metrics for throughput are the number of GB/s we are moving and the number of moves/s. The best proxy we have is the Success Rate: the higher it is, the faster we are moving repositories.

Demo of success rate/s (source)

When draining storage, there are multiple configuration fields that can be tuned to increase throughput:

  1. `storage.concurrency`:
    • How many storages from the configured list are drained at a time.
  2. `concurrency` (Group, Snippet, Project):
    • The higher the value, the more concurrent moves it will schedule.
    • The concurrency value is per storage.
  3. `move_status_update` (Group, Snippet, Project)

When increasing throughput, keep an eye on the following bottlenecks:

  1. sidekiq: All of the move jobs run on the `gitaly_throttled` Sidekiq shard. This will be the main bottleneck; if you see a large queue length it might be time to scale up `maxReplicas` (see the sketch after this list).
  2. gitaly: Both the source and destination storages might end up resource saturated, so watch their saturation metrics while a drain is in progress.
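
One way to check the Sidekiq queue length from the command line is GitLab's Sidekiq metrics API. A minimal sketch, assuming an admin token; the exact queue names returned depend on how Sidekiq routing is configured, and jq is used here only for readability:

```sh
# List Sidekiq queue backlogs and latencies (admin-only endpoint) and pick out
# the gitaly-related queues.
curl --header "PRIVATE-TOKEN: $GITALYCTL_API_TOKEN" \
  "https://gitlab.com/api/v4/sidekiq/queue_metrics" \
  | jq '.queues | with_entries(select(.key | test("gitaly")))'
```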

During migration of Gitaly projects from current to new servers, we periodically move projects in bulk from the old servers (file-*) to the new (gitaly-*) servers. This happens concurrently while new projects are also being created on the same set of available servers. If a migration is not going as expected, we can quickly overrun the available disk space and soon be unable to create new projects for customers.

We can determine if there is an ongoing migration by looking for an uptrend in Repository usage on the new Gitaly VMs here.
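
As a complement to the dashboard, the repository storage move API shows whether project moves are currently in flight. A minimal sketch, assuming an admin token, with jq used only for readability:

```sh
# List the most recent project repository storage moves (admin-only, paginated)
# and show each move's state plus its source and destination storages.
curl --header "PRIVATE-TOKEN: $GITALYCTL_API_TOKEN" \
  "https://gitlab.com/api/v4/project_repository_storage_moves?per_page=20" \
  | jq '.[] | {state, source_storage_name, destination_storage_name}'
```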

To stop migrations in an emergency situation, run:

(gitalyctl is running in the ops cluster)

```sh
glsh kube use-cluster ops
```

Then in a separate session/window, scale down the replicas to zero:

```sh
kubectl scale deployment gitalyctl-gprd -n gitalyctl --replicas=0
```
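
To confirm the scale-down took effect, check the deployment (same cluster and namespace as above):

```sh
# READY should show 0/0 once gitalyctl has stopped.
kubectl get deployment gitalyctl-gprd -n gitalyctl
```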

As a follow-up, reach out to the #wg_disaster-recovery Slack channel to let the team know.

Note: This will not leave any repositories in a bad state.