Gitaly multi-project migration

Symptoms:

{
"error": "move storage: 3 error(s) occurred:\n* drain project storage: list projects: GET https://staging.gitlab.com/api/v4/projects: 500 {message: 500 Internal Server Error}\n* drain project storage: list projects: GET https://staging.gitlab.com/api/v4/projects: 500 {message: 500 Internal Server Error}\n* drain project storage: list projects: GET https://staging.gitlab.com/api/v4/projects: 500 {message: 500 Internal Server Error}",
"event": "shutdown",
"level": "error",
"ts": "2023-09-05T04:15:47.875264403Z"
}

source

Runbook:

  1. Find 500 error logs in the API: https://nonprod-log.gitlab.net/app/r/s/s2ES1

  2. Check json.exception.message and json.exception.class for the error message. Also note json.params.value to see which page is failing.

    screenshot of API logs showing the error

  3. Verify that you also see the 500 error locally

    $ curl -s --header "PRIVATE-TOKEN: $(op read op://private/GitLab-Staging/PAT)" "https://staging.gitlab.com/api/v4/projects?repository_storage=nfs-file02&order_by=id&sort=asc&statistics=true&per_page=100&page=12"
    {"message":"500 Internal Server Error"}
  4. Find the faulty project through the rails console.

    The offset depends on which page is failing. For example, if page=12 fails with per_page=100, the offset is (page - 1) * per_page = (12 - 1) * 100 = 1100.

    Project.where("repository_storage = ?", "nfs-file02").order(id: :asc).offset(1100).limit(100).each {|p| puts "ID: #{p.id}, valid: #{p.valid?}, errors: #{p.errors.full_messages}" unless p.valid? }; 0
    ID: 219566, valid: false, errors: ["Name can contain only letters..."]
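
The offset arithmetic from step 4 can be written as a small helper. This is only a sketch: offset_for_page is a hypothetical name, not part of GitLab, and it assumes the same per_page value that was used in the failing API request.

```ruby
# Hypothetical helper (not part of GitLab): converts a 1-based API page
# number into the SQL offset used in the Rails console query.
def offset_for_page(page, per_page = 100)
  raise ArgumentError, "page must be >= 1" if page < 1

  (page - 1) * per_page
end

puts offset_for_page(12) # prints 1100
```

The result is the value to pass to .offset(...) in the console query, with .limit(...) set to the same per_page.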

Project repository move timeout with state:initial / state:scheduled / state:started

Symptoms:

Move timeout and state="initial" / state="scheduled" / state="started"

{
"destination_storage": "gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-380a.internal",
"event": "move repository timeout",
"level": "error",
"project": "https://staging.gitlab.com/clai/cuest-web-landing",
"project_repository_move_id": 5967558,
"repository_size": 0,
"state": "scheduled",
"storage": "nfs-file05",
"timeout_duration": "1h0m0s",
"ts": "2023-09-11T05:36:39.25137642Z"
}

source

Runbook:

  1. Connect to the rails console

  2. Find the Projects::RepositoryStorageMove using the project_repository_move_id from the log message.

    [ gstg ] production> Projects::RepositoryStorageMove.find(5967558)
    =>
    #<Projects::RepositoryStorageMove:0x00007fcb1acab4e8
    id: 5967558,
    created_at: Mon, 04 Sep 2023 09:09:04.454070000 UTC +00:00,
    updated_at: Mon, 04 Sep 2023 09:09:04.454070000 UTC +00:00,
    project_id: 3359362,
    state: 2,
    source_storage_name: "nfs-file05",
    destination_storage_name:
    "gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-380a.internal">

    Here, state: 2 means the move is in the scheduled state.

  3. Mark the move as failed and disable read-only mode.

    # Mark as failure
    [ gstg ] production> Projects::RepositoryStorageMove.find(5967558).do_fail!
    => true
    # Disable read_only
    [ gstg ] production> Project.find(Projects::RepositoryStorageMove.find(5967558).project_id).update!(repository_read_only: false)
    => true
  4. The repository should be moved the next time gitalyctl fetches projects for the storage.
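
Step 2 relies on the project_repository_move_id from the log message. As a minimal sketch, assuming the log entry is available as a single JSON string, the ID can be pulled out with plain Ruby. move_details is a hypothetical helper, not part of gitalyctl or GitLab, and the sample line is a shortened copy of the log entry shown in the symptoms.

```ruby
require "json"

# Hypothetical helper: extracts the fields needed by the runbook
# from one gitalyctl log entry (a JSON string).
def move_details(log_line)
  entry = JSON.parse(log_line)
  {
    id: entry["project_repository_move_id"],
    state: entry["state"],
    storage: entry["storage"]
  }
end

# Shortened copy of the log entry from the symptoms.
log_line = '{"event":"move repository timeout","project_repository_move_id":5967558,"state":"scheduled","storage":"nfs-file05"}'

details = move_details(log_line)
# Print the Rails console lookup for this move.
puts "Projects::RepositoryStorageMove.find(#{details[:id]})"
```

The printed expression can be pasted into the Rails console for step 2.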

Symptoms:

Move failed with state="failed"

{
"caller": "snippet_repository.go:182",
"destination_storage": "gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-380a.internal",
"event": "move repository state failed",
"level": "error",
"snippet": "https://staging.gitlab.com/-/snippets/1664763",
"snippet_repository_move_id": 1040466,
"state": "failed",
"storage": "nfs-file07",
"timeout_duration": "1h0m0s",
"ts": "2023-09-13T12:59:47.594007192Z"
}

source

Runbook:

  1. Find failed Sidekiq jobs for UpdateRepositoryStorageWorker: gstg | gprd

    screenshot of logs with failed moves

  2. If you need more context about the failure (for example, when the failure came from Gitaly), fetch json.correlation_id and check the correlation dashboard: gstg | gprd

    screenshot of logs with failed moves using the correlation dashboard

  3. Confirm that the repository is not left in read-only mode

    # If Project
    [ gstg ] production> Project.find_by_full_path('sxuereb/gitaly').repository_read_only?
    => true
    # If Snippet
    [ gstg ] production> Snippet.find(3029095).repository_read_only?
    => true
    # If Group
    [ gstg ] production> Group.find(13181226).repository_read_only?
    => true
  4. If repository_read_only? returns true, change the repository to be writable.

Symptoms:

Move fails:

{
"caller": "project_repository.go:160",
"event": "move repository",
"level": "warn",
"msg": "project is read-only",
"project": "https://gitlab.com/sxuereb/gitaly",
"repository_size": 0,
"storage": "nfs-file102",
"timeout_duration": "1h0m0s",
"ts": "2023-10-27T08:43:43.15835397Z"
}

source

Runbook:

  1. Change the repository to be writable if it is read-only:

    # If Project
    [ gstg ] production> Project.find_by_full_path('sxuereb/gitaly').set_repository_writable!
    => nil
    [ gstg ] production> Project.find_by_full_path('sxuereb/gitaly').repository_read_only?
    => false
    # If Snippet
    [ gstg ] production> Snippet.find(3029095).set_repository_writable!
    => nil
    [ gstg ] production> Snippet.find(3029095).repository_read_only?
    => false
    # If Group
    [ gstg ] production> Group.find(13181226).set_repository_writable!
    => nil
    [ gstg ] production> Group.find(13181226).repository_read_only?
    => false
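
The check-then-fix pattern above is the same for projects, snippets, and groups, so it can be sketched as one duck-typed helper. ensure_repository_writable and FakeRepositoryOwner are hypothetical names introduced for illustration. The helper only assumes that the record responds to repository_read_only? and set_repository_writable!, as shown in the console session above. The stand-in class exists only so the sketch runs outside Rails.

```ruby
# Hypothetical helper: makes the repository writable only when it is
# currently read-only. Returns true if a change was made, false otherwise.
def ensure_repository_writable(record)
  return false unless record.repository_read_only?

  record.set_repository_writable!
  true
end

# Stand-in for a Project, Snippet, or Group so the sketch runs outside Rails.
class FakeRepositoryOwner
  def initialize(read_only)
    @read_only = read_only
  end

  def repository_read_only?
    @read_only
  end

  def set_repository_writable!
    @read_only = false
    nil
  end
end

record = FakeRepositoryOwner.new(true)
puts ensure_repository_writable(record) # prints true
puts record.repository_read_only?       # prints false
```

In a Rails console, the same helper could be called with the object returned by Project.find_by_full_path, Snippet.find, or Group.find.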

Timeout waiting for project repository pushes

Symptoms:

  • Sidekiq jobs failing with this exception message: Timeout waiting for project repository pushes (Kibana)

Runbook:

  • Check the project’s reference counter several times. If the number is not growing, you can reset the counter and wait for the project to be moved by gitalyctl:

    project = Project.find_by_full_path('sxuereb/gitaly')
    project.reference_counter(type: project.repository.repo_type).value
    # => 6
    project.reference_counter(type: project.repository.repo_type).value
    # => 6
    project.reference_counter(type: project.repository.repo_type).reset!
    # => true
  • If the number is increasing over time, the move has to be scheduled manually. In a Rails console, run all of the following:

    # Patch for this console session only: run the worker inline
    # instead of enqueueing it.
    class Projects::RepositoryStorageMove
      override :schedule_repository_storage_update_worker
      def schedule_repository_storage_update_worker
        Projects::UpdateRepositoryStorageWorker.new.perform(
          project_id,
          destination_storage_name,
          id
        )
      end
    end

    project = Project.find_by_full_path('sxuereb/gitaly')
    # Try these two lines until you get a `true`
    project.reference_counter(type: project.repository.repo_type).reset!
    project.repository_storage_moves.build(source_storage_name: project.repository_storage).schedule
    # => true
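
Deciding between the two cases above means sampling the reference counter a few times and comparing the readings. A minimal sketch of that comparison follows. counter_growing? is a hypothetical helper, not part of GitLab, and the number of samples and the interval are arbitrary assumptions. The block passed in is expected to return the current counter value.

```ruby
# Hypothetical helper: samples a counter several times and reports whether
# the last reading is higher than the first one.
def counter_growing?(samples: 3, interval: 1)
  values = []
  samples.times do |i|
    values << yield
    # Wait between readings, but not after the last one.
    sleep(interval) if i < samples - 1
  end
  values.last > values.first
end

# Stand-in readings so the sketch runs outside Rails.
readings = [6, 6, 6].each
puts counter_growing?(samples: 3, interval: 0) { readings.next } # prints false
```

In a Rails console the block would instead read project.reference_counter(type: project.repository.repo_type).value, as in the commands above.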

For more context around this error and its remedies, please check this thread.