Gitaly multi-project migration
gitalyctl
500 Internal Server Error
Symptoms:
{ "error": "move storage: 3 error(s) occurred:\n* drain project storage: list projects: GET https://staging.gitlab.com/api/v4/projects: 500 {message: 500 Internal Server Error}\n* drain project storage: list projects: GET https://staging.gitlab.com/api/v4/projects: 500 {message: 500 Internal Server Error}\n* drain project storage: list projects: GET https://staging.gitlab.com/api/v4/projects: 500 {message: 500 Internal Server Error}", "event": "shutdown", "level": "error", "ts": "2023-09-05T04:15:47.875264403Z"}
Runbook:
- Find 500 error logs in the API: https://nonprod-log.gitlab.net/app/r/s/s2ES1
- Check json.exception.message and json.exception.class for the error message. Also note json.params.value to see which page is failing.
- Verify that you also see the 500 error locally:

    $ curl -s --header "PRIVATE-TOKEN: $(op read op://private/GitLab-Staging/PAT)" "https://staging.gitlab.com/api/v4/projects?repository_storage=nfs-file02&order_by=id&sort=asc&statistics=true&per_page=100&page=12"
    {"message":"500 Internal Server Error"}

- Find the faulty project through the rails console. The offset depends on which page is failing. For example, if page=12 is failing, the offset is (page - 1) * 100 = 1100.

    Project.where("repository_storage = ?", "nfs-file02").order(id: :asc).offset(1100).limit(100).each {|p| puts "ID: #{p.id}, valid: #{p.valid?}, errors: #{p.errors.full_messages}" unless p.valid? }; 0
    ID: 219566, valid: false, errors: ["Name can contain only letters..."]
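The offset arithmetic above can also be derived directly from the failing request URL. The following is a minimal sketch in plain Ruby, assuming the URL carries page and per_page query parameters as in the curl example. The helper name offset_for_failing_page is hypothetical and is not part of gitalyctl or GitLab.

```ruby
require "uri"

# Hypothetical helper: derive the ActiveRecord offset from a failing API URL.
# Assumes a 1-based `page` and a `per_page` query parameter, as in the curl call above.
def offset_for_failing_page(url)
  params = URI.decode_www_form(URI.parse(url).query.to_s).to_h
  page = Integer(params.fetch("page"))
  per_page = Integer(params.fetch("per_page"))
  (page - 1) * per_page
end

url = "https://staging.gitlab.com/api/v4/projects?repository_storage=nfs-file02&order_by=id&sort=asc&statistics=true&per_page=100&page=12"
puts offset_for_failing_page(url) # prints 1100
```

The result can be passed to .offset(...) in the rails console query, with .limit(...) set to the same per_page value.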
Project repository move timeout with state:initial / state:scheduled / state:started
Symptoms:
Move timeout and state="initial" / state="scheduled" / state="started"
{ "destination_storage": "gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-380a.internal", "event": "move repository timeout", "level": "error", "project": "https://staging.gitlab.com/clai/cuest-web-landing", "project_repository_move_id": 5967558, "repository_size": 0, "state": "scheduled", "storage": "nfs-file05", "timeout_duration": "1h0m0s", "ts": "2023-09-11T05:36:39.25137642Z"}
Runbook:
- Connect to the rails console.
- Find the Projects::RepositoryStorageMove using the project_repository_move_id from the log message.

    [ gstg ] production> Projects::RepositoryStorageMove.find(5967558)
    =>
    #<Projects::RepositoryStorageMove:0x00007fcb1acab4e8
     id: 5967558,
     created_at: Mon, 04 Sep 2023 09:09:04.454070000 UTC +00:00,
     updated_at: Mon, 04 Sep 2023 09:09:04.454070000 UTC +00:00,
     project_id: 3359362,
     state: 2,
     source_storage_name: "nfs-file05",
     destination_storage_name: "gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-380a.internal">

  state: 2 means it’s scheduled.
- Mark the move as a failure, and disable read-only mode.

    # Mark as failure
    [ gstg ] production> Projects::RepositoryStorageMove.find(5967558).do_fail!
    => true
    # Disable read_only
    [ gstg ] production> Project.find(Projects::RepositoryStorageMove.find(5967558).project_id).update!(repository_read_only: false)
    => true

- The repository should be moved in the next round, when gitalyctl fetches projects for the storage.
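When several timeout lines show up at once, the move IDs to look up can be pulled out of the gitalyctl logs programmatically. The following is a rough sketch in plain Ruby, assuming each log line is a single JSON object with the event, state, and project_repository_move_id fields shown in the symptoms. The helper name stuck_project_move_id is hypothetical.

```ruby
require "json"

# States in which a timed-out move is treated as stuck, per the symptoms above.
STUCK_STATES = %w[initial scheduled started].freeze

# Hypothetical helper: return the move ID to look up in the rails console,
# or nil when the log line is not a stuck project move timeout.
def stuck_project_move_id(log_line)
  entry = JSON.parse(log_line)
  return nil unless entry["event"] == "move repository timeout"
  return nil unless STUCK_STATES.include?(entry["state"])

  entry["project_repository_move_id"]
end

line = '{"event": "move repository timeout", "level": "error", "project_repository_move_id": 5967558, "state": "scheduled", "storage": "nfs-file05"}'
puts stuck_project_move_id(line) # prints 5967558
```

The returned ID is the value passed to Projects::RepositoryStorageMove.find in the runbook steps.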
Move repository state=failed
Symptoms:
Move failed with state="failed"
{ "caller": "snippet_repository.go:182", "destination_storage": "gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-380a.internal", "event": "move repository state failed", "level": "error", "snippet": "https://staging.gitlab.com/-/snippets/1664763", "snippet_repository_move_id": 1040466, "state": "failed", "storage": "nfs-file07", "timeout_duration": "1h0m0s", "ts": "2023-09-13T12:59:47.594007192Z"}
Runbook:
- Find failed sidekiq jobs for UpdateRepositoryStorageWorker: gstg | gprd
- If you need more context about the failure, for example if the failure came from Gitaly, fetch json.correlation_id and check the correlation dashboard: gstg | gprd
- Confirm that the repository is not left in read-only mode:

    # If Project
    [ gstg ] production> Project.find_by_full_path('sxuereb/gitaly').repository_read_only?
    => true
    # If Snippet
    [ gstg ] production> Snippet.find(3029095).repository_read_only?
    => true
    # If Group
    [ gstg ] production> Group.find(13181226).repository_read_only?
    => true

- Change the repository to be writable if repository_read_only? => true
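The read-only check needs a project path or a snippet ID, and both can be read from the failing log entry. The following is a small sketch in plain Ruby that builds the console command from a log line. It assumes the project and snippet fields hold URLs shaped like the ones in the log samples, and it does not handle group moves because no group log sample is shown here. The helper name read_only_check_for is hypothetical.

```ruby
require "json"
require "uri"

# Hypothetical helper: turn a gitalyctl log entry into the console lookup
# used in the read-only check. Only the "project" and "snippet" keys seen in
# the log samples are handled; anything else returns nil.
def read_only_check_for(log_line)
  entry = JSON.parse(log_line)
  if entry.key?("project")
    full_path = URI.parse(entry["project"]).path.delete_prefix("/")
    "Project.find_by_full_path('#{full_path}').repository_read_only?"
  elsif entry.key?("snippet")
    snippet_id = Integer(URI.parse(entry["snippet"]).path.split("/").last)
    "Snippet.find(#{snippet_id}).repository_read_only?"
  end
end

snippet_line = '{"event": "move repository state failed", "snippet": "https://staging.gitlab.com/-/snippets/1664763", "state": "failed"}'
puts read_only_check_for(snippet_line)
# prints Snippet.find(1664763).repository_read_only?
```

The helper only produces the command text; it still has to be run in the rails console for the matching environment.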
Repository read-only
Symptoms:
Move fails:
{ "caller": "project_repository.go:160", "event": "move repository", "level": "warn", "msg": "project is read-only", "project": "https://gitlab.com/sxuereb/gitaly", "repository_size": 0, "storage": "nfs-file102", "timeout_duration": "1h0m0s", "ts": "2023-10-27T08:43:43.15835397Z"}
Runbook:
- Change the repository to be writable if it is read-only:

    # If Project
    [ gstg ] production> Project.find_by_full_path('sxuereb/gitaly').set_repository_writable!
    => nil
    [ gstg ] production> Project.find_by_full_path('sxuereb/gitaly').repository_read_only?
    => false
    # If Snippet
    [ gstg ] production> Snippet.find(3029095).set_repository_writable!
    => nil
    [ gstg ] production> Snippet.find(3029095).repository_read_only?
    => false
    # If Group
    [ gstg ] production> Group.find(13181226).set_repository_writable!
    => nil
    [ gstg ] production> Group.find(13181226).repository_read_only?
    => false
Timeout waiting for project repository pushes
Symptoms:
- Sidekiq jobs failing with this exception message: Timeout waiting for project repository pushes (Kibana)
Runbook:
- Check the project’s reference counter multiple times. If the number is not growing, you can reset the counter and wait for the project to be moved by gitalyctl:

    project = Project.find_by_full_path('sxuereb/gitaly')
    project.reference_counter(type: project.repository.repo_type).value
    # => 6
    project.reference_counter(type: project.repository.repo_type).value
    # => 6
    project.reference_counter(type: project.repository.repo_type).reset!
    # => true

- If the number is increasing over time, the move has to be scheduled manually. In a Rails console, run all of the following:

    class Projects::RepositoryStorageMove
      override :schedule_repository_storage_update_worker
      def schedule_repository_storage_update_worker
        Projects::UpdateRepositoryStorageWorker.new.perform(
          project_id,
          destination_storage_name,
          id
        )
      end
    end

    project = Project.find_by_full_path('sxuereb/gitaly')
    # Try these two lines until you get a `true`
    project.reference_counter(type: project.repository.repo_type).reset!
    project.repository_storage_moves.build(source_storage_name: project.repository_storage).schedule
    # => true
For more context around this error and its remedies, please check this thread.
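The "check the counter multiple times" step can be made less ad hoc by sampling the value at a fixed interval and comparing the first and last readings. The following is a minimal sketch in plain Ruby. The helper names sample_counter and counter_growing? are hypothetical, the "last reading exceeds the first" rule is a simplifying assumption, and the example feeds in stand-in values instead of calling a real reference counter.

```ruby
# Hypothetical helper: call the block `times` times, `interval` seconds apart,
# and return the sampled values.
def sample_counter(times: 3, interval: 5)
  Array.new(times) do |i|
    sleep(interval) if i.positive?
    yield
  end
end

# The counter is treated as "growing" when the last sample exceeds the first.
def counter_growing?(samples)
  samples.last > samples.first
end

# Stand-in values; in a rails console the block would instead be
# project.reference_counter(type: project.repository.repo_type).value
fake_values = [6, 6, 6].each
samples = sample_counter(times: 3, interval: 0) { fake_values.next }
puts counter_growing?(samples) ? "schedule the move manually" : "reset the counter"
# prints "reset the counter"
```

A longer interval or more samples gives a more reliable signal on busy projects, at the cost of a slower check.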