GitLab application-side reindexing
GitLab comes with an application-side cronjob to execute database reindexing automatically and in background. This runbook links a couple relevant resources and documentation for this and contains steps to cancel reindexing actions during an emergency.
What is reindexing?
Section titled “What is reindexing?”Please refer to pg_repack.md#what-is-bloat for more background on bloat. Application-side reindexing is a background process that automatically deals with standard btree postgres indexes. It currently only supports “non-special” indexes, i.e. it does not support primary key indexes, unique indexes, expression indexes etc.. However, the majority of indexes is supported.
From a workflow perspective, when the reindexing kicks off, it estimates index bloat for all supported indexes using a heuristic. It then chooses the top N indexes by their bloat level and starts to recreate those. The recreating phase is similar to what pg_repack
does under the hoods:
- It creates a new temporary index on the sides and
- then performs a renaming operation to swap indexes.
(1) is low-risk in terms of locks but incurs additional disk IO to create the replacement index. (2) needs a elevated lock on the table to perform the renaming operations. We have guards in place, so this does not end up stalling the site for a longer period of time. If this guard doesn’t work out, the process is cancelled and cleaned up automatically (so we are not forcing the lock). It will be retried upon the next invocation.
Schedule
Section titled “Schedule”The schedule is configured through Chef. The cronjob executes on the deploy-node.
How automation picks indexes
Section titled “How automation picks indexes”In order to pick relevant indexes for reindexing, the job uses a bloat heuristic to determine bloat levels for all indexes. Any run of the job picks the 2 most bloated indexes by their relative bloat level.
The bloat heuristic is only available for btree and GiST indexes, GIN indexes are not supported.
Overall, the following limitations apply currently:
- Type: Only btree and GiST indexes
- Maximum size: Automatic reindexing only for indexes < 100 GB current size.
Explicit reindexing queue
Section titled “Explicit reindexing queue”In order to schedule specific reindexes for a rebuild, we can enqueue those indexes in the “reindexing queue”. This queue is consumed and emptied before any indexes are being picked from the bloat heuristic described above. This can be handy for indexes that are not supported, e.g. GIN indexes or indexes beyond a certain size (see above).
As an example, let’s schedule public.index_merge_request_diff_commits_on_sha
for a rebuild:
$ gitlab-rake gitlab:db:enqueue_reindexing_action[public.index_merge_request_diff_commits_on_sha]
Queued reindexing action: queued action [ id = 1, index: public.index_merge_request_diff_commits_on_sha ]There are 1 queued actions in total.
After having enqueued the index for a rebuild, the subsequent run of gitlab-rake gitlab:db:reindex
is going to execute the rebuild for this and other queued indexes, if any (up to 2 indexes in total per execution). If the queue is empty, we fall back to picking indexes automatically.
The queue can be examined, for example:
gitlabhq_production=# select * from postgres_reindex_queued_actions WHERE state = 0; id | index_identifier | state | created_at | updated_at----+---------------------------------------------------------------+-------+-------------------------------+------------------------------- 1 | public.index_merge_request_diff_commits_on_sha | 0 | 2021-11-04 10:35:05.396064+00 | 2021-11-04 10:35:05.396064+00
States for these queued actions include:
state = 0: queuedstate = 1: donestate = 2: failed
How to monitor
Section titled “How to monitor”Grafana gets annotations for any reindexing action that is happening. As an example, the PostgreSQL Overview has those enabled for production. Annotations have tags reindex
and gprd
or gstg
among others, so this can be filtered upon.
In order to understand ongoing and historic reindexing actions, one can also peek at the tracking table in PostgreSQL (using a psql session):
gitlabhq_production=# select * from postgres_reindex_actions order by action_start desc limit 10; id | action_start | action_end | ondisk_size_bytes_start | ondisk_size_bytes_end | state | index_identifier | bloat_estimate_bytes_start------+-------------------------------+-------------------------------+-------------------------+-----------------------+-------+-------------------------------------------------------------------+---------------------------- 1252 | 2021-01-29 13:00:18.806929+00 | 2021-01-29 13:00:26.122366+00 | 156024832 | 75317248 | 1 | public.index_projects_on_last_repository_check_failed | 80945152 1251 | 2021-01-29 12:57:15.52368+00 | 2021-01-29 13:00:18.666019+00 | 806748160 | 800751616 | 1 | public.index_ci_builds_on_auto_canceled_by_id | 326344704 1250 | 2021-01-29 12:20:39.034396+00 | 2021-01-29 12:20:49.181661+00 | 163807232 | 161005568 | 1 | public.index_label_links_on_target_id_and_target_type | 9543680 1249 | 2021-01-29 12:14:32.051548+00 | 2021-01-29 12:20:38.900695+00 | 2526797824 | 2544500736 | 1 | public.index_notes_on_discussion_id | 190005248 1248 | 2021-01-29 11:18:23.072552+00 | 2021-01-29 11:18:38.730528+00 | 123117568 | 112795648 | 1 | public.index_merge_requests_on_created_at | 10723328 1247 | 2021-01-29 11:14:24.531267+00 | 2021-01-29 11:18:22.915425+00 | 2256084992 | 2244321280 | 1 | public.index_ci_builds_on_commit_id_and_type_and_name_and_ref | 185147392 1246 | 2021-01-29 10:14:56.482897+00 | 2021-01-29 10:15:25.278484+00 | 180912128 | 180207616 | 1 | public.index_issues_on_project_id_and_closed_at | 50782208 1245 | 2021-01-29 10:14:13.057911+00 | 2021-01-29 10:14:56.421322+00 | 537165824 | 536363008 | 1 | public.index_ci_pipelines_on_project_id_and_ref_and_status_and_id | 66338816 1244 | 2021-01-29 09:14:31.813057+00 | 2021-01-29 09:14:37.906965+00 | 76464128 | 75309056 | 1 | public.index_project_pages_metadata_on_artifacts_archive_id | 31367168 1243 | 2021-01-29 09:14:25.056542+00 | 2021-01-29 09:14:31.749049+00 | 76832768 | 75309056 | 1 | public.index_project_pages_metadata_on_pages_deployment_id | 31506432(10 rows)
The state
column translates as follows:
- 0 -
started
- 1 -
finished
- 2 -
failed
How to stop
Section titled “How to stop”In order to disable the cronjob, we can disable the database_reindexing
feature flag (set to false). Further invocations of the cronjob will become a no-op until it is enabled again.
However, this does not affect an ongoing reindexing operation. If this operation needs to be stopped in an emergency situation, the following steps can be considered:
Cancel the postgres backend performing the index rebuild
Section titled “Cancel the postgres backend performing the index rebuild”- Log into the postgres primary using psql
- Check current activity:
select pid, query from pg_stat_activity where query ~* 'create index';
- In order to cancel the index build, signal the PG backend:
select pg_cancel_backend(pid) from pg_stat_activity where query ~* 'create index' and pid <> pg_backend_pid()
- As a last resort, terminate the PG backend (this translates to
SIGKILL
and should be used only as a last resort and upon review):select pg_terminate_backend(pid) from pg_stat_activity where query ~* 'create index' and pid <> pg_backend_pid()
Cancel the process on the deploy node
Section titled “Cancel the process on the deploy node”Log into the deploy host and terminate the reindexing process: pgrep -f gitlab:db:reindex| xargs kill
Documentation
Section titled “Documentation”Metrics
Section titled “Metrics”- Long-term tamland forecast (internal link)