Container Registry Database Index Bloat

This document summarizes the troubleshooting and corrective steps taken while investigating gitlab-com/gl-infra/capacity-planning#39, the first potential index bloat saturation event for the Container Registry database.

Metrics

The Prometheus base query used for database index bloat forecasts is gitlab_component_saturation:ratio{type="patroni-registry", component="pg_btree_bloat"}. We can easily visualize the overall bloat trend in Thanos using this query.

The query above only shows the overall bloat. To see the estimated bloat for individual indexes, we can use the gitlab_database_bloat_btree_bloat_size metric.

The above metrics are fed by the bloat estimation queries from github.com/ioguix/pgsql-bloat-estimation. Therefore, when dealing with index bloat alerts, we should rely on these metrics rather than on other estimation mechanisms such as pgstatindex, although those can still be useful as a complement.

Identifying Top Bloated Indexes

We can see the top 100 most bloated indexes using this query.
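
Once per-index bloat estimates have been exported (for example, the values of gitlab_database_bloat_btree_bloat_size), ranking them is straightforward. The following is a minimal sketch, not the query referenced above: it assumes the estimates are already available as a mapping from index name to estimated bloat in bytes, and the function name and sample index names are hypothetical.

```python
from typing import Dict, List, Tuple


def top_bloated_indexes(
    bloat_bytes: Dict[str, float], limit: int = 100
) -> List[Tuple[str, float]]:
    """Return up to `limit` (index_name, bloat_bytes) pairs, largest bloat first.

    Ties are broken by index name so the output is deterministic.
    """
    ranked = sorted(bloat_bytes.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical sample data; real values would come from the bloat metric.
samples = {
    "example_index_a": 3.0,
    "example_index_b": 10.0,
    "example_index_c": 10.0,
    "example_index_d": 1.0,
}

for name, size in top_bloated_indexes(samples, limit=2):
    print(f"{name}: {size:.0f} bytes of estimated bloat")
```

Sorting by estimated bloat size rather than bloat ratio favours the indexes that reclaim the most space when rebuilt; whether size or ratio is the better ranking key here is a judgment call.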

Fixing Index Bloat

The easiest and safest way to fix index bloat is to concurrently reindex the top bloated indexes.

gitlab-com/gl-infra/capacity-planning#39 was the first time this need arose. If it occurs frequently enough in the future, we may want to pursue periodic/automatic reindexing on the application side. Until then, we should:

Identify the top 100 most bloated indexes, as described in the previous section;
Raise a Production Change Request to concurrently reindex these indexes. Follow gitlab-com/gl-infra/production#8175 as an example.
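
To prepare the change request, the list of index names can be turned into REINDEX INDEX CONCURRENTLY statements (available in PostgreSQL 12 and later). The sketch below is only an illustration and is not taken from gitlab-com/gl-infra/production#8175: the helper names, the schema name and the index names are hypothetical, and the actual schema should be checked against the database.

```python
from typing import Iterable, List


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def reindex_statements(indexes: Iterable[str], schema: str) -> List[str]:
    """Build one REINDEX INDEX CONCURRENTLY statement per index.

    REINDEX ... CONCURRENTLY cannot run inside a transaction block, so each
    statement is meant to be executed on its own.
    """
    return [
        f"REINDEX INDEX CONCURRENTLY {quote_ident(schema)}.{quote_ident(index)};"
        for index in indexes
    ]


# Hypothetical schema and index names, for illustration only.
statements = reindex_statements(
    ["example_index_a", "example_index_b"], schema="example_schema"
)
print("\n".join(statements))
```

Emitting one statement per index keeps the change request easy to review and allows the run to be paused or resumed between indexes.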

If reindexing the top 100 is not enough, then we can move further and target the next 100. The majority of the registry tables have 64 partitions, so there are hundreds of indexes we can target.
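
Working through the ranked list in groups of 100 can be sketched as follows. This assumes the indexes are already ordered from most to least bloated; the function name and the generated index names are hypothetical.

```python
from typing import List, Sequence


def batches(ranked_indexes: Sequence[str], size: int = 100) -> List[List[str]]:
    """Split an ordered list of index names into consecutive batches of `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [
        list(ranked_indexes[start:start + size])
        for start in range(0, len(ranked_indexes), size)
    ]


# Hypothetical ranked list of 250 index names.
ranked = [f"example_index_{n}" for n in range(250)]
groups = batches(ranked)
print([len(group) for group in groups])
```

Each batch can then go into its own change request, with the bloat metric re-checked between batches to decide whether the next one is needed.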