Grafana Mimir Service
- Service Overview
- Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22mimir%22%2C%20tier%3D%22inf%22%7D
- Label: gitlab-com/gl-infra/production~"Service::Mimir"
Logging
Quick Links
| Reference | Link |
|---|---|
| Helm Deployment | helmfiles |
| Tenant Configuration | config-mgmt |
| Runbooks | Grafana Runbooks |
| Dashboards | Mimir Overview |
| Logs | Elastic Cloud |
Troubleshooting
If you received a page for Mimir, first determine whether the problem is on the write path, the read path, or with recording rule evaluation, and whether it is isolated to a single tenant or affecting all tenants.
We have some useful dashboards to reference for a quick view of system health, starting with the Mimir Overview dashboard linked in the Quick Links above.
There are other useful operational dashboards you can navigate to from the top right, under “Mimir dashboards”.
When checking tenants, the key metrics/questions are:
- Is the tenant exceeding a quota?
  - To increase quotas, see the getting-started docs.
- Is the “Newest seen sample age” recent?
  - If no recent samples are coming in, the remote-write client may be experiencing issues and not sending any data.
- Are any series being dropped under “Distributor and ingester discarded samples rate”? (See the query sketch after this list.)
  - Dropped samples are usually the result of a quota being exceeded, so refer to the quota point above.
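As a rough, ad-hoc alternative to the dashboard panel, the discarded-samples check can also be run as an instant query against Mimir's Prometheus-compatible API. The sketch below is illustrative only: the `MIMIR_URL` value and the `X-Scope-OrgID` header value are placeholders and need to match our deployment; the dashboards above give the same view without any of this.

```python
# Minimal sketch: ad-hoc check of the discarded samples rate per tenant and reason.
# Assumptions: MIMIR_URL points at a reachable query frontend, and the
# X-Scope-OrgID header value is valid for our deployment (both are placeholders).
import requests

MIMIR_URL = "http://mimir-query-frontend.example.internal:8080"  # hypothetical endpoint
HEADERS = {"X-Scope-OrgID": "gitlab"}  # hypothetical tenant header

# cortex_discarded_samples_total is labelled by tenant ("user") and "reason",
# so this shows which tenant is dropping samples and why (e.g. a limit being hit).
QUERY = 'sum by (user, reason) (rate(cortex_discarded_samples_total[5m])) > 0'

resp = requests.get(
    f"{MIMIR_URL}/prometheus/api/v1/query",
    params={"query": QUERY},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    rate = float(result["value"][1])
    print(f"tenant={labels.get('user')} reason={labels.get('reason')} rate={rate:.2f}/s")
```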
It’s also worth checking the observability alerts channel on Slack, #g_infra_observability_alerts, as there is much more targeted alerting there with direct links to the appropriate runbooks.
Runbooks
We use a slightly refactored version of the Grafana Monitoring Mixin for much of the operational monitoring.
As such, the Grafana Runbooks apply to our alerts as well, and are the best source of information for troubleshooting.
Onboarding
See the getting-started readme.
Cardinality Management
Metrics cardinality is the silent performance killer in Prometheus.
Start with the cardinality-management readme to help identify problem metrics.
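For a quick, ad-hoc look at which metric names carry the most series, an instant query like the one below can complement the readme. This is a sketch only, assuming a reachable query frontend; the endpoint and tenant header are placeholders for our setup.

```python
# Minimal sketch: list the metric names with the highest series counts for a tenant.
# Assumptions: MIMIR_URL and the X-Scope-OrgID value are placeholders.
import requests

MIMIR_URL = "http://mimir-query-frontend.example.internal:8080"  # hypothetical endpoint
HEADERS = {"X-Scope-OrgID": "gitlab"}  # hypothetical tenant header

# Counting series per metric name is an expensive query on a large tenant,
# so keep the topk small and run it against one tenant at a time.
QUERY = 'topk(20, count by (__name__) ({__name__=~".+"}))'

resp = requests.get(
    f"{MIMIR_URL}/prometheus/api/v1/query",
    params={"query": QUERY},
    headers=HEADERS,
    timeout=60,
)
resp.raise_for_status()

results = resp.json()["data"]["result"]
for result in sorted(results, key=lambda r: float(r["value"][1]), reverse=True):
    name = result["metric"].get("__name__", "<aggregated>")
    series = int(float(result["value"][1]))
    print(f"{name}: {series} series")
```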
Architecture
We deploy Mimir in microservices mode via helmfiles.
There are additional GCP components deployed via the Helm chart using config-connector,
including storage buckets and IAM policies. These components are deployed to the gitlab-observability
GCP project, as this keeps the config-connector permissions scoped and the blast radius limited to the observability services.
Capacity Planning
There are some good capacity planning docs from Grafana here.
These include guidelines around sizing for the various components in Mimir.
Keep in mind that at GitLab we have some incredibly high-cardinality metrics, and while these numbers serve as good guidelines, we often require more resources than recommended.
Scaling Mimir
Scaling up
All components in Mimir are horizontally scalable.
We have autoscaling in place for the following components:
- Distributor
- Querier
- Query-Frontend
All components can be scaled up without concern.
The main consideration when scaling up is that, with shuffle sharding enabled, new pods might not pick up any workload, depending on shard assignments.
There are runbooks for the various components explaining the cause and fix in more detail.
Scaling down
Scaling down stateless components can be done without issue, with only the usual concerns about saturation and ensuring enough resources remain available.
There are several stateful components in Mimir that require special consideration when scaling down.
- Alertmanagers
- Ingesters
- Store-Gateways
Scaling these down must follow a defined process, as they hold recent data used for querying; removing that data unexpectedly can cause missing datapoints.
More details on scaling down these components can be read here.
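As context for why a process is needed, Mimir's ingesters expose an admin endpoint that tells a specific ingester to flush its data and unregister from the ring when it terminates, which is the kind of step the documented scale-down builds on. The sketch below is illustrative only, covers ingesters only, and assumes direct access to a single ingester pod (for example via port-forwarding); follow the linked docs rather than this snippet for the actual procedure.

```python
# Illustrative sketch only: mark one ingester for graceful shutdown so that, on
# termination, it flushes its data and unregisters from the ring.
# Assumption: the ingester's HTTP port is reachable locally, e.g. via
# `kubectl port-forward <ingester-pod> 8080:8080`. This is NOT the full
# documented scale-down procedure linked above.
import requests

INGESTER_URL = "http://127.0.0.1:8080"  # hypothetical port-forwarded ingester

# POST sets the prepare-shutdown flag; GET reports it; DELETE clears it.
resp = requests.post(f"{INGESTER_URL}/ingester/prepare-shutdown", timeout=30)
resp.raise_for_status()

status = requests.get(f"{INGESTER_URL}/ingester/prepare-shutdown", timeout=30)
print("prepare-shutdown state:", status.text.strip())
```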