GET Monitoring Setup

This documentation outlines how to set up the staging-ref environment to work with GitLab infrastructure monitoring.

  • A private cluster is preferred for setting up Alertmanager.

GET sets up Prometheus and Grafana in a VM, and the default GitLab Helm chart values also enable Prometheus and Grafana. These bundled instances will not be used and can be disabled. You can view examples of how to do this via the following MRs:

```yaml
global:
  # Disable Grafana
  grafana:
    enabled: false
...
# Disable built-in Prometheus
prometheus:
  install: false
```

Labels help organize metrics by service. They can be added via the GitLab Helm chart.

  • Labels need to be added to the GitLab Helm values:

```yaml
global:
  common:
    labels:
      stage: main
      shard: default
      tier: sv
```
  • Deployment labels need to be added. For an up-to-date list, check out gitlab_charts.yml.j2 in the staging-ref repository.

Prometheus is an open-source monitoring and alerting tool used to monitor all services within GitLab infrastructure. You can read more about the technical details of the project here.

Prometheus-stack is a Helm chart that bundles cluster monitoring with Prometheus using the Prometheus Operator. We'll be using this chart to deploy Prometheus.

  • Deploy to the GET cluster under the prometheus namespace via Helm. In staging-ref, this is managed by CI jobs that validate and apply any changes to the Helm chart. You can view the setup of this chart in this directory.
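For illustration, here is a minimal sketch of what such a CI deploy job could look like, assuming a recent kube-prometheus-stack chart. The job name, stage, and values path are placeholders, not the actual staging-ref jobs:

```yaml
# Hypothetical CI job sketch; names and paths are assumptions.
deploy-prometheus-stack:
  stage: deploy
  script:
    # Fetch the chart from the upstream prometheus-community repository
    - helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    - helm repo update
    # Install or upgrade the chart into the prometheus namespace
    - >-
      helm upgrade --install prometheus-stack
      prometheus-community/kube-prometheus-stack
      --namespace prometheus --create-namespace
      --values prometheus-stack/values.yaml
```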

Scrape targets are configured in the values.yaml file under the prometheus-stack directory. Relabeling is applied to the scrape targets so their labels match what is used in staging and production. A hedged example of both target types follows the list below.

  1. Kubernetes targets. Prometheus scrape targets can be found in additionalPodMonitors and additionalServiceMonitors in values.yaml.

  2. Omnibus targets. Prometheus scrape targets can be found under additionalScrapeConfigs in values.yaml.
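As a rough illustration, the two kinds of targets could look like this in values.yaml. The service, job, address, and relabeling details below are hypothetical; the real targets live in the staging-ref repository:

```yaml
prometheus:
  # 1. Kubernetes targets: extra ServiceMonitors rendered by the chart
  additionalServiceMonitors:
    - name: gitlab-webservice          # hypothetical service
      selector:
        matchLabels:
          app: webservice
      namespaceSelector:
        matchNames:
          - gitlab
      endpoints:
        - port: http-metrics
  prometheusSpec:
    # 2. Omnibus targets: raw Prometheus scrape configs
    additionalScrapeConfigs:
      - job_name: omnibus-node         # hypothetical job
        static_configs:
          - targets:
              - 10.0.0.5:9100          # hypothetical Omnibus VM address
        relabel_configs:
          # Example relabeling: derive an fqdn-style label from the address
          - source_labels: [__address__]
            regex: '([^:]+):\d+'
            target_label: fqdn
            replacement: '$1'
```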

Exporters expose ("export") existing metrics from their applications or services so that Prometheus can scrape them. A few of them are disabled by default, and we'll need to enable them in order to use them. Exporters that need to be enabled manually within the GitLab Helm values are:

Thanos is a set of components that aids in maintaining a highly available Prometheus setup with long-term storage capabilities. Here we will not be deploying Thanos, but rather connecting Prometheus to our already existing Thanos cluster.

The sidecar component of Thanos is deployed alongside the Prometheus instance. This configuration exists in the prometheus-stack Helm chart. Thanos-sidecar backs up Prometheus data into an object storage bucket and gives other Thanos components access to the Prometheus metrics via a gRPC API.
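As a rough sketch, the sidecar is typically wired up in the prometheus-stack values along these lines. The secret name is an assumption for this example, and the exact field names depend on the chart version:

```yaml
prometheus:
  prometheusSpec:
    thanos:
      # Secret containing the Thanos object storage configuration
      # (an objstore.yml pointing at the GCS bucket); name is hypothetical.
      objectStorageConfig:
        name: thanos-objstore-config
        key: objstore.yml
```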

  1. To be added in Terraform:
  • An external IP to use as a loadBalancerIP.
  • A GCS bucket for Prometheus data.
  • The service account used by thanos-store in Kubernetes for access to its GCS bucket.
  • A workload identity to be used by Thanos.
  2. To be configured in the Helm chart (see the sketch after this list for one hedged example):
  3. Lastly, you need to add Prometheus to thanos-store in order to be able to query historical data from thanos-query. You can view an example MR on how to do that here.
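For instance, exposing the sidecar's gRPC endpoint through the external IP created in Terraform could look roughly like this in the prometheus-stack values. The field names assume a recent kube-prometheus-stack chart, and the IP is a placeholder:

```yaml
prometheus:
  # Expose the Thanos sidecar gRPC endpoint outside the cluster so the
  # existing Thanos components can reach it.
  thanosServiceExternal:
    enabled: true
    loadBalancerIP: 203.0.113.10   # placeholder; use the Terraform-created IP
```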

Alerting rules are configured in Prometheus, which sends alerts to an Alertmanager. The Alertmanager then manages those alerts and sends notifications, for example to a Slack channel. We will not be using the bundled Alertmanager in prometheus-stack. Instead, we've configured the use of an existing Alertmanager cluster.
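In the prometheus-stack values this typically amounts to disabling the bundled Alertmanager and pointing Prometheus at the external one; the address below is a placeholder:

```yaml
# Disable the bundled Alertmanager
alertmanager:
  enabled: false
prometheus:
  prometheusSpec:
    # Send alerts to the existing external Alertmanager cluster
    additionalAlertmanagerConfigs:
      - static_configs:
          - targets:
              - alertmanager.example.com:9093   # placeholder address
```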

Note: If using a public cluster, you will need to configure the IP Masquerade Agent in your cluster. Example configuration.

  1. Configure Alertmanager.
  2. Configure Dead Man's Snitch for Alertmanager. Alertmanager should send notifications for the dead man's switch to the configured notification provider. This ensures that communication between the Alertmanager and the notification provider is working. (example merge request)
  3. Configure routing to Slack channels (example merge request); a hedged sketch of this routing follows below.
  • TBA.
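To make steps 2 and 3 concrete, here is a minimal sketch of what the external Alertmanager's routing could look like. The receiver names, snitch URL, and Slack details are placeholders, not the actual staging-ref configuration:

```yaml
# Hypothetical Alertmanager routing sketch; all names and URLs are placeholders.
route:
  receiver: slack-staging-ref
  routes:
    # Route the always-firing Watchdog alert to Dead Man's Snitch; as long as
    # the snitch keeps receiving check-ins, the Prometheus -> Alertmanager ->
    # notification path is known to be working.
    - receiver: dead-mans-snitch
      matchers:
        - alertname = "Watchdog"
      repeat_interval: 5m
receivers:
  - name: dead-mans-snitch
    webhook_configs:
      - url: https://nosnch.in/EXAMPLE                       # placeholder snitch URL
  - name: slack-staging-ref
    slack_configs:
      - api_url: https://hooks.slack.com/services/EXAMPLE    # placeholder webhook
        channel: '#staging-ref-alerts'                       # placeholder channel
```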

Dashboards for staging-ref can be found in Grafana under the staging-ref folder. If additional dashboards are needed, they can be added through the runbooks or manually.

If added manually, the dashboard UID needs to be added to the protected dashboards list to prevent the automated deletion that happens every 24 hours.