GET Monitoring Setup
This documentation outlines setting up the staging-ref environment to work with GitLab infrastructure monitoring.
Prerequisites
- A private cluster is preferred for setting up Alertmanager.
- Staging-ref is not a VPC-peered environment, so we had to add workarounds such as adding an ingress for each Alertmanager and configuring Cloud Armor.
Disable built-in GET monitoring
GET sets up Prometheus and Grafana in a VM, and the default GitLab Helm chart values also enable Prometheus and Grafana. Neither will be used, so both can be disabled. You can view examples of how to do this in the following MRs:
- Disable Grafana and Prometheus managed by GET and remove the GET monitoring VMs in the gitlab_charts.yml.j2 custom Helm config used by GET. This can be done by adding the following to the GitLab Helm values:

  ```yaml
  global:
    # Disable Grafana
    grafana:
      enabled: false
    ...
  # Disable built-in Prometheus
  prometheus:
    install: false
  ```
Enable labels
Labels help organize metrics by service. Labels can be added via the GitLab Helm chart.
- Labels need to be added to the GitLab Helm values:

  ```yaml
  global:
    common:
      labels:
        stage: main
        shard: default
        tier: sv
  ```

- Deployment labels need to be added. For an up-to-date list, check out gitlab_charts.yml.j2 in the staging-ref repository; a rough sketch of what these can look like follows this list.
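For illustration only, a hedged sketch of per-component labels in the GitLab Helm values. The exact keys and label values here are assumptions; the authoritative list lives in gitlab_charts.yml.j2 in the staging-ref repository.

```yaml
# Sketch only: per-component (deployment) labels in the GitLab Helm values.
# Keys and label values are illustrative — check gitlab_charts.yml.j2 in staging-ref.
gitlab:
  webservice:
    common:
      labels:
        type: web
  sidekiq:
    common:
      labels:
        type: sidekiq
  gitaly:
    common:
      labels:
        type: gitaly
```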
Prometheus
Prometheus is an open-source monitoring and alerting tool used to monitor all services within GitLab infrastructure. You can read more about the technical details of the project here.
Deploy prometheus-stack
Prometheus-stack is a Helm chart that bundles cluster monitoring with Prometheus using the Prometheus Operator. We’ll be using this chart to deploy Prometheus.
- Deploy to the GET cluster under the prometheus namespace via Helm. In staging-ref, this is managed by CI jobs that validate and configure any changes to the Helm chart. You can view the setup of this chart in this directory. A hedged sketch of such a pipeline follows.
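As a rough illustration of a CI-managed flow (job names, image, and paths are assumptions and do not mirror the real staging-ref pipeline), the validate/deploy steps could look something like this:

```yaml
# Hypothetical .gitlab-ci.yml sketch; job names, image, and paths are illustrative.
# Cluster credentials are assumed to be provided to the deploy job (e.g. via the environment).
stages:
  - validate
  - deploy

validate:prometheus-stack:
  stage: validate
  image: alpine/helm:3.12.0
  script:
    - helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    - helm repo update
    # Render the chart with the staging-ref values to catch errors before deploying
    - helm template prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus -f prometheus-stack/values.yaml > /dev/null

deploy:prometheus-stack:
  stage: deploy
  image: alpine/helm:3.12.0
  script:
    - helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    - helm repo update
    # Install or upgrade the release in the prometheus namespace
    - helm upgrade --install prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace -f prometheus-stack/values.yaml
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```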
Scraping targets
Scrape targets are configured in the values.yaml file under the prometheus-stack directory. Relabeling is applied to scrape targets so that their labels match what is used in staging and production.
- Kubernetes targets. Prometheus scrape targets can be found in additionalPodMonitors and additionalServiceMonitors in values.yaml.
- Omnibus targets. Prometheus scrape targets can be found under additionalScrapeConfigs in values.yaml.
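For illustration, a hedged sketch of what these sections can look like in values.yaml. The names, selectors, labels, and targets below are placeholders, not the actual staging-ref configuration.

```yaml
# Illustrative only — names, selectors, labels, and targets are placeholders.
additionalServiceMonitors:
  - name: gitlab-webservice
    namespaceSelector:
      matchNames:
        - gitlab
    selector:
      matchLabels:
        app: webservice
    endpoints:
      - port: http-metrics
        interval: 30s
        relabelings:
          # Carry pod labels over so metrics match the staging/production label scheme
          - sourceLabels: [__meta_kubernetes_pod_label_stage]
            targetLabel: stage

additionalPodMonitors:
  - name: gitlab-sidekiq
    namespaceSelector:
      matchNames:
        - gitlab
    selector:
      matchLabels:
        app: sidekiq
    podMetricsEndpoints:
      - port: http-metrics
        interval: 30s

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      # Omnibus (VM) targets scraped directly over the network
      - job_name: omnibus-node
        static_configs:
          - targets:
              - "10.0.0.10:9100" # node_exporter on an Omnibus VM (placeholder)
        relabel_configs:
          - source_labels: [__address__]
            target_label: fqdn
```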
Exporters
Exporters expose existing metrics from their applications or services so that Prometheus can scrape them. A few of them are disabled by default and need to be enabled before they can be used. Exporters that need to be enabled manually within the GitLab Helm values are listed below; a rough sketch of the values involved follows the list:
- gitlab-shell (merge request example)
- http-workhorse-exporter (merge request example)
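A hedged sketch of the kind of values involved. The exact keys differ between GitLab chart versions, so treat this only as orientation and refer to the linked merge request examples for the authoritative settings.

```yaml
# Sketch only — verify the exact keys against the GitLab chart version in use
# and the linked merge request examples.
gitlab:
  gitlab-shell:
    metrics:
      enabled: true      # gitlab-shell exporter
  webservice:
    workhorse:
      metrics:
        enabled: true    # http-workhorse-exporter
```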
Thanos
Thanos is a set of components that helps maintain a highly available Prometheus setup with long-term storage capabilities. Here we will not be deploying Thanos, but rather connecting Prometheus to our already existing Thanos cluster.
Thanos sidecar
The sidecar component of Thanos gets deployed along with the Prometheus instance. This configuration exists in the prometheus-stack Helm chart. Thanos-sidecar will back up Prometheus data into an object storage bucket and give other Thanos components access to the Prometheus metrics via a gRPC API.
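The sidecar's object storage access uses a standard Thanos objstore configuration, typically stored in a Kubernetes secret that the chart references. A minimal GCS example (the bucket name is a placeholder):

```yaml
# objstore.yml — Thanos object storage configuration for GCS.
# The bucket name is a placeholder; access comes from the workload identity
# or service account configured for the deployment.
type: GCS
config:
  bucket: staging-ref-prometheus-data
```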
- To be added in Terraform:
  - Create an external IP to use as a loadBalancerIP
  - Create a GCS bucket for Prometheus data
  - Create the service account used by thanos-store in Kubernetes for access to its GCS bucket
  - Configure a workload identity to be used by Thanos
- To be configured in the Helm chart (a rough sketch follows this list):
  - Enable the service that exposes thanos-sidecar
  - Add a secret with object storage credentials
  - Configure object storage
- You can view an example MR of these configurations here.
- Lastly, you need to add Prometheus to thanos-store in order to be able to query historical data from thanos-query. You can view an example MR on how to do that here.
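For illustration, a hedged sketch of the chart-side pieces above in the prometheus-stack values. The key names follow the kube-prometheus-stack layout, but the secret name, key, and IP are placeholders, not the actual staging-ref configuration.

```yaml
# Sketch only — secret name, key, and IP are placeholders.
prometheus:
  # Expose the thanos-sidecar gRPC endpoint so other Thanos components can reach it
  thanosService:
    enabled: true
  thanosServiceExternal:
    enabled: true
    loadBalancerIP: x.x.x.x   # external IP created in Terraform
  prometheusSpec:
    thanos:
      objectStorageConfig:
        # References a secret containing the objstore.yml shown earlier
        existingSecret:
          name: thanos-objstore-config
          key: objstore.yml
```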
Alerts and Alertmanager
Alerting rules are configured in Prometheus, which then sends alerts to an Alertmanager. The Alertmanager manages those alerts and sends notifications, for example to a Slack channel. We will not be using the bundled Alertmanager in prometheus-stack. Instead, we've configured the use of the existing Alertmanager cluster.
Note: If using a public cluster, you will need to configure the IP Masquerade Agent in your cluster. Example configuration.
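For reference, a minimal sketch of an ip-masq-agent ConfigMap on GKE. The CIDRs are placeholders; follow the linked example configuration for the real values.

```yaml
# Sketch only — CIDRs are placeholders; see the linked example configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
      - 10.0.0.0/8      # keep in-cluster/VPC traffic unmasqueraded
    masqLinkLocal: false
    resyncInterval: 60s
```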
- Configure Alertmanager
  - Add the cluster IP to the allowed IP ranges used by our Cloud Armor security policy:

    ```hcl
    src_ip_ranges = [
      "x.x.x.x/32", # GKE cluster NAT IP
    ]
    ```

  - Configure additionalAlertManagerConfigs (example merge request); a rough sketch follows this list.
- Configure Dead Man’s Snitch for Alertmanager. Alertmanager should send notifications for the dead man’s switch to the configured notification provider. This ensures that communication between the Alertmanager and the notification provider is working. (example merge request)
- Configure routing to Slack channels (example merge request).
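A hedged sketch of an additionalAlertManagerConfigs entry in the prometheus-stack values. The hostname and timeout are placeholders; see the example merge request for the real configuration.

```yaml
# Sketch only — the Alertmanager hostname is a placeholder.
prometheus:
  prometheusSpec:
    additionalAlertManagerConfigs:
      - scheme: https
        timeout: 10s
        static_configs:
          - targets:
              - alertmanager.example.gitlab.net
```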
Prometheus rules
- TBA.
Dashboards
Dashboards for staging-ref can be found in Grafana under the staging-ref folder. If additional dashboards are needed, they can be added through the runbooks or added manually.
If added manually, the dashboard uid needs to be added to the protected dashboards list to prevent the automated deletion that happens every 24 hours.
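For illustration only, a hypothetical sketch of what adding a uid to a protected dashboards list could look like. The file name and structure are assumptions; the real list lives in the runbooks repository.

```yaml
# Hypothetical — the actual file name and structure in the runbooks repository
# may differ; this only illustrates adding a dashboard uid to the list.
protected_dashboards:
  - uid: staging-ref-overview   # uid taken from the Grafana dashboard URL
```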