# GET Monitoring Setup

This documentation outlines how to set up the staging-ref environment to work with GitLab infrastructure monitoring.
## Prerequisites

- A private cluster is preferred for setting up Alertmanager.
- Staging-ref is not a VPC-peered environment, so we had to add workarounds such as adding an ingress for each Alertmanager and configuring Cloud Armor.
## Disable built-in GET monitoring

GET sets up Prometheus and Grafana in a VM, and the default GitLab Helm chart values also enable Prometheus and Grafana. Neither will be used, so both can be disabled. You can view examples of how to do this via the following MRs:
- Disable the Grafana and Prometheus instances managed by GET and remove the GET monitoring VMs in the `gitlab_charts.yml.j2` custom Helm config used by GET. This can be done by adding the following to the GitLab Helm values:

```yaml
global:
  # Disable Grafana
  grafana:
    enabled: false
  ...
# Disable built-in Prometheus
prometheus:
  install: false
```

## Enable labels
Labels help organize metrics by service. Labels can be added via the GitLab Helm chart.
- Labels need to be added to the GitLab Helm values:

```yaml
global:
  common:
    labels:
      stage: main
      shard: default
      tier: sv
```

- Deployment labels need to be added; an illustrative sketch follows this list. For an up-to-date list, check out `gitlab_charts.yml.j2` in the `staging-ref` repository.
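As a minimal sketch of what per-deployment labels can look like, assuming each subchart accepts `common.labels` the same way the global values do (the key layout and label values below are assumptions; `gitlab_charts.yml.j2` in the `staging-ref` repository is authoritative):

```yaml
# Assumed layout; gitlab_charts.yml.j2 in staging-ref is authoritative.
gitlab:
  webservice:
    common:
      labels:
        type: frontend   # illustrative per-deployment label
  sidekiq:
    common:
      labels:
        type: worker
```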
## Prometheus

Prometheus is an open-source monitoring and alerting tool used to monitor all services within GitLab infrastructure. You can read more about the technical details of the project here.
## Deploy prometheus-stack

Prometheus-stack is a Helm chart that bundles cluster monitoring with Prometheus using the Prometheus Operator. We'll be using this chart to deploy Prometheus.
- Deploy to the GET cluster under the `prometheus` namespace via Helm. In staging-ref, this is managed by CI jobs that validate and configure any changes to the Helm chart; a sketch of what those jobs amount to follows below. You can view the setup of this chart in this directory.
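A minimal sketch of the deployment, assuming the upstream `prometheus-community` chart repository and a `values.yaml` in the prometheus-stack directory (the job name and paths are hypothetical; the actual staging-ref CI definition is authoritative):

```yaml
# Hypothetical CI job sketch, not the actual staging-ref pipeline.
deploy_prometheus_stack:
  script:
    - helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    - helm repo update
    # Install or upgrade the chart into the `prometheus` namespace,
    # using the values.yaml kept in the prometheus-stack directory.
    - helm upgrade --install prometheus-stack
      prometheus-community/kube-prometheus-stack
      --namespace prometheus --create-namespace
      --values values.yaml
```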
## Scraping targets

Scrape targets are configured in the `values.yaml` file under the prometheus-stack directory. Relabeling is applied to scrape targets to match what is used in staging and production.
- Kubernetes targets: Prometheus scrape targets can be found in `additionalPodMonitors` and `additionalServiceMonitors` in `values.yaml` (a sketch follows this list).
- Omnibus targets: Prometheus scrape targets can be found under `additionalScrapeConfigs` in `values.yaml`.
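For illustration, a service monitor entry with staging-style relabeling might look like the following in `values.yaml` (the monitor name, selector, and labels are assumptions, not the actual staging-ref configuration):

```yaml
prometheus:
  additionalServiceMonitors:
    - name: gitlab-webservice        # hypothetical monitor name
      selector:
        matchLabels:
          app: webservice            # hypothetical service label
      namespaceSelector:
        matchNames: [default]
      endpoints:
        - port: http-metrics
          # Relabel to match the label scheme used in staging and production.
          relabelings:
            - sourceLabels: [__meta_kubernetes_pod_label_stage]
              targetLabel: stage
```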
## Exporters

Exporters expose existing metrics from their applications or services so that Prometheus can scrape them. A few of them are disabled by default, and we'll need to enable them in order to use them. Exporters that need to be enabled manually within the GitLab Helm values are listed below, with a configuration sketch after the list:
- gitlab-shell (merge request example)
- http-workhorse-exporter (merge request example)
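As a sketch of the shape of those changes (the value paths below are assumptions; the linked merge requests show the actual changes):

```yaml
# Assumed value paths; see the linked merge requests for the real changes.
gitlab:
  gitlab-shell:
    metrics:
      enabled: true
  webservice:
    workhorse:
      monitoring:
        exporter:
          enabled: true
```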
## Alerts and Alertmanager

Alerting rules are configured in Prometheus, which sends alerts to an Alertmanager. The Alertmanager then manages those alerts and sends notifications, for example to a Slack channel. We will not be using the bundled Alertmanager in prometheus-stack; instead, we've configured the use of the existing Alertmanager cluster.
Note: If using a public cluster, you will need to configure the IP Masquerade Agent in your cluster. Example configuration. A hedged sketch of the agent's ConfigMap follows.
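This is a minimal sketch of the ip-masq-agent ConfigMap, assuming the standard agent deployed in `kube-system` (the CIDR is a placeholder; the linked example configuration is authoritative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
      - 10.0.0.0/8     # placeholder: traffic to these CIDRs keeps its pod IP
    masqLinkLocal: false
```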
- Configure Alertmanager:
  - Add the cluster IP to the allowed IP ranges used by our Cloud Armor security policy:

    ```hcl
    src_ip_ranges = [
      "x.x.x.x/32", # GKE cluster NAT IP
    ]
    ```

  - Configure `additionalAlertManagerConfigs` (example merge request); a sketch follows after this list.
- Configure Dead Man’s Snitch for Alertmanager. Alertmanager should send notifications for the dead man’s switch to the configured notification provider. This ensures that communication between the Alertmanager and the notification provider is working. (example merge request)
- Configure routing to Slack channels (example merge request).
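A minimal sketch of `additionalAlertManagerConfigs` pointing Prometheus at an external Alertmanager (the scheme and hostname are placeholders; the example merge request shows the real configuration):

```yaml
prometheus:
  prometheusSpec:
    additionalAlertManagerConfigs:
      - scheme: https
        static_configs:
          - targets:
              - alertmanager.example.com   # placeholder Alertmanager ingress
```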
## Prometheus rules

- TBA.
## Dashboards

Dashboards for staging-ref can be found in Grafana under the staging-ref folder. If additional dashboards need to be added, they can be added through the runbooks or manually.
If added manually, the dashboard UID needs to be added to the protected dashboards list to prevent the automated deletion that happens every 24 hours.