GET Monitoring Setup
This documentation outlines setting up the staging-ref environment to work with GitLab infrastructure monitoring.
Prerequisites
- A private cluster is preferred for setting up Alertmanager.
- Staging-ref is not a VPC-peered environment, so we had to add workarounds such as adding an ingress for each Alertmanager and configuring Cloud Armor.
Disable built-in GET monitoring
GET sets up Prometheus and Grafana in a VM, and the default GitLab Helm chart values also enable Prometheus and Grafana. Neither will be used, so both can be disabled. You can view examples of how to do this in the following MRs:
- Disable Grafana and Prometheus managed by GET and remove the GET monitoring VMs in the gitlab_charts.yml.j2 custom Helm config used by GET. This can be done by adding the following to the GitLab Helm values:

  ```yaml
  global:
    # Disable Grafana
    grafana:
      enabled: false
    ...
  # Disable built-in Prometheus
  prometheus:
    install: false
  ```
Enable labels
Labels help organize metrics by service. Labels can be added via the GitLab Helm chart.
- Labels need to be added to the GitLab Helm values:

  ```yaml
  global:
    common:
      labels:
        stage: main
        shard: default
        tier: sv
  ```

- Deployment labels need to be added. For an up-to-date list, check out gitlab_charts.yml.j2 in the staging-ref repository; a rough sketch of what these can look like follows this list.
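For illustration only, a hedged sketch of per-component labels in the GitLab Helm values. The exact keys and label values here are assumptions; the authoritative list lives in gitlab_charts.yml.j2 in the staging-ref repository.

```yaml
# Sketch only: per-component (deployment) labels in the GitLab Helm values.
# Keys and label values are illustrative — check gitlab_charts.yml.j2 in staging-ref.
gitlab:
  webservice:
    common:
      labels:
        type: web
  sidekiq:
    common:
      labels:
        type: sidekiq
  gitaly:
    common:
      labels:
        type: gitaly
```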
Prometheus
Prometheus is an open-source monitoring and alerting tool used to monitor all services within GitLab infrastructure. You can read more about the technical details of the project here.
Deploy prometheus-stack
Prometheus-stack is a Helm chart that bundles cluster monitoring with Prometheus using the Prometheus Operator. We’ll be using this chart to deploy Prometheus.
- Deploy to the GET cluster under the prometheus namespace via Helm. In staging-ref, this is managed by CI jobs that validate and configure any changes to the Helm chart. You can view the setup of this chart in this directory. A hedged sketch of such a pipeline follows.
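As a rough illustration of a CI-managed flow (job names, image, and paths are assumptions and do not mirror the real staging-ref pipeline), the validate/deploy steps could look something like this:

```yaml
# Hypothetical .gitlab-ci.yml sketch; job names, image, and paths are illustrative.
# Cluster credentials are assumed to be provided to the deploy job (e.g. via the environment).
stages:
  - validate
  - deploy

validate:prometheus-stack:
  stage: validate
  image: alpine/helm:3.12.0
  script:
    - helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    - helm repo update
    # Render the chart with the staging-ref values to catch errors before deploying
    - helm template prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus -f prometheus-stack/values.yaml > /dev/null

deploy:prometheus-stack:
  stage: deploy
  image: alpine/helm:3.12.0
  script:
    - helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    - helm repo update
    # Install or upgrade the release in the prometheus namespace
    - helm upgrade --install prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace -f prometheus-stack/values.yaml
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```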
Scraping targets
Scrape targets are configured in the values.yaml file under the prometheus-stack directory. Relabeling is applied to scrape targets so that their labels match what is used in staging and production.
- Kubernetes targets. Prometheus scrape targets can be found in additionalPodMonitors and additionalServiceMonitors in values.yaml.
- Omnibus targets. Prometheus scrape targets can be found under additionalScrapeConfigs in values.yaml.
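For illustration, a hedged sketch of what these sections can look like in values.yaml. The names, selectors, labels, and targets below are placeholders, not the actual staging-ref configuration.

```yaml
# Illustrative only — names, selectors, labels, and targets are placeholders.
additionalServiceMonitors:
  - name: gitlab-webservice
    namespaceSelector:
      matchNames:
        - gitlab
    selector:
      matchLabels:
        app: webservice
    endpoints:
      - port: http-metrics
        interval: 30s
        relabelings:
          # Carry pod labels over so metrics match the staging/production label scheme
          - sourceLabels: [__meta_kubernetes_pod_label_stage]
            targetLabel: stage

additionalPodMonitors:
  - name: gitlab-sidekiq
    namespaceSelector:
      matchNames:
        - gitlab
    selector:
      matchLabels:
        app: sidekiq
    podMetricsEndpoints:
      - port: http-metrics
        interval: 30s

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      # Omnibus (VM) targets scraped directly over the network
      - job_name: omnibus-node
        static_configs:
          - targets:
              - "10.0.0.10:9100" # node_exporter on an Omnibus VM (placeholder)
        relabel_configs:
          - source_labels: [__address__]
            target_label: fqdn
```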
Exporters
Exporters expose existing metrics from their applications or services so that Prometheus can scrape them. A few of them are disabled by default and need to be enabled before they can be used. Exporters that need to be enabled manually within the GitLab Helm values are listed below; a rough sketch of the values involved follows the list:
- gitlab-shell (merge request example)
- http-workhorse-exporter (merge request example)
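A hedged sketch of the kind of values involved. The exact keys differ between GitLab chart versions, so treat this only as orientation and refer to the linked merge request examples for the authoritative settings.

```yaml
# Sketch only — verify the exact keys against the GitLab chart version in use
# and the linked merge request examples.
gitlab:
  gitlab-shell:
    metrics:
      enabled: true      # gitlab-shell exporter
  webservice:
    workhorse:
      metrics:
        enabled: true    # http-workhorse-exporter
```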
Thanos
Thanos is a set of components that helps maintain a highly available Prometheus setup with long-term storage capabilities. Here we will not be deploying Thanos, but rather connecting Prometheus to our already existing Thanos cluster.
Thanos sidecar
The sidecar component of Thanos gets deployed along with the Prometheus instance. This configuration exists in the prometheus-stack Helm chart. Thanos-sidecar will back up Prometheus data into an object storage bucket and give other Thanos components access to the Prometheus metrics via a gRPC API.
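The sidecar's object storage access uses a standard Thanos objstore configuration, typically stored in a Kubernetes secret that the chart references. A minimal GCS example (the bucket name is a placeholder):

```yaml
# objstore.yml — Thanos object storage configuration for GCS.
# The bucket name is a placeholder; access comes from the workload identity
# or service account configured for the deployment.
type: GCS
config:
  bucket: staging-ref-prometheus-data
```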
- To be added in Terraform:
  - Create an external IP to use as a loadBalancerIP
  - Create a GCS bucket for Prometheus data
  - Create the service account used by thanos-store in Kubernetes for access to its GCS bucket
  - Configure a workload identity to be used by Thanos
- To be configured in the Helm chart (a rough sketch follows this list):
  - Enable the service that exposes thanos-sidecar
  - Add a secret with object storage credentials
  - Configure object storage
- You can view an example MR of these configurations here.
- Lastly, you need to add Prometheus to thanos-store in order to be able to query historical data from thanos-query. You can view an example MR on how to do that here.
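For illustration, a hedged sketch of the chart-side pieces above in the prometheus-stack values. The key names follow the kube-prometheus-stack layout, but the secret name, key, and IP are placeholders, not the actual staging-ref configuration.

```yaml
# Sketch only — secret name, key, and IP are placeholders.
prometheus:
  # Expose the thanos-sidecar gRPC endpoint so other Thanos components can reach it
  thanosService:
    enabled: true
  thanosServiceExternal:
    enabled: true
    loadBalancerIP: x.x.x.x   # external IP created in Terraform
  prometheusSpec:
    thanos:
      objectStorageConfig:
        # References a secret containing the objstore.yml shown earlier
        existingSecret:
          name: thanos-objstore-config
          key: objstore.yml
```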
Alerts and Alertmanager
Alerting rules are configured in Prometheus, which then sends alerts to an Alertmanager. The Alertmanager manages those alerts and sends notifications, for example to a Slack channel. We will not be using the bundled Alertmanager in prometheus-stack. Instead, we've configured the use of the existing Alertmanager cluster.
Note: If using a public cluster, you will need to configure the IP Masquerade Agent in your cluster. Example configuration.
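For reference, a minimal sketch of an ip-masq-agent ConfigMap on GKE. The CIDRs are placeholders; follow the linked example configuration for the real values.

```yaml
# Sketch only — CIDRs are placeholders; see the linked example configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
      - 10.0.0.0/8      # keep in-cluster/VPC traffic unmasqueraded
    masqLinkLocal: false
    resyncInterval: 60s
```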
- Configure Alertmanager
  - Add the cluster IP to the allowed IP ranges used by our Cloud Armor security policy:

    ```hcl
    src_ip_ranges = [
      "x.x.x.x/32", # GKE cluster NAT IP
    ]
    ```

  - Configure additionalAlertManagerConfigs (example merge request); a rough sketch follows this list.
- Configure Dead Man’s Snitch for Alertmanager. Alertmanager should send notifications for the dead man’s switch to the configured notification provider. This ensures that communication between the Alertmanager and the notification provider is working. (example merge request)
- Configure routing to Slack channels (example merge request).
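A hedged sketch of an additionalAlertManagerConfigs entry in the prometheus-stack values. The hostname and timeout are placeholders; see the example merge request for the real configuration.

```yaml
# Sketch only — the Alertmanager hostname is a placeholder.
prometheus:
  prometheusSpec:
    additionalAlertManagerConfigs:
      - scheme: https
        timeout: 10s
        static_configs:
          - targets:
              - alertmanager.example.gitlab.net
```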
Prometheus rules
- TBA.
Dashboards
Dashboards for staging-ref can be found in Grafana under the staging-ref folder. If additional dashboards are needed, they can be added through the runbooks or added manually.
If added manually, the dashboard uid needs to be added to the protected dashboards list to prevent the automated deletion that happens every 24 hours.
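For illustration only, a hypothetical sketch of what adding a uid to a protected dashboards list could look like. The file name and structure are assumptions; the real list lives in the runbooks repository.

```yaml
# Hypothetical — the actual file name and structure in the runbooks repository
# may differ; this only illustrates adding a dashboard uid to the list.
protected_dashboards:
  - uid: staging-ref-overview   # uid taken from the Grafana dashboard URL
```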