Upgrading Monitoring Components
Upgrading monitoring components requires changes in a few different places, but the procedure is the same from release to release.
Links to releases:
Links to various exporter releases:
- beat_exporter
- blackbox_exporter
- consul_exporter
- ebpf_exporter
- elasticsearch_exporter
- haproxy_exporter
- imap_mailbox_exporter
- influxdb_exporter
- mtail
- node_exporter
- pgbouncer_exporter
- postgres_exporter
- redis_exporter
- smokeping_prober
- stackdriver_exporter
- statsd_exporter
Monitoring
Monitoring components meta-monitor each other, but some care is needed to ensure we don’t have gaps in observability.
General
Most services expose a SERVICE_build_info metric that can be used to monitor the progress of the rollout. For example, prometheus_build_info.
Similarly, most services expose process_start_time_seconds, which shows when each instance last restarted.
It’s also worth checking the standard up metric.
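The three metrics above can be turned into queries to watch during a rollout. The following is a minimal sketch, not an established tool: the function name rollout_queries and the job label values are hypothetical, and it assumes the build_info metric carries a version label, which is typical for Prometheus components but should be confirmed per exporter.

```python
def rollout_queries(service: str, job: str) -> dict[str, str]:
    """Build PromQL expressions for watching the rollout of one service.

    `service` is the metric prefix (e.g. "prometheus"); `job` is the scrape
    job label value. Both are assumptions about the local configuration.
    """
    selector = f'{{job="{job}"}}'
    return {
        # Number of instances currently reporting each version.
        "versions": f"count by (version) ({service}_build_info{selector})",
        # Seconds since each instance's process started.
        "uptime": f"time() - process_start_time_seconds{selector}",
        # Targets that are currently failing their scrape.
        "down": f"up{selector} == 0",
    }


if __name__ == "__main__":
    for name, expr in rollout_queries("prometheus", "prometheus").items():
        print(f"{name}: {expr}")
```

During a rollout, the versions query should show the count shifting from the old version to the new one, while the uptime query drops to a small value on each instance as it restarts.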
Prometheus/Thanos
The monitoring-overview dashboard has a lot of details about Thanos and Prometheus metrics.
Pre-Change Steps
Create an infrastructure issue if there isn’t one yet.
The issue should detail:
- The components being upgraded.
- Any breaking changes from the release notes.
- Any significant features/improvements being rolled out.
Prepare upgrade MRs:
- Prometheus/Thanos/Pushgateway in Chef
- Prometheus in Helmfiles
- Grafana/Thanos in Tanka
- Various Exporters
- mtail in Chef
- mtail docker image for GKE
- mtail docker image in Helmfiles
Don’t forget to bump cookbook versions when submitting cookbook changes.
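As an illustration of the version bump, a cookbook's metadata.rb declares its version on a line such as version '1.4.9'. The sketch below increments the patch component of that line. The helper name bump_patch is hypothetical, and whether a given change warrants a patch, minor, or major bump depends on the cookbook's own conventions.

```python
import re

# Matches a Chef metadata.rb line such as:  version '1.4.9'
VERSION_LINE = re.compile(
    r"""^(\s*version\s+)(['"])(\d+)\.(\d+)\.(\d+)\2""", re.MULTILINE
)


def bump_patch(metadata_text: str) -> str:
    """Return the metadata text with the patch component of `version` incremented."""

    def _replace(match: re.Match) -> str:
        prefix, quote, major, minor, patch = match.groups()
        return f"{prefix}{quote}{major}.{minor}.{int(patch) + 1}{quote}"

    # Only the first version line is rewritten; other lines are left untouched.
    new_text, count = VERSION_LINE.subn(_replace, metadata_text, count=1)
    if count == 0:
        raise ValueError("no version line found in metadata")
    return new_text


if __name__ == "__main__":
    example = "name 'example'\nversion '1.4.9'\nchef_version '>= 14'\n"
    print(bump_patch(example))
```

The pattern is anchored to the start of a line, so a chef_version constraint is not mistaken for the cookbook version.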
Change Steps
- Merge Chef MRs to the relevant cookbook.
- Wait for the cookbook publisher to post MRs to chef-repo.
- Merge non-prod chef-repo MR and wait for Chef to deploy.
- Verify new versions are deployed.
- Merge prod chef-repo MR and wait for Chef to deploy.
- Verify new versions are deployed.
- Merge Helmfile/Tanka MRs.
- Verify new versions are deployed.
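One way to carry out the "verify new versions are deployed" steps is to run an instant query for the component's build_info metric and look for instances that still report an old version. The sketch below only parses the JSON body returned by the Prometheus HTTP API query endpoint (/api/v1/query). The function name instances_behind and the instance names in the example are hypothetical, and it assumes the metric has version and instance labels.

```python
import json


def instances_behind(response_text: str, expected_version: str) -> list[str]:
    """List instances whose build_info `version` label differs from expected_version.

    `response_text` is the JSON body of an instant query for a build_info
    metric, e.g. the query `prometheus_build_info`.
    """
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    behind = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        if labels.get("version") != expected_version:
            behind.append(labels.get("instance", "<unknown>"))
    return sorted(behind)


if __name__ == "__main__":
    # Hand-written example response; the instances and versions are made up.
    example = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"instance": "prom-01:9090", "version": "2.45.0"},
                 "value": [1700000000, "1"]},
                {"metric": {"instance": "prom-02:9090", "version": "2.44.0"},
                 "value": [1700000000, "1"]},
            ],
        },
    })
    print(instances_behind(example, "2.45.0"))
```

An empty list means every instance returned by the query reports the expected version. It does not prove that every expected instance was returned, so the up metric is still worth checking.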
Post-Change Steps
- Verify services are operating and no alerts are firing.
- Verify the service metrics are healthy.
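The "no alerts are firing" check can be scripted against the Prometheus HTTP API alerts endpoint (/api/v1/alerts), which returns active alerts with a state of pending or firing. The following is a sketch under that assumption. The function name firing_alerts and the alert names in the example are hypothetical, and a dashboard or Alertmanager may be the preferred source in practice.

```python
import json


def firing_alerts(response_text: str) -> list[str]:
    """Return the sorted, de-duplicated names of alerts in the `firing` state."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError("alerts request did not succeed")
    names = {
        alert["labels"].get("alertname", "<unnamed>")
        for alert in payload["data"]["alerts"]
        if alert.get("state") == "firing"  # skip alerts that are only pending
    }
    return sorted(names)


if __name__ == "__main__":
    # Hand-written example response; the alert names are made up.
    example = json.dumps({
        "status": "success",
        "data": {
            "alerts": [
                {"labels": {"alertname": "ExampleTargetDown"}, "state": "firing"},
                {"labels": {"alertname": "ExampleHighLatency"}, "state": "pending"},
            ]
        },
    })
    print(firing_alerts(example))
```

Comparing the result with the set of alerts that were already firing before the change helps separate regressions from pre-existing noise.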
Rollback
- Prepare and submit rollback MRs for Chef/Helmfiles/Tanka.
- Verify the service returns to normal.