component_saturation_slo_out_of_bounds:kube_persistent_volume_claim_disk_space
Overview
- This alert means that a Kubernetes persistent volume is running out of disk space.
- This could be natural growth of the data stored within the volume.
- It could also be an abnormality where an unexpectedly large amount of data is being written.
- This affects the pod(s) that mount the full volume. It could cause downtime of those pods if the drive fills up.
- The recipient of the alert needs to investigate which volume is filling up and remediate the issue, either by growing the disk or by determining why an anomalous amount of data is being written and cleaning up the volume.
Services
Metrics
- This alert is based on kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes.
- The soft SLO is at 85% full and the hard SLO is at 90% full (a spot-check query is sketched below).
- Example Grafana Query
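For a quick spot check outside Grafana, the same ratio can be queried against the Prometheus HTTP API. This is a minimal sketch: `PROMETHEUS_URL` is a placeholder for whichever Prometheus endpoint serves the kubelet metrics, and the 0.85 threshold mirrors the soft SLO above.

```bash
# Minimal sketch: list volumes above the 85% soft SLO via the Prometheus HTTP API.
# PROMETHEUS_URL is a placeholder for the Prometheus endpoint serving kubelet metrics.
curl -sG "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85'
```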
Alert Behavior
- This alert should be fairly rare.
- This alert can be silenced if there is a plan in place to resolve the issue. Generally, the alert should be resolved instead of silenced.
Severities
- Incidents involving this alert are likely S3 or S4, as the service is likely still up. If a PVC fills up completely, it could impact the service, but this alert should fire before that happens.
- This is not a user-impacting alert.
Verification
Recent changes
Troubleshooting
- Check the dashboard linked in the alert to determine which PVC is filling up.
- Once the PVC is identified, check the associated pod logs to see if there is any clear reason the drive is filling up.
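If the dashboard is unavailable, roughly the same investigation can be done manually with kubectl. The commands below are a sketch; the namespace, PVC, and pod names are placeholders, and the mount paths depend on the pod spec.

```bash
# Sketch of a manual investigation; <namespace>, <pvc-name>, and <pod-name> are placeholders.

# Find the PVC and the pod(s) mounting it ("Used By" in the describe output).
kubectl -n <namespace> get pvc
kubectl -n <namespace> describe pvc <pvc-name>

# Check actual disk usage from inside the pod.
kubectl -n <namespace> exec <pod-name> -- df -h

# Look for clues about what is writing the data.
kubectl -n <namespace> logs <pod-name> --since=1h
```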
Possible Resolutions
- Increase PVC size (a resize sketch follows this list)
- Previous incident resolution: Zoekt persistent volume claim saturation
- Previous incident involving prometheus-agent
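As a rough sketch of the first option, a PVC can be grown in place when the underlying StorageClass has `allowVolumeExpansion: true`; the names and the 100Gi target size below are placeholders.

```bash
# Sketch: request a larger size on the existing PVC (placeholders throughout).
# Only works if the StorageClass allows volume expansion.
kubectl -n <namespace> patch pvc <pvc-name> \
  --patch '{"spec": {"resources": {"requests": {"storage": "100Gi"}}}}'

# Watch the resize; some volume types only expand the filesystem after a pod restart.
kubectl -n <namespace> get pvc <pvc-name> -w
```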
Dependencies
- No other dependencies can cause this alert.
- Slack Channel: #g_foundations
- It is unlikely we will ever need to tune this alert much, as the thresholds are reasonable percentages.