ComponentResourceRunningOut_disk_space
Overview
This alert means that disk space utilization on one of a node's disks is growing rapidly and is predicted to reach its capacity within the next 6 hours. The cause of the fast growth should be investigated.
The affected service is named in the alert, and the team that owns it can be found by searching for the service name in the Service Catalog.
Services
This alert is created from a template and does not have an assigned team, so it can fire for any GitLab component. To identify the owning team, find the service for which the alert fired and look up its ownership details in the Service Catalog.
Metrics
The alert expression predicts whether the component's saturation will exceed the defined hard SLO within the specified time frame. When the alert fires, the resource is growing rapidly and is predicted to exceed its saturation threshold within that interval.
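The alert expression itself is not shown in this runbook, so the following is only a minimal sketch of the general idea of predicting time to saturation. It assumes simple linear growth between two utilization samples. The function name `hours_until_full` and its arguments are hypothetical and are not part of the alert definition.

```shell
# Hypothetical helper: estimate the hours left until a disk reaches a
# utilization threshold, assuming growth stays linear between two samples.
# Usage: hours_until_full <earlier_pct> <current_pct> <hours_between> [threshold_pct]
hours_until_full() {
  awk -v a="$1" -v b="$2" -v dt="$3" -v limit="${4:-100}" 'BEGIN {
    if (dt <= 0) { print "interval must be positive" > "/dev/stderr"; exit 1 }
    if (b >= limit) { print "0.0"; exit }   # threshold already reached
    rate = (b - a) / dt                      # percentage points per hour
    if (rate <= 0) { print "inf"; exit }     # not growing: never fills
    printf "%.1f\n", (limit - b) / rate
  }'
}
```

For example, a disk that went from 70% to 76% in one hour is growing at 6 percentage points per hour, so `hours_until_full 70 76 1` prints `4.0`, which is inside the 6-hour window described in the overview.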
Alert Behavior
This alert is rare. If it triggers, it should be investigated, because it may lead to a service running out of available disk space, which could in turn trigger other incidents with higher severity. Silencing this alert is not recommended.
Severities
This alert is usually assigned a low severity (S4 or S3), but the severity may rise if the resource is not investigated and disk usage reaches capacity.
Review the Incident Severity Handbook page to identify the required severity level.
Verification
Additional monitoring dashboards for the affected service can be found in the “Saturation Details” section, under the “disk_space component saturation” view. Example: HAProxy Disk Space Utilization per Device per Node.
Recent changes
The alert is created from a template and applies to many services. To find recent changes, review the closed production issues for the specific service. To narrow the issues down to the affected service named in the alert, apply the search filter Label=Service::<service_name> (example for the HAProxy service).
Troubleshooting
The basic approach to troubleshooting is to find the disk that is running out of space and identify what is consuming it.
- start with `sudo df -h -x squashfs` to find which disk is out of space
- use the `du` and `lsof` tools to understand what might be causing the disk usage
- use `sudo apt-get clean` to purge the package manager cache
- consider removing old kernels with `sudo apt-get autoremove`
- examine log files for rapid growth: `find /var/log -type f -size +100M`
The goal of troubleshooting is to understand the nature of the increased disk usage and to identify whether adjustments, fixes, or capacity changes are needed.
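The first two steps above can be sketched as small shell helpers. This is an illustrative sketch, not an official tool. The function names `filter_full_filesystems` and `largest_entries` and the 80% default threshold are assumptions. It also assumes `df -P` style output, mount points without spaces, and a `du` that supports `-d` (GNU coreutils does).

```shell
# Hypothetical helper: print "mountpoint usage%" for every filesystem at or
# above a usage threshold (default 80). Reads `df -P` output on stdin.
filter_full_filesystems() {
  threshold="${1:-80}"
  awk -v t="$threshold" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)                 # "91%" -> "91"
    if (pct + 0 >= t) print $6, $5
  }'
}

# Hypothetical helper: list the largest entries (in KiB) directly under a
# directory, without crossing into other filesystems.
largest_entries() {
  dir="${1:-/}"
  count="${2:-10}"
  du -x -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example usage on the affected node (run in a root shell if needed):
#   df -P -x squashfs | filter_full_filesystems 80
#   largest_entries /var 10
#   lsof +L1    # open files that were deleted but still hold disk space
```

Reading `df` output from stdin keeps the filter independent of the node it runs on, so it can be checked against canned input. The `lsof +L1` line covers the common case where `du` reports less usage than `df` because a process still holds a deleted file open.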
Possible Resolutions
Examples of previous incidents:
- Low disk space on Gitaly storage
- Disk Space Utilization for ci-runners service
- All past incidents for the ComponentResourceRunningOut_disk_space alert
Dependencies
There are no external dependencies for this alert.
After the owning team has been identified for the affected component, find that team's Slack channel and escalate there.
Alternative slack channels: