CI Deleted Objects Processing Triage
SLI Alert: ci_deleted_objects_processing
This page contains instructions for resolving alerts related to deleting job artifacts (Ci::DeleteObjectsWorker). The intended audience is product engineers and support engineers looking to resolve issues with degraded artifact deletion on gitlab.com.
Overview
- Sidekiq Service Level Indicator Dashboard: scroll to the row with ci_deleted_objects_processing SLI to see current apdex and error ratios.
- Alerts: https://alerts.gitlab.net/#/alerts?silenced=false&inhibited=false&muted=false&active=true&filter=%7Btype%3D%22sidekiq%22%2C%20component%3D%22ci_deleted_objects_processing%22%7D
- Label: gitlab-com/gl-infra/production~“Service::Sidekiq”
Logging
Ci::DeleteObjectsService
Artifacts can be deleted by 2 mechanisms:
- Default expiry: 30 days
- artifacts:expire_in can be set in the pipeline configuration (.gitlab-ci.yml)
Note: By default (unless explicitly disabled), artifacts are always kept for the most recent successful pipeline on each ref. Any expire_in configuration does not apply to the most recent artifacts. More information: job artifact documentation
Once artifacts are “expired”, a record is created in the Ci::DeletedObject table. Ci::ScheduleDeleteObjectsCronWorker runs every 16 minutes, queueing artifacts in batches for destruction through Ci::DeleteObjectsService.
Why was this alert implemented?
There have been incidents in the past where Ci::DeleteObjectsWorker could not keep up with the amount of data that needed to be removed. There have also been occurrences where the workers silently failed due to a newly introduced bug (!165778), or were under-provisioned for a few days before recovering.
While this does not directly cause a user-facing issue, keeping expired artifacts increases storage costs, and without an alert this may go unnoticed for quite some time.
Monitoring/Alerting
Alerting has been configured through Runbooks to post to the Slack channel #g_pipeline-execution_alerts.
If a ci_deleted_objects_processing alert is triggered, it should be triaged and investigated as soon as possible by a member of the Verify::Pipeline Execution team.
For any questions, please reach out to the team in Slack via #g_pipeline-execution or #s_verify, or use GitLab’s group handle @gitlab-com/pipeline-execution-group.
Apdex Violation
The apdex for this alert considers the time elapsed between the creation of a Ci::DeletedObject and its deletion. The threshold is set to 12 hours. If, at the time of deletion, the artifact had been expired (meaning a record in ci_deleted_objects was created) more than 12 hours ago, the apdex threshold is considered breached and the issue should be investigated. This alert will trigger when the apdex is degraded.
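As a rough illustration of the calculation described above, the sketch below computes an apdex ratio from creation and deletion timestamps. The function name `deletion_apdex` and the sample data are hypothetical; the real SLI is emitted by the application and aggregated by the monitoring stack rather than computed this way.

```python
from datetime import datetime, timedelta

# Threshold stated in this runbook: a deletion within 12 hours is satisfactory.
APDEX_THRESHOLD = timedelta(hours=12)


def deletion_apdex(events, threshold=APDEX_THRESHOLD):
    """Return the fraction of deletions completed within the threshold.

    `events` is an iterable of (created_at, deleted_at) pairs, where
    created_at is when the Ci::DeletedObject record was created and
    deleted_at is when the object was actually removed.
    Returns None when there are no events (no traffic, so no ratio).
    """
    total = 0
    satisfied = 0
    for created_at, deleted_at in events:
        total += 1
        if deleted_at - created_at <= threshold:
            satisfied += 1
    return satisfied / total if total else None


# Hypothetical sample: three deletions, one of which took 13 hours.
t0 = datetime(2024, 1, 1, 0, 0)
sample = [
    (t0, t0 + timedelta(hours=1)),
    (t0, t0 + timedelta(hours=11, minutes=59)),
    (t0, t0 + timedelta(hours=13)),
]
apdex = deletion_apdex(sample)  # two of the three deletions met the threshold
```

A falling ratio means a growing share of objects waited longer than 12 hours between expiry and deletion, which is the condition this alert is meant to surface.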
Traffic Cessation Violation
This alert is triggered when there is no traffic to Ci::DeleteObjectsWorker for 30 minutes. Considering the volume of artifacts created and expired each day, and the cron job (Ci::ScheduleDeleteObjectsCronWorker) running every 16 minutes, there is no reason why this worker would have no traffic for 30 minutes under usual operating conditions.
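The reasoning above can be checked with a little arithmetic: with a 16-minute schedule, any 30-minute window contains at least one cron tick, so a full window with no worker traffic is abnormal. The sketch below is illustrative only; the helper names are hypothetical and the real alert is evaluated by the monitoring stack.

```python
from datetime import datetime, timedelta

CRON_INTERVAL = timedelta(minutes=16)      # schedule stated in this runbook
NO_TRAFFIC_WINDOW = timedelta(minutes=30)  # alert window stated in this runbook


def expected_cron_runs(window=NO_TRAFFIC_WINDOW, interval=CRON_INTERVAL):
    """Minimum number of cron ticks guaranteed to fall inside any window."""
    return window // interval


def traffic_absent(last_execution_at, now, window=NO_TRAFFIC_WINDOW):
    """True when the worker has not executed within the alert window."""
    return now - last_execution_at >= window


# Hypothetical timestamps for illustration.
now = datetime(2024, 1, 1, 12, 0)
stale = traffic_absent(now - timedelta(minutes=31), now)
healthy = traffic_absent(now - timedelta(minutes=10), now)
```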
Error Violation
An error is recorded when the artifact deletion fails. This is different from the apdex violation, where the artifact is successfully deleted but not within an acceptable time frame. This SLO is measured but does not currently trigger an alert. It is necessary to monitor for traffic absence and can also be used for investigation purposes.
Triage & Troubleshooting
graph TD
  G[ALERT::APDEX_SLI_Violation: 12-hour deletion delay] --> C(Is the worker running successfully or throwing errors?)
  C --> |errors| H(Error debugging)
  C --> |success| D(Influx of expired JobArtifact records or drop in throughput?)
  D --> |Influx| E(Wait for it to catch up)
  D --> |Drop in throughput| F(Performance debugging)
  A[ALERT::Traffic_Absent_Violation: 0 Records deleted] --> B(Are there still records in the queue?)
  B -->|yes| C
  B -->|no| I(Go check on the ExpireBuildArtifactsWorker, why isn't it producing DeletedObjects?)
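The decision flow in the graph can also be read as a small function. This is only a sketch for walking through the branches: the function name, the alert labels `"apdex"` and `"traffic_absent"`, and the boolean inputs are hypothetical, and the answers to each question still have to come from dashboards and logs.

```python
def next_triage_step(alert, *, queue_has_records=True,
                     worker_erroring=False, influx=False):
    """Map the triage flowchart to a suggested next step.

    alert: "apdex" or "traffic_absent" (hypothetical labels for the two alerts).
    queue_has_records: only consulted for the traffic-absent alert.
    worker_erroring: True if the worker is throwing errors.
    influx: True if there is an influx of expired records, False if
            throughput has dropped instead.
    """
    if alert not in ("apdex", "traffic_absent"):
        raise ValueError(f"unknown alert: {alert!r}")

    # Traffic absent with an empty queue: the producer side is the suspect.
    if alert == "traffic_absent" and not queue_has_records:
        return "Check ExpireBuildArtifactsWorker"

    # Both alerts converge on the same question about the worker.
    if worker_erroring:
        return "Error debugging"
    if influx:
        return "Wait for it to catch up"
    return "Performance debugging"


# Example: apdex alert, worker healthy, throughput has dropped.
step = next_triage_step("apdex", worker_erroring=False, influx=False)
```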