CI Deleted Objects Processing Triage
SLI Alert: ci_deleted_objects_processing
This page contains instructions for resolving alerts related to deleting job artifacts (Ci::DeleteObjectsWorker). The intended audience is product engineers and support engineers looking to resolve issues with degraded artifact deletion on GitLab.com.
Overview
- Dashboard: Sidekiq Service Level Indicator Dashboard. Scroll to the row with the ci_deleted_objects_processing SLI to see current apdex and error ratios.
- Alerts: https://alerts.gitlab.net/#/alerts?silenced=false&inhibited=false&muted=false&active=true&filter=%7Btype%3D%22sidekiq%22%2C%20component%3D%22ci_deleted_objects_processing%22%7D
- Label: gitlab-com/gl-infra/production~"Service::Sidekiq"
Logging
Ci::DeleteObjectsService
Artifacts can be deleted by 2 mechanisms:
- Default expiry: 30 days
- artifacts:expire_in can be set in the pipeline configuration (.gitlab-ci.yml)
Note: By default (unless explicitly disabled), artifacts are always kept for the most recent successful pipeline on each ref. Any expire_in configuration does not apply to the most recent artifacts. More information: job artifact documentation
Once artifacts are "expired", a record is created in the Ci::DeletedObject table. Ci::ScheduleDeleteObjectsCronWorker runs every 16 minutes, queueing artifacts in batches for destruction through Ci::DeleteObjectsWorker and eventually Ci::DeleteObjectsService.
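For a quick look at the current backlog, a read-only Rails console session (obtained through the usual production access request) can be used. This is a minimal sketch against the Ci::DeletedObject model using standard ActiveRecord queries, not an official tooling command:

```ruby
# Read-only Rails console sketch: size and age of the deletion queue.
# Ci::DeletedObject rows are created when artifacts expire and are removed
# once Ci::DeleteObjectsService has destroyed the underlying objects.

# Total backlog of objects waiting to be deleted.
Ci::DeletedObject.count

# Records already eligible for pick-up, and the oldest pending pick_up_at.
# A pick_up_at far in the past means deletion is falling behind.
Ci::DeletedObject.where('pick_up_at < ?', Time.current).count
Ci::DeletedObject.minimum(:pick_up_at)
```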
Why was this alert implemented?
There have been incidents in the past where Ci::DeleteObjectsWorker could not keep up with the amount of data that needed to be removed. There have also been occurrences where the workers silently failed due to an introduced bug (!165778), or were under-provisioned for a few days before recovering.
While this does not directly cause a user-facing issue, keeping expired artifacts increases storage costs, and without an alert this may go unnoticed for quite some time.
Monitoring/Alerting
Alerting has been configured through Runbooks to post alerts into the Slack channel #g_pipeline-execution_alerts.
If a ci_deleted_objects_processing alert is triggered, it should be triaged and investigated as soon as possible by a member of the Verify::Pipeline Execution team.
For any questions, please reach out to the team in Slack via #g_pipeline-execution or #s_verify, or use GitLab's group handle @gitlab-com/pipeline-execution-group.
Apdex Violation
The apdex for this alert considers the time elapsed between the creation of a Ci::DeletedObject and its deletion. The threshold is set to 12 hours. If, at the time of deletion, the artifact had been expired (meaning a record in ci_deleted_objects was created) more than 12 hours ago, the apdex threshold is considered breached and the issue should be investigated. This alert will trigger when the apdex is degraded.
This dashboard shows the rate at which records are inserted into and deleted from ci_deleted_objects. A spike would indicate that a large batch of deletions occurred.
Using a production access request, you can group the data by project_id or file_store to see whether there are patterns in the to-be-deleted records.
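For example, a minimal read-only console sketch (plain ActiveRecord; these aggregate the whole ci_deleted_objects table and can be slow when the backlog is large):

```ruby
# Top 10 projects by number of pending deleted-object records.
Ci::DeletedObject.group(:project_id).order('count_all DESC').limit(10).count

# Distribution across object storage backends.
Ci::DeletedObject.group(:file_store).count
```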
Traffic Cessation Violation
This alert is triggered when there is no traffic to Ci::DeleteObjectsWorker for 30 minutes. Considering the volume of artifacts created and expired each day and the cron job (Ci::ScheduleDeleteObjectsCronWorker) running every 16 minutes, there is no reason for this worker to have no traffic for 30 minutes under usual operating conditions.
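If this alert fires, a quick first check from a Rails console is whether records are still waiting while nothing is being processed. A read-only sketch, again using plain ActiveRecord against Ci::DeletedObject:

```ruby
# Records already eligible for deletion. A large, non-decreasing number
# combined with zero worker traffic suggests the worker or its cron
# schedule is stuck, rather than the queue being empty.
Ci::DeletedObject.where('pick_up_at < ?', Time.current).count

# If the table is empty, nothing is being enqueued for deletion; look at
# the upstream expiry workers instead (see the flowchart below).
Ci::DeletedObject.count
```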
Error Violation
An error is recorded when the artifact deletion fails. This is different from the apdex violation, where the artifact is successfully deleted but not within an acceptable time frame. This SLO is measured but does not currently trigger an alert. It is necessary for monitoring traffic absence and can also be used for investigation purposes.
Triage & Troubleshooting
graph TD
G[ALERT::APDEX_SLI_Violation: 12-hour deletion delay ] --> C(Is the worker running successfully or throwing errors?)
C --> |errors| H(Error debugging)
C --> |success| D(Influx of expired JobArtifact records or drop in throughput?)
D --> |Influx| E(Wait for it to catch up)
D --> |Drop in throughput| F(Performance debugging)
A[ALERT::Traffic_Absent_Violation: 0 Records deleted] --> B(Are there still records in the queue?)
B -->|yes| C
B -->|no| I(Go check on the ExpireBuildArtifactsWorker, why isn't it producing DeletedObjects?)
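To answer the flowchart's last question, the following read-only console sketch gives a rough signal of whether expired job artifacts are piling up before they reach ci_deleted_objects. It does not account for artifacts that are locked or kept for the latest pipeline, so a small non-zero count is normal:

```ruby
# Expired job artifacts that may not yet have been handed over to
# ci_deleted_objects; capped at 1,000 records so the query stays cheap.
Ci::JobArtifact.where('expire_at < ?', Time.current).limit(1000).count
```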
If the problem is due to saturation of the worker, you can create an MR and update max_running_jobs for the worker to a threshold that can withstand the increased backlog.
If max_running_jobs is read from ApplicationSettings, ask an SRE or GitLab admin to update this setting. Warning: This will increase the load on PgBouncer, as each worker will be making a transactional call to the database.
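Before changing anything, you can inspect the current limit from a Rails console. This sketch assumes Ci::DeleteObjectsWorker uses the LimitedCapacity::Worker concern (as it does in recent GitLab versions); verify the method names against the version you are running:

```ruby
worker = Ci::DeleteObjectsWorker.new

# Maximum number of concurrent jobs the cron worker will keep enqueued.
worker.max_running_jobs

# How much work the worker reports as remaining (part of the
# LimitedCapacity interface; confirm it exists on your version).
worker.remaining_work_count
```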