ContainerRegistryNotifications
Overview
- What does this alert mean?
- The number of pending outgoing notifications is too high or grows too quickly.
- This can happen when notifications fail to be sent, typically when the error rate for notification sending is also high.
- What factors can contribute?
- Increased load on the registry pods.
- A significant number of notifications generated by the registry due to a traffic spike.
- The Rails monolith does not ingest notifications, or there are network issues.
- What action is the recipient of this alert expected to take when it fires?
Services
- Service Overview
- Team that owns the service: Container Registry
Metrics
- Metric: registry_notifications_pending.
Briefly explain the metric this alert is based on and link to the metrics catalogue. What unit is it measured in? (e.g., CPU usage in percentage, request latency in milliseconds)
- Dashboard URL focusing on the Events queued per second panel.
- The queue will grow while there are errors/failures.
Explain the reasoning behind the chosen threshold value for triggering the alert. Is it based on historical data, best practices, or capacity planning?
- If the gauge metric keeps increasing, it means we are not dispatching any events. Having a low threshold should signal issues early on, before we see failures such as lack of resources.
- Each pod has a maximum queue size of 1000 events; above that threshold, events are dropped. Since the limit is per-pod, a global threshold of 1000 is a safe bet even if the number of unsent notifications is high on only a single pod, while still being enough to signal issues, as the queue size normally hovers around ~5 events for the whole cluster.
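The per-pod cap described above can be sketched as a bounded queue that drops events once it is full. This is a minimal illustration of the drop semantics, assuming a simple buffered-channel design; the constant value matches the 1000-event limit above, but all names are hypothetical rather than the registry's implementation.

```go
package main

import "fmt"

// maxPending mirrors the per-pod cap: once the queue holds this many
// events, new notifications are dropped rather than queued.
const maxPending = 1000

type queue struct {
	events  chan string
	dropped int
}

func newQueue() *queue {
	return &queue{events: make(chan string, maxPending)}
}

// enqueue tries to add an event without blocking; if the buffer is full,
// the event is dropped and the drop counter is incremented.
func (q *queue) enqueue(e string) bool {
	select {
	case q.events <- e:
		return true
	default:
		q.dropped++
		return false
	}
}

func main() {
	q := newQueue()
	for i := 0; i < maxPending+5; i++ {
		q.enqueue(fmt.Sprintf("event-%d", i))
	}
	fmt.Println(len(q.events), q.dropped) // 1000 5
}
```

This is why alerting at the global 1000 mark matters: once a single pod's queue is full, further events on that pod are silently lost, not merely delayed.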
Describe the expected behavior of the metric under normal conditions. This helps identify situations where the alert might be falsely firing.
- This metric should go up and down as pending events are queued and dispatched.
- Some peaks are expected during traffic peak times.
- The Pending events panels should show a relatively low two-digit number.
Alert Behavior
Expected frequency of the alert. Is it a high-volume alert or expected to be rare?
- Should be rare.
Show historical trends of the alert firing, e.g. a Kibana dashboard.
- N/A (new alert)
Severities
Guidance for assigning incident severity to this alert
- S3
Who is likely to be impacted by the cause of this alert?
- Customers pushing/pulling images to the container registry.
Things to check to determine severity
- Service overview
- Escalate if service is degraded for a prolonged period of time.
Verification
- Metric explorer
- Registry logs
- Dashboards:
  - registry-main/registry-overview
  - registry-notifications/webhook-notifications-detail
  - api-main/api-overview
  - cloudflare-main/cloudflare-overview
- Rails API logs.
Recent changes
How to properly roll back changes
- Check the changelog in the MR that updated the registry.
- Review MRs included in the related release issue
- If any MR has the label ~cannot-rollback applied, a detailed description should exist in that MR.
- Otherwise, proceed to revert the commit and watch the deployment.
- Review the dashboards and expect the metric to go back to normal.
Troubleshooting
- Registry troubleshooting
Possible Resolutions
Dependencies
- Rails API
- Cloudflare/firewall rules