ContainerRegistryNotifications
Overview
- What does this alert mean?
- The number of pending outgoing notifications is too high or grows too quickly.
- This can happen when notifications fail to be sent, typically when the error rate for notification sending is also high.
- What factors can contribute?
- Increased load on the registry pods.
- A significant number of notifications generated by the registry due to a traffic spike.
- The Rails monolith does not ingest notifications, or there are network issues.
- What action is the recipient of this alert expected to take when it fires?
Services
- Service Overview
- Team that owns the service: Container Registry
Metrics
- Metric: registry_notifications_pending.
Briefly explain the metric this alert is based on and link to the metrics catalogue. What unit is it measured in? (e.g., CPU usage in percentage, request latency in milliseconds)
- Dashboard URL focusing on the Events queued per second panel.
- The queue will grow while there are errors/failures.
Explain the reasoning behind the chosen threshold value for triggering the alert. Is it based on historical data, best practices, or capacity planning?
- If the gauge metric keeps increasing, it means we are not dispatching any events. Having a low threshold should signal issues early on, before we see failures such as lack of resources.
- Each pod has a maximum queue size of 1000 events; above that threshold, events are dropped. Since the limit is per-pod, a global threshold of 1000 is a safe bet even if the number of unsent notifications is high on only a single pod, while still being enough to signal issues, as the queue size normally hovers around ~5 events for the whole cluster.
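The per-pod cap described above can be sketched as a bounded queue that drops events once it is full. This is a minimal illustration of the drop semantics, assuming a simple buffered-channel design; the constant value matches the 1000-event limit above, but all names are hypothetical rather than the registry's implementation.

```go
package main

import "fmt"

// maxPending mirrors the per-pod cap: once the queue holds this many
// events, new notifications are dropped rather than queued.
const maxPending = 1000

type queue struct {
	events  chan string
	dropped int
}

func newQueue() *queue {
	return &queue{events: make(chan string, maxPending)}
}

// enqueue tries to add an event without blocking; if the buffer is full,
// the event is dropped and the drop counter is incremented.
func (q *queue) enqueue(e string) bool {
	select {
	case q.events <- e:
		return true
	default:
		q.dropped++
		return false
	}
}

func main() {
	q := newQueue()
	for i := 0; i < maxPending+5; i++ {
		q.enqueue(fmt.Sprintf("event-%d", i))
	}
	fmt.Println(len(q.events), q.dropped) // 1000 5
}
```

This is why alerting at the global 1000 mark matters: once a single pod's queue is full, further events on that pod are silently lost, not merely delayed.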
Describe the expected behavior of the metric under normal conditions. This helps identify situations where the alert might be falsely firing.
- This metric should go up and down as pending events are queued and dispatched.
- Some peaks are expected during traffic peak times.
- The Pending events panels should show a relatively low two-digit number.
Alert Behavior
Expected frequency of the alert. Is it a high-volume alert or expected to be rare?
- Should be rare.
Show historical trends of the alert firing, e.g. a Kibana dashboard.
- N/A (new alert)
Severities
Guidance for assigning incident severity to this alert
- S3
Who is likely to be impacted by the cause of this alert?
- Customers pushing/pulling images to the container registry.
Things to check to determine severity
- Service overview
- Escalate if service is degraded for a prolonged period of time.
Verification
- Metric explorer
- Registry logs
- Dashboards:
  - registry-main/registry-overview
  - registry-notifications/webhook-notifications-detail
  - api-main/api-overview
  - cloudflare-main/cloudflare-overview
- Rails API logs.
Recent changes
How to properly roll back changes
- Check the changelog in the MR that updated the registry.
- Review MRs included in the related release issue
- If any MR has the label ~cannot-rollback applied, a detailed description should exist in that MR.
- Otherwise, proceed to revert the commit and watch the deployment.
- Review the dashboards and expect the metric to go back to normal.
Troubleshooting
- Registry troubleshooting
Possible Resolutions
Dependencies
- Rails API
- Cloudflare/firewall rules