ContainerRegistryNotifications

  • What does this alert mean?
    • The number of pending outgoing notifications is too high or grows too quickly.
    • This can happen when notifications fail to be sent, typically accompanied by a high error rate for notification sending.
  • What factors can contribute?
    • Increased load on the registry pods.
    • A significant number of notifications generated by the registry due to a traffic spike.
    • The Rails monolith does not ingest notifications, or there are network issues.
  • What action is the recipient of this alert expected to take when it fires?
  • Metric: registry_notifications_pending (see the gauge sketch after this list for what it measures).
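
For context on what the metric measures, the snippet below is a minimal, hypothetical sketch (not the registry's actual code) of how a gauge named registry_notifications_pending could be exposed with Go's prometheus/client_golang. The pendingQueue variable and the listen port are assumptions for illustration only.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pendingGauge mirrors the idea behind registry_notifications_pending:
// a gauge counting events that are queued but not yet delivered,
// measured as a plain number of events.
var pendingGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "registry_notifications_pending",
	Help: "Number of outgoing notifications queued but not yet sent.",
})

// pendingQueue is a hypothetical stand-in for the registry's internal queue.
var pendingQueue = make(chan struct{}, 1000)

func main() {
	prometheus.MustRegister(pendingGauge)

	// In a real sender loop the gauge would be updated as events are
	// enqueued and dispatched; here we simply reflect the queue length.
	pendingGauge.Set(float64(len(pendingQueue)))

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```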

Briefly explain the metric this alert is based on and link to the metrics catalogue. What unit is it measured in? (e.g., CPU usage in percentage, request latency in milliseconds)

  • Dashboard URL focusing on the Events queued per second panel.
  • The queue will grow while there are errors/failures; see the query sketch below for a quick way to check the current size and trend.
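
When the alert fires, a quick way to confirm whether the queue is actually growing (rather than just spiking) is to query the gauge and its recent change. The sketch below uses the Prometheus Go client; the Prometheus address, the aggregation, and the 10m window are assumptions to adapt to the production setup.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus address; replace with the real endpoint.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example.internal:9090"})
	if err != nil {
		log.Fatalf("creating Prometheus client: %v", err)
	}
	v1api := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Current cluster-wide queue size, and how much it changed over the last 10 minutes.
	for _, q := range []string{
		`sum(registry_notifications_pending)`,
		`sum(delta(registry_notifications_pending[10m]))`,
	} {
		result, warnings, err := v1api.Query(ctx, q, time.Now())
		if err != nil {
			log.Fatalf("query %q failed: %v", q, err)
		}
		if len(warnings) > 0 {
			log.Printf("warnings for %q: %v", q, warnings)
		}
		fmt.Printf("%s => %v\n", q, result)
	}
}
```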

Explain the reasoning behind the chosen threshold value for triggering the alert. Is it based on historical data, best practices, or capacity planning?

  • If the gauge metric keeps increasing, it means we are not dispatching events. Having a low threshold should signal issues early on, before we see failures such as resource exhaustion.
  • Each pod has a maximum of 1000 queued events; above that threshold, events are dropped. Because the limit is per-pod, a global threshold of 1000 is a safe bet even when unsent notifications pile up on a single pod only, and it is still well above the normal queue size, which hovers around ~5 events for the whole cluster, so crossing it clearly signals a problem. A sketch of the drop-on-full behavior follows this list.
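
The following is a minimal sketch of the drop-on-full behavior described above, assuming a buffered per-pod queue capped at 1000 events. It is not the registry's actual implementation, and the registry_notifications_dropped_total counter name is hypothetical.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

const maxPending = 1000 // per-pod cap; events beyond this are dropped

type event struct{ action, repository string }

type notifier struct {
	queue   chan event
	pending prometheus.Gauge
	dropped prometheus.Counter
}

func newNotifier() *notifier {
	return &notifier{
		queue: make(chan event, maxPending),
		pending: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "registry_notifications_pending",
			Help: "Number of outgoing notifications queued but not yet sent.",
		}),
		dropped: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "registry_notifications_dropped_total", // hypothetical metric name
			Help: "Notifications dropped because the per-pod queue was full.",
		}),
	}
}

// enqueue adds an event to the per-pod queue, or drops it when the cap is reached.
func (n *notifier) enqueue(e event) {
	select {
	case n.queue <- e:
		n.pending.Set(float64(len(n.queue)))
	default:
		n.dropped.Inc() // queue is full: the event is lost
	}
}

func main() {
	n := newNotifier()
	prometheus.MustRegister(n.pending, n.dropped)

	n.enqueue(event{action: "push", repository: "group/project"})
	fmt.Println("pending events:", len(n.queue))
}
```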

Describe the expected behavior of the metric under normal conditions. This helps identify situations where the alert might be falsely firing.

  • This metric should go up and down as pending events are queued and dispatched.
  • Some peaks are expected during traffic peak times.
  • The Pending events panels should show a relatively low two-digit number.

Expected frequency of the alert. Is it a high-volume alert or expected to be rare?

  • Should be rare.

Show historical trends of the alert firing, e.g. a Kibana dashboard

  • N/A (new alert)

Guidance for assigning incident severity to this alert

  • s3

Who is likely to be impacted by the cause of this alert?

  • Customers pushing/pulling images to the container registry.

Things to check to determine severity

  • Service overview
  • Escalate if the service is degraded for a prolonged period of time.

Recent changes

How to properly roll back changes

  • Check the changelog in the MR that updated the registry.
  • Review MRs included in the related release issue.
  • If any MR has the label ~cannot-rollback applied, a detailed description should exist in that MR.
  • Otherwise, proceed to revert the commit and watch the deployment.
  • Review the dashboards and expect the metric to go back to normal.
  • Rails API
  • Cloudflare/firewall rules