High Number of Pending or Failed Outgoing Webhook Notifications
Background
The Container Registry is configured to emit webhook notifications that are consumed by the GitLab Rails /api/v4/container_registry_event/events endpoint.
These notifications are used by Rails to keep track of registry statistics and usage. They are not critical, and delivery is on a best-effort basis.
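To make the mechanism concrete, here is a minimal sketch of the kind of request the registry sends, assuming the standard docker/distribution events envelope; the host, secret, and payload values below are illustrative assumptions and are not taken from any production configuration:

```python
import json
import urllib.request

# Hypothetical values -- the real host, secret and payload come from the
# registry's `notifications` configuration and the event being emitted.
GITLAB_HOST = "https://gitlab.example.com"
ENDPOINT = f"{GITLAB_HOST}/api/v4/container_registry_event/events"

# Shape of the envelope the registry POSTs (docker/distribution events format);
# the field values here are only examples.
payload = {
    "events": [
        {
            "id": "320678d8-ca14-430f-8bb6-4ca139cd83f7",
            "timestamp": "2024-01-01T00:00:00Z",
            "action": "push",
            "target": {
                "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
                "repository": "group/project",
                "tag": "latest",
            },
        }
    ]
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={
        # Media type used by docker/distribution event notifications.
        "Content-Type": "application/vnd.docker.distribution.events.v1+json",
        # The registry authenticates with a shared secret configured under
        # notifications.endpoints[].headers; the header name/value here are assumptions.
        "Authorization": "<shared-secret>",
    },
    method="POST",
)

# A 2xx response means Rails accepted the event; anything else (or a network
# error) leaves the event pending and subject to the registry's retry/backoff.
with urllib.request.urlopen(request, timeout=5) as response:
    print(response.status)
```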
Causes
A high number of pending or failed events is likely related to one of these possibilities:
- Networking errors while sending outgoing requests to the /api/v4/container_registry_event/events endpoint on GitLab.com.
- Issues in the monolith itself that prevent notifications from being consumed.
- An application bug in the registry webhook notifications code.
Symptoms
The ContainerRegistryNotificationsPendingCountTooHigh alert will be triggered if the number of pending outgoing events is higher than the configured threshold for a prolonged period of time.
The ContainerRegistryNotificationsPendingQueueGrows alert will be triggered if the number of pending outgoing events is increasing faster than the configured threshold for a prolonged period of time.
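To look at the raw series behind these alerts outside of Grafana, a sketch like the following can query the Prometheus HTTP API; the Prometheus URL and the registry_notifications_pending_total metric name are assumptions, so substitute whatever the alert definitions actually reference:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: point PROMETHEUS_URL at the instance that scrapes the
# registry, and replace the expression with the series the alerts are built on.
PROMETHEUS_URL = "https://prometheus.example.com"
EXPRESSION = 'sum(registry_notifications_pending_total{environment="gprd"})'

query = urllib.parse.urlencode({"query": EXPRESSION})
with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/query?{query}", timeout=10) as resp:
    result = json.load(resp)

# Each entry carries a [timestamp, value] pair; a value that stays above the
# alert threshold, or keeps growing, matches what the alerts are reporting.
for series in result["data"]["result"]:
    print(series["metric"], series["value"])
```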
The list of dashboards referenced in the “Troubleshooting” section below can be used to gather insight on the system.
Troubleshooting
We first need to identify the cause of the accumulation of pending outgoing notifications. For this, we can look at the following Grafana dashboards and logs:
1. registry-notifications/webhook-notifications-detail
2. api-main/api-overview
3. cloudflare-main/cloudflare-overview
4. Rails API logs
In (1), we should look at the failure and error rates, as well as the different status codes in the Events per second (by Status Code) panel.
In (2) and (3), we should look for potential errors at the Rails API level or any Cloudflare errors affecting the notifications delivery rate.
In (4), we can inspect the Rails API logs for requests to the /api/v4/container_registry_event/events endpoint for clues about what might be going wrong.
In the presence of errors, we should also look at the registry access/application logs in Kibana.
Searching for the string “error writing event” there might show the error details recorded while the registry was trying to send a notification.
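If Kibana is unavailable or a scripted search is preferred, a sketch along these lines runs the same search against the Elasticsearch API directly; the cluster URL, index pattern, and json.msg field name are assumptions that need to be adapted to the actual logging setup:

```python
import json
import urllib.request

# Hypothetical values: the logging cluster URL, index pattern and message field
# depend on how registry logs are shipped; adjust them to match what Kibana uses.
ES_URL = "https://logs.example.com"
INDEX = "registry-*"

body = {
    "size": 20,
    "sort": [{"@timestamp": {"order": "desc"}}],
    "query": {"match_phrase": {"json.msg": "error writing event"}},
}

request = urllib.request.Request(
    f"{ES_URL}/{INDEX}/_search",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Print the most recent matching log entries so the error details (status code,
# endpoint, retry information) can be read directly.
with urllib.request.urlopen(request, timeout=10) as resp:
    for hit in json.load(resp)["hits"]["hits"]:
        print(hit["_source"])
```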
The same applies to Sentry, where all unknown application errors are reported.
Resolution
If there are no signs of relevant application/network errors, and all metrics seem to point to an inability to keep up with demand, please contact container-registry.