High Number of Pending or Failed Outgoing Webhook Notifications
Background
The Container Registry is configured to emit webhook notifications that are consumed by the GitLab Rails /api/v4/container_registry_event/events endpoint, as seen here.
These notifications are used by Rails to keep track of registry statistics and usage, so the endpoint itself is not critical. However, the webhook notification system enqueues events one at a time per registry instance and keeps retrying each event until it succeeds, which can lead to high resource consumption, as seen in this issue.
NOTE:
There is pending work to fix the issue above by replacing the threshold setting with maxretries as part of this issue.
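For reference, these settings live in the registry configuration file under notifications.endpoints. The snippet below is a minimal sketch assuming a single GitLab Rails endpoint; the endpoint name, header, and values are illustrative, not the production configuration:

```yaml
notifications:
  endpoints:
    - name: gitlab-rails   # illustrative endpoint name
      url: https://gitlab.example.com/api/v4/container_registry_event/events
      headers:
        Authorization: ['<placeholder-secret-token>']
      timeout: 500ms   # per-request timeout (illustrative value)
      threshold: 5     # failures tolerated before backing off (illustrative value)
      backoff: 1s      # wait between retries once the threshold is hit (illustrative value)
```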
Causes
A high number of pending or failed events is likely related to one of these possibilities:
- A networking error while sending an outgoing request to the /api/v4/container_registry_event/events endpoint on GitLab.com.
- An application bug in the registry webhook notifications code.
Symptoms
The ContainerRegistryNotificationsPendingCountTooHigh alert is triggered when the number of pending outgoing events is higher than the configured threshold for a prolonged period of time.
A small Registry API impact could be expected in these situations until this issue is implemented, as the registry instances could run out of resources to serve requests. Ideally, high resource usage would be caught by other metrics, and the Kubernetes scheduler would recycle pods if the memory/CPU usage threshold is reached.
Also, the ContainerRegistryNotificationsFailedStatusCode alert fires when the response code received by the registry notifications system is different from the expected 200 OK. The registry_notifications_status_total metric can be used to help diagnose a potential networking problem.
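As an illustration only, an alert of this kind could be expressed as a Prometheus rule over that metric. This is a sketch, not the production alert definition; the code label name, its value format, and the thresholds are assumptions:

```yaml
groups:
  - name: container-registry-notifications
    rules:
      - alert: ContainerRegistryNotificationsFailedStatusCode
        # `code` label name and the "200 OK" value format are assumptions, not verified.
        expr: 'sum by (code) (rate(registry_notifications_status_total{code!="200 OK"}[5m])) > 0'
        for: 15m
        annotations:
          summary: Registry webhook notifications are receiving non-200 responses
```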
Troubleshooting
We first need to identify the cause of the accumulation of pending outgoing notifications. For this, we can look at the following Grafana dashboards and the Rails API logs:
1. registry-main/registry-overview
2. registry-notifications/webhook-notifications-detail
3. api-main/api-overview
4. cloudflare-main/cloudflare-overview
5. Rails API logs
In (1), we should inspect the current Apdex/error rate SLIs, both for the server component (to rule out any unexpected customer impact) and the database component. Expanding the Node Metrics section gives an indication of high memory or CPU usage.
In (2), we should look at the failure and error rates, as well as the different status codes in the Events per second (by Status Code)
panel.
In (3) and (4), we should look for potential errors at the Rails API level or any Cloudflare errors affecting the notification delivery rate.
In (5), we can monitor the Rails API logs for requests to the /api/v4/container_registry_event/events endpoint for clues on what could be going wrong.
In the presence of errors, we should also look at the registry access/application logs in Kibana. Searching for the string error writing event might surface error details recorded while trying to send a notification.
The same applies to Sentry, where all unknown application errors are reported.
Resolution
Suppose there are no signs of relevant application/network errors, and all metrics point to an inability to keep up with demand. In that case, we should adjust the notification settings to meet the demand by, for example, increasing the backoff period and/or adjusting the threshold setting.
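For example, assuming a notifications block like the one sketched in the Background section, the tuning could look like this (values are illustrative, not recommendations):

```yaml
notifications:
  endpoints:
    - name: gitlab-rails
      # ...other settings unchanged...
      threshold: 10   # tolerate more failures before backing off (illustrative value)
      backoff: 5s     # wait longer between retries to reduce pressure (illustrative value)
```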
An alternative is to recycle the affected pods. However, the events in the pending queue will be dropped, affecting the Registry Usage metrics.
In the presence of errors, the development team should be involved in debugging the underlying cause.