Periodic Job Monitoring
We have lots of jobs that run regularly. We use Prometheus & Alertmanager to monitor and alert on jobs that are either still running and taking too long to finish, or that have completed but took longer than expected (details). This type of alerting relies on jobs actually starting to be able to alert on them. What about jobs that quietly fail to trigger when they’re supposed to?
Dead Man’s Snitch (or DMS) is a third-party monitoring tool for periodic processes, and we use this service for alerting us when jobs fail to trigger when they’re expected to.
How does it work?
Section titled “How does it work?”You create a “snitch”, which is a unique URL for a job to be monitored and the interval that the job is expected to run (or smart mode to figure out the interval based on when the job checks in). Every time the job runs, it needs to hit the unique URL. If it doesn’t hit the URL within the expected interval, it triggers an alert.
What receivers does it support?
Section titled “What receivers does it support?”It can send an email and/or trigger one of the supported integrations (e.g., Slack, PagerDuty, etc).
What integrations have we got configured?
Section titled “What integrations have we got configured?”Currently we have:
- PagerDuty configured for any snitches tagged
pager
- Slack
#alerts
for all snitches - Slack
#database
for any snitches taggeddatabase
This list may be out-of-date so best refer to the DMS integrations page for an up-to-date listing.
How are snitches configured?
Section titled “How are snitches configured?”They are manually configured. We’re aware that there’s a Terraform provider for DMS, but we haven’t tried it yet.