NATS monitoring
TBD: We should have a separate dashboard for the status of NATS servers.
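Server availability can be checked with the following PromQL query. It returns one series for each (type, env) group in which every series of the healthz metric carries value="ok":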
(count by(type, env) (nats_healthz_js_enabled_only_status_value{value="ok"}) == bool count by(type, env) (nats_healthz_js_enabled_only_status_value)) == 1
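Until a dedicated dashboard exists, the same check could be scripted against the Prometheus HTTP API. The following is a minimal sketch, not an existing tool: the function names, the base URL passed to query_prometheus, and the sample label values ("core", "prod") are assumptions made for illustration. Only the /api/v1/query endpoint and its JSON response shape come from Prometheus itself.

```python
import json
import urllib.parse
import urllib.request

# The availability query from this page.
QUERY = (
    '(count by(type, env) (nats_healthz_js_enabled_only_status_value{value="ok"}) '
    "== bool count by(type, env) (nats_healthz_js_enabled_only_status_value)) == 1"
)


def query_prometheus(base_url: str, query: str, timeout: float = 10.0) -> dict:
    """Run an instant query via the Prometheus HTTP API and return the decoded JSON."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def healthy_groups(response: dict) -> set[tuple[str, str]]:
    """Return the (type, env) pairs present in an instant-vector result.

    Because the query ends with `== 1`, only fully healthy groups are returned
    by Prometheus; any group missing from this set needs attention.
    """
    if response.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {response.get('error', 'unknown error')}")
    result = response["data"]["result"]
    return {(s["metric"].get("type", ""), s["metric"].get("env", "")) for s in result}


# Hand-written sample in the API's instant-vector shape (hypothetical, not real data).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"type": "core", "env": "prod"}, "value": [1700000000, "1"]},
        ],
    },
}
print(healthy_groups(sample))  # -> {('core', 'prod')}
```

Against a live server, the call would look like healthy_groups(query_prometheus(base_url, QUERY)), where base_url is the address of our Prometheus instance. Comparing the returned set with the list of expected (type, env) pairs shows which groups are degraded or missing.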
The existing NATS dashboard covers rate and error metrics for requests to NATS servers. Slow consumers and messages redelivered to consumers are the error indicators we currently track there.
The NATS monitoring docs are a good reference for the other metrics the system exposes. We rely on the NATS Prometheus exporter to export these metrics into our environment.