SiphonLogicalReplicationSlotLagHigh

This alert fires when a Siphon logical replication slot on a GitLab.com Patroni primary has more than 100 GiB of unconfirmed WAL. In other words, Siphon’s producer for the affected cluster is at least 100 GiB behind the primary’s current WAL position.

Siphon consumes PostgreSQL’s WAL via logical replication slots named like prd_main_siphon_slot_1, prd_ci_siphon_slot_1, prd_sec_siphon_slot_1 (or the stg_* equivalents on staging). Each slot tracks a consumer’s confirmed_flush_lsn — the LSN up to which Siphon has confirmed receiving and persisting data. Until the slot advances past an LSN, PostgreSQL must retain the WAL for it.
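An LSN such as 2B/50000000 is a 64-bit WAL position written as two 32-bit hex halves, so the lag behind the primary is plain byte arithmetic on those positions (this is what pg_wal_lsn_diff computes server-side). A minimal sketch of that arithmetic, using hypothetical LSN values:

```shell
# Convert an LSN ("high/low" 32-bit hex halves) to an absolute byte position.
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))   # hi * 2^32 + lo
}

# Hypothetical values: primary at 2B/50000000, slot confirmed at 2B/40000000
current=$(lsn_to_bytes "2B/50000000")
confirmed=$(lsn_to_bytes "2B/40000000")
echo $(( current - confirmed ))   # → 268435456 bytes (256 MiB) of unconfirmed WAL
```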

Contributing factors include:

  • Siphon producer is crashed, scaled to zero, or stuck.
  • NATS JetStream (Siphon’s downstream) is unavailable, so the producer cannot publish events and therefore cannot acknowledge WAL.
  • A very large transaction on the primary has produced a burst of WAL that Siphon cannot process quickly enough.

Parts of the service affected:

  • Directly: the Patroni primary whose fqdn is in the alert label set. WAL is accumulating on its data volume.
  • Indirectly: GitLab.com itself — if WAL accumulation fills the Patroni primary’s data volume, the database will stop accepting writes.

The recipient is expected to triage Siphon first (get the producer caught up or stop it cleanly). Scaling the producer to 0 and then back to 1 quickly may resolve the situation.

  • Metric: pg_replication_slots_confirmed_flush_lsn_bytes
  • Unit: bytes
  • Source: postgres_exporter on the Patroni primaries, scraped into the gitlab-gprd and gitlab-gstg Mimir tenants.
  • Despite the name, this metric is emitted as the delta between the primary’s current WAL LSN and the slot’s confirmed_flush_lsn — i.e. it is equivalent to pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn). It rises as the consumer falls behind and falls as the consumer catches up.
  • Threshold: 100 GiB (100 * 1024 * 1024 * 1024 bytes), sustained for 5 minutes. This threshold is a compromise between catching real problems early and not paging on transient backpressure. The Siphon runbook notes that mitigation is typically required within < 1 day, and GitLab.com Patroni primaries generate WAL at up to ~150 MB/s (see PrimaryDatabaseWALGenerationSaturationSustainedOver150MBS), so 100 GiB corresponds to roughly 10–15 minutes of unconsumed WAL at saturation — well before disk retention becomes a concern.
  • Normal behaviour: the metric hovers in the low MB to single-digit GB range for healthy slots, oscillating as the consumer catches up. A sustained upward trend is the typical failure mode and the signal this alert is tuned to catch.
  • This alert clears automatically once the slot catches up below the 100 GiB threshold.
  • Silencing is only appropriate during an open Change Request for Siphon or the affected Patroni cluster, or when a re-snapshot is deliberately in progress.
  • This should be a rare alert. Under normal operation, Siphon keeps well under the threshold. A firing alert almost always indicates a producer outage.
  • Automatic safeguard: GitLab.com Patroni clusters are configured with max_slot_wal_keep_size so PostgreSQL will invalidate a slot that retains more than the configured amount of WAL, rather than let the primary run out of disk. This alert is tuned to fire well before that limit, so that a human can catch up or cleanly stop the producer without forcing a full Siphon re-snapshot. See database-team/meetings#26 (comment 3271711718).
  • Previous incidents tagged with this alert
  • Default severity: s2, pager: pagerduty. This alert pages because unchecked lag on a Siphon slot can exhaust disk on a GitLab.com Patroni primary, which would compromise production.
  • Direct impact at firing time is usually limited to Siphon’s downstream consumers (ClickHouse-backed analytics). The severity reflects the risk of escalation to a primary-database outage, not the immediate customer impact.
  • Escalate to s1 if any of the following are true:
    • The Patroni primary’s data volume utilisation is above 70% and trending upward — disk exhaustion on the primary is imminent and customer-wide.
    • PrimaryDatabaseWALGenerationSaturationSustainedOver150MBS is also firing on the same cluster — WAL is being generated faster than Siphon can ever catch up.
    • The slot's lag has been growing for several hours with no sign of progress.
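The 10–15 minute figure quoted for the threshold is straightforward arithmetic, sketched below (the ~150 MB/s rate is the saturation ceiling referenced above, not a typical rate, so at normal WAL rates the window is correspondingly longer):

```shell
# How long does it take to accumulate 100 GiB of WAL at ~150 MB/s?
threshold=$((100 * 1024 * 1024 * 1024))   # alert threshold in bytes
rate=$((150 * 1000 * 1000))               # ~150 MB/s in bytes/second
echo "$(( threshold / rate )) s ≈ $(( threshold / rate / 60 )) minutes"   # → 715 s ≈ 11 minutes
```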

You can also verify the lag and the active session holding the slot directly on the primary:

SELECT
  slot_name,
  active,
  active_pid,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_type = 'logical' AND slot_name ILIKE '%siphon%';
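Given the max_slot_wal_keep_size safeguard described above, it is also worth checking how close the slot is to invalidation. On PostgreSQL 13 and later, pg_replication_slots exposes wal_status and safe_wal_size for exactly this (a sketch; run on the primary):

```sql
-- wal_status degrades from 'reserved' to 'extended' to 'unreserved' and
-- finally 'lost' once max_slot_wal_keep_size is exceeded; safe_wal_size is
-- how many more bytes of WAL can be written before the slot is invalidated.
SELECT slot_name, wal_status, pg_size_pretty(safe_wal_size) AS safe_wal
FROM pg_replication_slots
WHERE slot_type = 'logical' AND slot_name ILIKE '%siphon%';
```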

Full triage steps live in the Siphon runbook under High logical replication lag / producer not running. The recommended order of operations is:

  1. Identify the affected DB cluster. The slot_name label in the alert encodes it (e.g. prd_main_siphon_slot_1 → main, prd_ci_siphon_slot_1 → ci, prd_sec_siphon_slot_1 → sec).

  2. Check NATS JetStream health. If NATS is down, Siphon cannot publish and lag will grow. See the NATS runbook.

  3. Check the Siphon producer pods on the relevant orbit-* cluster:

    glsh kube use-cluster orbit-prd --no-proxy
    kubectl get pods -n siphon -o wide
    kubectl logs -n siphon -l app=postgres-producer-<db> --tail=200
  4. Try bouncing the producer. Scaling the deployment to 0 and then back to 1 quickly often unsticks a wedged producer:

    kubectl scale deployment postgres-producer-<db> --replicas=0 -n siphon
    kubectl scale deployment postgres-producer-<db> --replicas=1 -n siphon
  5. If bouncing doesn’t help, disconnect the active session holding the slot on the primary. This forces the producer to reconnect and can clear stuck replication state. First find the PID:

    SELECT
      slot_name,
      active_pid,
      pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
    FROM pg_replication_slots
    WHERE slot_type = 'logical' AND slot_name ILIKE '%siphon%';

    Then terminate the active_pid if present:

    SELECT pg_terminate_backend(<active_pid>);

Additional dashboards:

  • Siphon producer (postgres-producer-* pods in the siphon namespace).
  • NATS JetStream — Siphon’s pub/sub layer.
  • Patroni primary of the affected cluster (patroni, patroni-ci, or patroni-sec) — the host PostgreSQL that owns the slot.
Escalation contacts:

  • Primary contact: #g_analytics_platform_insights Slack channel (team handbook). Individuals with the most context: @ahegyi, @arun.sori, @ankitbhatnagar.
  • For immediate emergencies, or if there is no response from the primary contacts, escalate to #database_operations — database reliability can step in to prevent disk exhaustion on the Patroni primary.