NATS Service

NATS is a messaging layer that supports data messaging and queuing needs at GitLab scale. It is currently used by the Data Insights Platform for buffering events generated by GitLab.

NATS JetStream is NATS's built-in persistence mechanism; we use it to provide message streaming for the buffered events.

Our NATS workloads run in a clustered configuration with 3 replicas. See the configuration in gitlab-helmfiles here.
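For illustration, a clustered, JetStream-enabled deployment via the official nats Helm chart might use a values excerpt like the one below. This is a sketch only: the field layout varies by chart version, and the values shown (e.g. the PVC size) are hypothetical, not our actual configuration — see gitlab-helmfiles for the real values.

```yaml
# Illustrative excerpt only; not our production values.
config:
  cluster:
    enabled: true
    replicas: 3        # 3-node cluster, tolerates loss of 1 node
  jetstream:
    enabled: true
    fileStore:
      pvc:
        size: 10Gi     # hypothetical size; persisted on an external volume
```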

We rely on at-least-once delivery semantics, using JetStream consumers to durably persist all ingested data.

A subscriber can ACK or NACK an event. If processing fails, NATS JetStream will redeliver the event until it is acknowledged or the configured retention period elapses. Our current implementation is built around this principle and therefore allows for possible duplicates. JetStream consumers are stateful: with an explicit acknowledgment policy (AckExplicit), the server attempts redelivery until each message is acknowledged. If a consumer process dies in the middle of processing, all unacknowledged events will be processed again.
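The redelivery behavior above can be sketched as a toy model. This is an illustration of the semantics only, not the real NATS client API; the class and method names are invented for this example.

```python
from collections import OrderedDict

class ToyAckExplicitConsumer:
    """Toy model of AckExplicit redelivery semantics (illustrative only).

    Messages stay pending until explicitly acknowledged; anything
    unacknowledged (e.g. after a consumer crash) is delivered again.
    """
    def __init__(self, messages):
        # sequence number -> payload; all messages start unacknowledged
        self._pending = OrderedDict(enumerate(messages, start=1))

    def deliver(self):
        # Every message not yet acked is (re)delivered.
        return list(self._pending.items())

    def ack(self, seq):
        # Explicit ACK removes the message from the pending set.
        self._pending.pop(seq, None)

consumer = ToyAckExplicitConsumer(["evt-a", "evt-b", "evt-c"])

first = consumer.deliver()   # all three messages delivered
consumer.ack(first[0][0])    # ack only the first one
# "Crash" before acking the rest: on the next pull, the two
# unacknowledged events come back again -- duplicates are possible.
redelivered = consumer.deliver()
print([payload for _, payload in redelivered])  # ['evt-b', 'evt-c']
```

This is why downstream processing must tolerate duplicates: redelivery after a crash replays every unacknowledged event, including any that were processed but not yet acked.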

JetStream supports specifying a replication factor (R), which allows data to be replicated across nodes for fault tolerance. Quoting the docs from here:

JetStream uses a NATS optimized RAFT distributed quorum algorithm to distribute the persistence service between NATS servers in a cluster while maintaining immediate consistency (as opposed to eventual consistency) even in the face of failures.

We use data replication for JetStream streams. Our current setup uses a replication factor of 3; see gitlab-org/analytics-section/platform-insights/core#87 for implementation details.

JetStream can persist all received events at the NATS server even after they have been acknowledged by a consumer. This is controlled by the stream's retention policy and limits. We currently use LimitsPolicy, which purges data according to configured limits (e.g. age, size). Retention is currently set to 3 days for all received events.
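As a sketch, a stream configuration matching the policies described above (LimitsPolicy, a 3-day age limit, replication factor 3) could look like this in the JetStream API's JSON form. The stream name is hypothetical, and `max_age` is expressed in nanoseconds (3 days = 259200000000000 ns):

```json
{
  "name": "events",
  "retention": "limits",
  "max_age": 259200000000000,
  "storage": "file",
  "num_replicas": 3
}
```

With limits-based retention, acknowledgment does not delete a message; events remain readable (e.g. for replay) until the age limit purges them.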

Please note that retention at NATS is different from persistence at the downstream components, which should always be present in the configuration of the Data Insights Platform. Data in NATS is meant to be temporary, but we allow a retention period to support features such as replay.

See the backups file for more details.

We run the NATS server in clustering mode with 3 replicas. This should tolerate the loss of 1 server (quorum requires 2 of 3 replicas), even across complete restarts (ref).

If the cluster loses quorum, the server will be marked as down and will no longer accept messages. The Data Insights Platform has probing mechanisms (ref) that detect this and stop accepting messages. Clients of the platform are recommended to have a mechanism to locally buffer messages for the period when NATS is down. See the discussion here for more details.
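A client-side local buffer of the kind recommended above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: `publish` stands in for any publish callable that raises `ConnectionError` on failure, and the class name and bound are invented for this example.

```python
from collections import deque

class LocalBuffer:
    """Sketch of client-side buffering for when NATS is unavailable."""

    def __init__(self, publish, maxlen=10_000):
        self._publish = publish            # callable; raises ConnectionError on failure
        self._queue = deque(maxlen=maxlen)  # bounded: oldest messages drop if full

    def send(self, msg):
        self._queue.append(msg)
        self.flush()

    def flush(self):
        # Drain in order; stop at the first failure and keep the rest buffered.
        while self._queue:
            try:
                self._publish(self._queue[0])
            except ConnectionError:
                return
            self._queue.popleft()

    def pending(self):
        return len(self._queue)
```

Once NATS recovers, calling `flush()` (for example on a timer or on a reconnect callback) drains the backlog in order. Note the bounded queue is a deliberate trade-off: under a long outage the client drops its oldest messages rather than growing without limit.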

We have alerts on the loss of NATS servers here.

As streams are backed up at regular intervals, the only possible data loss is between the last successful backup and the time the cluster failed. However, several design choices further reduce this possibility:

  • Data buffered in NATS is temporary and is enriched/exported in near real time, so little data is at risk at any moment; this protection diminishes linearly with any lag in the exporter processes. (ref)
  • Data is persisted on external data volumes (PVCs) and should be recoverable under the applicable cloud-provider guarantees.
  • Data backups are intentionally frequent.
  • Recent data can be made (re)available from the producer.

TBD: Discussions on operational support under the new component-ownership model are ongoing. In the meantime, the teams utilizing this system will be available on the Tier 2 call for it.

Current teams:

  • Platform Insights: Slack: g_analytics_platform_insights, Handbook link.
  • Fulfillment Platform: Slack: g_fulfillment_platform, Handbook link.

See monitoring.md for more details.

See operations.md for more details.