Data Insights Platform Runbooks

Data Insights Platform (DIP) is a unified abstraction to ingest, process, persist & query analytical data events generated across GitLab, enabling us to compute business insights across the product.

It’s designed to be a general-purpose data toolkit that can transport event data from one system to another while dynamically enriching the ingested data. It currently serves the following use-cases:

The following are general components that constitute a Data Insights Platform instance. For details around specific use-cases and/or environment-specific architectures, refer to their dedicated sections as linked earlier in the document.

DIP Overview

| Component | Description |
|---|---|
| Ingress | All ingress into our currently supported deployments of Data Insights Platform is proxied via Cloudflare. On the GKE side, we use ingress-nginx as our ingress controller, which in turn uses an IP whitelist containing the advertised Cloudflare IP ranges. |
| Ingesters | Single ingestion mechanism for supported event types, which can run locally for development and as a cluster in production. This layer is intentionally stateless so that it can scale horizontally to ingest large data volumes. |
| Message Queue - NATS/JetStream | All data ingested via the ingesters first lands in NATS/JetStream, so that it is durably persisted before it is parsed, enriched & exported to other downstream systems. |
| Enrichers | Custom framework to enrich incoming data, with the ability to communicate with external components such as the GitLab API or Data Catalog for metadata. Supported enrichments include pseudonymization or redaction of sensitive parts of ingested data, PII detection, parsing client user-agent strings, etc. |
| Exporters | Custom implementations that ship ingested data into designated persistent stores for further querying/processing. ClickHouse Exporter: ClickHouse is our designated persistent database, which persists all analytical data ingested by the Platform and is queried through the Query API. S3/GCS Exporter: shipping data to S3/GCS helps land it in Snowflake, powering our current analytical query workflows using Snowflake & Tableau. |
| Storage | ClickHouse: External persistent database that allows for durable persistence and advanced OLAP querying capabilities for all analytical data ingested within the Platform. |
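
To make the flow between these components concrete, here is a minimal, self-contained sketch of the ingest, queue, enrich and export stages. It is an illustration only and not the Platform's actual implementation: the in-memory `deque` stands in for NATS/JetStream, the `sink` list stands in for ClickHouse or S3/GCS, and the field names (`user_id`, `ip_address`), the key and all function names are hypothetical. Pseudonymization is shown as a keyed HMAC-SHA256 hash, which is one common approach and an assumption here.

```python
import hashlib
import hmac
import json
from collections import deque

# Hypothetical key for illustration; a real deployment would load it from a secret store.
PSEUDONYMIZATION_KEY = b"example-key-not-for-production"


def ingest(queue, raw_event):
    """Stateless ingester: parse the JSON payload and land it on the queue as-is."""
    event = json.loads(raw_event)  # raises ValueError on malformed input
    queue.append(event)


def pseudonymize(value):
    """Keyed hash: the same input always maps to the same opaque token."""
    digest = hmac.new(PSEUDONYMIZATION_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def enrich(event):
    """Return a copy with user_id pseudonymized and ip_address redacted."""
    enriched = dict(event)
    if "user_id" in enriched:
        enriched["user_id"] = pseudonymize(str(enriched["user_id"]))
    if "ip_address" in enriched:
        enriched["ip_address"] = "[REDACTED]"
    return enriched


def export(sink, event):
    """Exporter: append the enriched event to the destination store."""
    sink.append(event)


def run_pipeline(raw_events):
    """Ingest everything first, then drain the queue through enrich and export."""
    queue = deque()  # stand-in for NATS/JetStream
    sink = []        # stand-in for ClickHouse or S3/GCS
    for raw in raw_events:
        ingest(queue, raw)
    while queue:
        export(sink, enrich(queue.popleft()))
    return sink


if __name__ == "__main__":
    rows = run_pipeline(
        ['{"event": "page_view", "user_id": 42, "ip_address": "203.0.113.7"}']
    )
    print(rows[0]["event"], rows[0]["ip_address"])
```

Because `ingest` holds no state of its own, any number of copies could write to the same queue, which is the property that lets the ingester layer scale horizontally. Landing events on the queue before enrichment means a slow or failing enricher does not block ingestion.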

We have not yet deployed a Data Insights Platform to service our self-managed and dedicated GitLab instances.

To kick off related discussions, please start by filing an issue here with details of your use-case.