Global Code Search Service

Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22zoekt%22%2C%20tier%3D%22inf%22%7D
Label: gitlab-com/gl-infra/production~"Service::Zoekt"

Logging

Summary

Quick start

GitLab uses Zoekt, an open-source search engine specifically designed for precise code search. This integration powers GitLab’s “exact code search” feature, which offers significant improvements over the Elasticsearch-based search, including exact match and regular expression modes.

Our Zoekt integration is supported by:

gitlab-zoekt-indexer: a service (written in Go) that manages the underlying Zoekt indexes and provides gRPC (and legacy HTTP) APIs for integrating with GitLab
gitlab-zoekt Helm chart: deploys the above Go service

Unlike Elasticsearch, which was not ideally suited for code search, Zoekt provides:

Exact match mode: Returns results that precisely match the search query
Regular expression mode: Supports regex patterns and boolean expressions
Multiple line matches: Shows multiple matching lines from the same file
Advanced filters: Language, file path, symbol, etc.

This feature is part of the epic to improve code search capabilities in GitLab.

How-to guides

Monitoring Zoekt system state

To get comprehensive information about the current state of the Zoekt system in the production Rails console, use:

Search::RakeTask::Zoekt.info(name: "gitlab:zoekt:info", watch_interval: 60)

The watch_interval parameter refreshes the data every N seconds (in this example, every 60 seconds). If not set, the command will only run once.

This command provides valuable insights into node status, indexing progress, and system health, making it useful for diagnostics and monitoring.

You can also run this command as a Rake task: rake "gitlab:zoekt:info[60]", or rake gitlab:zoekt:info to run it once.

Enabling/Disabling Zoekt search

To stop GitLab from using Zoekt for searching while leaving the indexing integration enabled, uncheck the Enable searching checkbox in the Exact code search section of the admin settings (Settings->Search, accessible to admins only). This is useful, for example, during an incident where users are experiencing slow searches or Zoekt is unresponsive.

Enabling/Disabling Zoekt search for specific namespaces

When we roll out Zoekt search for SaaS customers, it is enabled by default. If a customer wishes to have it disabled, we can run the following chatops command to disable Zoekt search for a specific namespace:

  /chatops run feature set --group=root-group-path disable_zoekt_search_for_saas true --production

To re-enable it, run the following chatops command:

  /chatops run feature set --group=root-group-path disable_zoekt_search_for_saas false --production

Evicting namespaces from a Zoekt node

To evict a namespace manually, delete the Search::Zoekt::Replica records associated with the namespace:

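# Look up the root namespace and its Zoekt enablement record
namespace = Namespace.find_by_full_path('gitlab-org')
enabled_namespace = Search::Zoekt::EnabledNamespace.where(root_namespace_id: namespace.id).first
# Delete all replicas for the namespace, which evicts it from its Zoekt nodes
enabled_namespace.replicas.delete_all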

Marking a Zoekt node as lost

When a Zoekt node’s PVC is over 80% usage and evicting or removing namespaces doesn’t reduce the usage, you can quickly remove all namespaces from the node by manually marking it as lost. This is a safe operation because the lost node will re-register itself as a new node and the Zoekt architecture will handle allocating all namespaces and projects.

Warning: The new UUID must not exist in the table.

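node_name = 'gitlab-gitlab-zoekt-29'
# Generate a new UUID; it must not already exist in the nodes table
uuid = SecureRandom.uuid

# Assign the new UUID and backdate last_seen_at so the node is treated as lost
Search::Zoekt::Node.by_name(node_name).update_all(uuid: uuid, last_seen_at: 24.hours.ago)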
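
The warning above requires a UUID that no existing node already uses. A minimal plain-Ruby sketch of that check follows, assuming the UUIDs already in use have been loaded into an array. The names `fresh_uuid` and `existing_uuids` are illustrative and not part of GitLab.

```ruby
require 'securerandom'
require 'set'

# Returns a random UUID that is not present in existing_uuids.
# existing_uuids is a hypothetical stand-in for the UUIDs already stored
# for Zoekt nodes.
def fresh_uuid(existing_uuids)
  taken = existing_uuids.to_set
  loop do
    candidate = SecureRandom.uuid
    return candidate unless taken.include?(candidate)
  end
end

existing_uuids = [SecureRandom.uuid, SecureRandom.uuid]
uuid = fresh_uuid(existing_uuids)
```

A collision between random UUIDs is extremely unlikely, so the loop almost always returns on its first pass. The check mainly guards against reusing a value copied from elsewhere.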

When to add a Zoekt node

Increase the number of Zoekt replicas (nodes) by 20% of total capacity when all Zoekt nodes are above 65% disk utilization. For example, with 22 nodes, 20% is 4.4, so add 4 nodes.
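
The rule above can be sketched as a small calculation. This is an illustration only: the function name is hypothetical, and rounding down is an assumption inferred from the 22-node example (4.4 becomes 4).

```ruby
# Threshold and growth percentage taken from the guidance above.
UTILIZATION_THRESHOLD = 0.65
GROWTH_PERCENT = 20

# Returns how many nodes to add, or 0 unless every node is above the threshold.
# node_utilizations holds one disk utilization ratio (0.0..1.0) per node.
# Integer division rounds down, which is an assumption based on the example.
def nodes_to_add(node_utilizations)
  return 0 if node_utilizations.empty?
  return 0 unless node_utilizations.all? { |u| u > UTILIZATION_THRESHOLD }

  (node_utilizations.size * GROWTH_PERCENT) / 100
end

nodes_to_add(Array.new(22, 0.70)) # => 4
```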

Pausing Zoekt indexing

Zoekt indexing can be paused by checking the Pause indexing checkbox in the Exact code search section of the admin settings (Settings->Search, accessible to admins only). This is useful, for example, during an incident in which a large number of indexing Sidekiq jobs are failing.

Disabling Zoekt indexing

Zoekt indexing can be completely disabled by unchecking the Enable indexing checkbox in the Exact code search section of the admin settings (Settings->Search, accessible to admins only). Pausing indexing is the preferred method to halt Zoekt indexing.

WARNING: Indexed data will be stale after indexing is re-enabled. Reindexing from scratch may be necessary to ensure up-to-date search results.

Limitations

Multiple shards and replication are not supported yet. You can follow the progress in https://gitlab.com/groups/gitlab-org/-/epics/11382.

Architecture

Design document: https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/code_search_with_zoekt/

Key Components

Unified Binary: `gitlab-zoekt`

A significant improvement in the implementation is the introduction of a unified binary called gitlab-zoekt, which replaces the previously separate binaries (gitlab-zoekt-indexer and gitlab-zoekt-webserver). This unified binary can operate in two distinct modes:

Indexer mode: Responsible for indexing repositories
Webserver mode: Responsible for serving search requests

Having a unified binary simplifies deployment, operation, and maintenance of the Zoekt infrastructure. The key advantages of this approach include:

Simplified deployment: Only one binary needs to be built, deployed, and maintained
Consistent codebase: Shared code between indexer and webserver is maintained in one place
Operational flexibility: The same binary can run in different modes based on configuration
Testing mode: The unified binary can run both services simultaneously for testing purposes

Database Models

GitLab uses several database models to manage Zoekt:

Search::Zoekt::EnabledNamespace: Tracks which namespaces have Zoekt enabled
Search::Zoekt::Node: Represents a Zoekt server node with information about its capacity, address, and online status
Search::Zoekt::Replica: Manages replica relationships for high availability
Search::Zoekt::Index: Manages the index state for a namespace, including storage allocation and watermark levels
Search::Zoekt::Repository: Represents a project repository in Zoekt with indexing state
Search::Zoekt::Task: Tracks indexing tasks (index, force_index, delete) that need to be processed by Zoekt nodes

Communication Flow

Indexing Flow

GitLab detects repository changes and creates zoekt_tasks
Zoekt nodes periodically pull tasks via HTTP requests to GitLab’s Internal API
Zoekt nodes process the tasks (indexing repositories)
Zoekt nodes send callbacks to GitLab to update task status
Appropriate database records are updated (zoekt_task, zoekt_repository, zoekt_index)
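
The indexing flow above can be sketched as a small in-memory simulation. Everything here is hypothetical: the `Task` struct, the `TaskQueue` class, and the state names stand in for the real zoekt_tasks records and the HTTP calls between a node and GitLab.

```ruby
# Hypothetical in-memory stand-in for a zoekt_tasks record.
Task = Struct.new(:id, :project_id, :type, :state)

class TaskQueue
  def initialize
    @tasks = []
    @next_id = 1
  end

  # Step 1: GitLab records a task when a repository changes.
  def enqueue(project_id, type: :index)
    task = Task.new(@next_id, project_id, type, :pending)
    @next_id += 1
    @tasks << task
    task
  end

  # Step 2: a node pulls pending tasks (an HTTP request in the real system).
  def pull(limit: 10)
    batch = @tasks.select { |t| t.state == :pending }.first(limit)
    batch.each { |t| t.state = :processing }
    batch
  end

  # Steps 4-5: the node's callback updates the task record.
  def callback(task_id, success:)
    task = @tasks.find { |t| t.id == task_id }
    raise ArgumentError, "unknown task #{task_id}" unless task

    task.state = success ? :done : :failed
    task
  end

  def states
    @tasks.map(&:state)
  end
end

queue = TaskQueue.new
queue.enqueue(101)
queue.enqueue(102)

# Step 3: the node "processes" each task; project 102 is made to fail here.
queue.pull.each do |task|
  queue.callback(task.id, success: task.project_id != 102)
end
```

Because nodes pull work rather than having it pushed to them, GitLab does not need to know a node's address in order to hand it tasks.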

Search Flow

User performs a search in GitLab UI
GitLab determines if the search should use Zoekt
If Zoekt is appropriate, GitLab forwards the search to a Zoekt node
Zoekt processes the search and returns results
GitLab formats and presents the results to the user

Scaling and High Availability

Self-Registering Node Architecture

Nodes register themselves with GitLab through the task retrieval API
Each node provides information about its address, name, disk usage, etc.
GitLab maintains a registry of nodes with their status and capacity
Nodes that don’t check in for a period can be automatically removed

This architecture makes the system self-configuring and facilitates easy scaling.
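
A minimal sketch of this self-registration idea, assuming an in-memory registry keyed by node UUID. The class, its methods, and the 12-hour timeout are illustrative choices, not GitLab's actual implementation or configuration.

```ruby
# Hypothetical registry; the real system stores nodes in the database.
class NodeRegistry
  Node = Struct.new(:uuid, :name, :last_seen_at)

  def initialize
    @nodes = {}
  end

  # Called whenever a node requests tasks: registers it or refreshes last_seen_at.
  def check_in(uuid, name:, now: Time.now)
    node = (@nodes[uuid] ||= Node.new(uuid, name, now))
    node.last_seen_at = now
    node
  end

  # Removes nodes that have not checked in within `timeout` seconds and returns them.
  def remove_stale(timeout:, now: Time.now)
    stale, fresh = @nodes.values.partition { |n| now - n.last_seen_at > timeout }
    @nodes = fresh.to_h { |n| [n.uuid, n] }
    stale
  end

  def names
    @nodes.values.map(&:name)
  end
end

now = Time.now
registry = NodeRegistry.new
registry.check_in('uuid-a', name: 'zoekt-0', now: now - 24 * 3600) # silent for 24 hours
registry.check_in('uuid-b', name: 'zoekt-1', now: now)             # just checked in
removed = registry.remove_stale(timeout: 12 * 3600, now: now)
```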

Sharding Strategy

Groups/namespaces are assigned to specific Zoekt nodes for indexing and searching
GitLab manages the shard assignments internally based on node capacity and load
When new nodes are added, they can automatically take on new workloads
If nodes go offline, their work can be reassigned to other nodes
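
As an illustration of capacity-based assignment, the sketch below picks the node with the most free space that can still hold a namespace. The selection rule, the `ZoektNode` struct, and `pick_node` are assumptions made for this example; the actual assignment logic in GitLab may weigh other factors.

```ruby
# Hypothetical view of a node's storage, in bytes.
ZoektNode = Struct.new(:name, :total_bytes, :used_bytes) do
  def free_bytes
    total_bytes - used_bytes
  end
end

# Picks the node with the most free space that can still hold the namespace.
# Returns nil when no node has enough room.
def pick_node(nodes, required_bytes)
  nodes.select { |n| n.free_bytes >= required_bytes }.max_by(&:free_bytes)
end

nodes = [
  ZoektNode.new('zoekt-0', 1_000, 900),
  ZoektNode.new('zoekt-1', 1_000, 400),
  ZoektNode.new('zoekt-2', 1_000, 650)
]
chosen = pick_node(nodes, 300)
```

With these sample numbers, `zoekt-1` has the most free space (600 bytes) and is selected. A request for 700 bytes fits nowhere and returns nil.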

Replication Strategy

A primary-replica model is used for high availability
Primary nodes handle both indexing and search
Replica nodes are used for search only
Each replica has its own independent index (no complex index file synchronization)
If a primary goes down, a replica can be promoted to primary

Zoekt API

Task Retrieval API

Zoekt nodes call this endpoint to get tasks to process:

GET /internal/search/zoekt/:uuid/tasks

This provides node information (UUID, URL, disk space, etc.) and returns tasks that need to be processed.

Callback API

Zoekt nodes send callbacks to this endpoint after processing tasks:

POST /internal/search/zoekt/:uuid/callback

This updates task status (success/failure) and can include additional information like repository size.
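
For illustration, the two node-facing endpoints above can be assembled from a base URL and a node UUID. The host below is a placeholder, and the helper name is hypothetical. Any API prefix used in a real deployment is not stated here, so none is assumed.

```ruby
require 'uri'

# Builds the URIs a node would call, using the paths documented above.
# base_url is a placeholder; the real GitLab internal API host is not given here.
def zoekt_node_endpoints(base_url, node_uuid)
  {
    tasks: URI.join(base_url, "/internal/search/zoekt/#{node_uuid}/tasks"),
    callback: URI.join(base_url, "/internal/search/zoekt/#{node_uuid}/callback")
  }
end

endpoints = zoekt_node_endpoints(
  'https://gitlab.example.com',
  '123e4567-e89b-12d3-a456-426614174000'
)
```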

Search API

GitLab calls this endpoint on Zoekt to execute searches:

GET /api/search

This includes query parameters and filters and returns search results to be displayed to the user.

Deployment

Kubernetes/Helm

GitLab provides a Helm chart (gitlab-zoekt) for Kubernetes deployments
The chart deploys Zoekt in a StatefulSet with a persistent volume for index storage
The chart includes configurations for resource allocation, scaling, and networking
A gateway component (NGINX) is deployed for load balancing

Docker/Container

Containers are built from the CNG repository
The Dockerfile builds on top of gitlab-base and includes:
- The gitlab-zoekt unified binary
- Universal ctags for symbol extraction
- Scripts for process management and healthchecks
The container can be configured via environment variables to run in either indexer or webserver mode

Scalability

How much Zoekt storage do we need

In the worst case, a Zoekt index takes about 2.8 times the size of the source code in the indexed branch (excluding binary files). We don’t observe that in practice; the ratio is usually about 0.4.
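
As a rough worked example, the two ratios above can be turned into a size estimate. The helper name is hypothetical, and the result is only as accurate as the ratios themselves.

```ruby
# Ratios from the text above, kept as rationals to avoid floating-point drift.
TYPICAL_RATIO = Rational(2, 5)     # 0.4
WORST_CASE_RATIO = Rational(14, 5) # 2.8

# Estimates index size in bytes for a given amount of indexable source code
# (binary files excluded), rounding up to whole bytes.
def estimated_index_bytes(source_bytes)
  {
    typical: (source_bytes * TYPICAL_RATIO).ceil,
    worst_case: (source_bytes * WORST_CASE_RATIO).ceil
  }
end

gib = 1024**3
estimate = estimated_index_bytes(10 * gib) # 10 GiB of source code
# estimate[:typical]    is 4 GiB
# estimate[:worst_case] is 28 GiB
```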

Watermark Management

The Zoekt integration includes a sophisticated watermark management system to ensure efficient use of storage:

Low Watermark (60-70%): Triggers rebalancing to avoid reaching higher levels
High Watermark (70-75%): Signals potential storage pressure and prioritizes rebalancing
Critical Watermark (85%+): May pause indexing to prevent node overload while performing evictions

This system ensures that storage is used efficiently while preventing nodes from running out of space.
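
The watermark levels can be sketched as a simple classifier. Two details are assumptions: the listed ranges do not name the 75-85% band, which is treated as high here, and a value on the 70% boundary is also treated as high. The function name and the `:normal` level are illustrative.

```ruby
# Classifies a node's disk usage percentage using the ranges listed above.
# Assumption: the unnamed 75-85% band and the 70% boundary both count as :high.
def watermark_level(usage_percent)
  if usage_percent >= 85
    :critical
  elsif usage_percent >= 70
    :high
  elsif usage_percent >= 60
    :low
  else
    :normal
  end
end

watermark_level(65) # => :low
watermark_level(72) # => :high
watermark_level(90) # => :critical
```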

Monitoring

Dashboards

There are a few dashboards to monitor Zoekt health:

Zoekt Health Dashboard: Monitor search and indexing operations
Zoekt memory usage: View memory utilization for Zoekt containers
Zoekt OOM errors: View any Out Of Memory exceptions for Zoekt containers
Zoekt pvc usage: View PVC volume capacity for Zoekt nodes
Zoekt indexing locks in progress: View number of indexing locks (locks are per project)
Zoekt Info Dashboard

Kibana logs

The GitLab application has a dedicated zoekt.log file for Zoekt-related log entries. This file is handled by the standard logging infrastructure. You may also find indexing-related errors in sidekiq.log and search-related errors in production_json.log.

The gitlab-zoekt binary (in both indexer and webserver modes) writes logs to stdout.

Alerts

`kube_persistent_volume_claim_disk_space`

The Zoekt architecture has logic that detects when a node’s disk usage is over the limit. Projects will be removed from each such node until its disk usage is under the limit. If the disk usage is not coming down quickly enough, follow these steps in order:

Remove namespaces manually
As a last resort, mark the node as lost

WARNING: The PVC disk size must not be increased manually. Zoekt nodes are sized with a specific PVC size and it must remain consistent across all nodes.