
Global Code Search Service

GitLab uses Zoekt, an open-source search engine designed specifically for precise code search. This integration powers GitLab’s “exact code search” feature, which offers significant improvements over the Elasticsearch-based search, including exact match and regular expression modes.

Our Zoekt integration is supported by:

  1. gitlab-zoekt-indexer: a service (written in Go) that manages the underlying Zoekt indexes and provides gRPC (and legacy HTTP) APIs for integrating with GitLab
  2. gitlab-zoekt helm chart to deploy the above Go service

Unlike Elasticsearch, which was not ideally suited for code search, Zoekt provides:

  • Exact match mode: Returns results that precisely match the search query
  • Regular expression mode: Supports regex patterns and boolean expressions
  • Multiple line matches: Shows multiple matching lines from the same file
  • Advanced filters: Language, file path, symbol, etc.

This feature is part of the epic to improve code search capabilities in GitLab.

To get comprehensive information about the current state of the Zoekt system in the production Rails console, use:

Search::RakeTask::Zoekt.info(name: "gitlab:zoekt:info", watch_interval: 60)

The watch_interval parameter refreshes the data every N seconds (in this example, every 60 seconds). If not set, the command will only run once.

This command provides valuable insights into node status, indexing progress, and system health, making it useful for diagnostics and monitoring.

You can also run this as a rake task: rake "gitlab:zoekt:info[60]" (to refresh every 60 seconds) or rake gitlab:zoekt:info (to run it once).

You can prevent GitLab from using the Zoekt integration for searching while leaving the indexing integration itself enabled. To do so, uncheck the Enable searching checkbox in the Exact code search section of the admin settings (Settings -> Search, accessible to admins only). This is useful, for example, during an incident where users are experiencing slow searches or Zoekt is unresponsive.

Enabling/Disabling Zoekt search for specific namespaces


When we roll out Zoekt search for SaaS customers, it is enabled by default. If a customer wishes to have it disabled, run the following chatops command to disable Zoekt search for a specific namespace:

/chatops run feature set --group=root-group-path disable_zoekt_search_for_saas true --production

To re-enable it, run the following chatops command:

/chatops run feature set --group=root-group-path disable_zoekt_search_for_saas false --production

To evict a namespace manually, delete the Search::Zoekt::Replica records associated with the namespace:

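# Look up the root namespace by its full path.
namespace = Namespace.find_by_full_path('gitlab-org')
enabled_namespace = Search::Zoekt::EnabledNamespace.where(root_namespace_id: namespace.id).first
# Deleting the replica records evicts the namespace.
enabled_namespace.replicas.delete_all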

When a Zoekt node PVC is over 80% usage and evicting or removing namespaces doesn’t reduce the usage, you can quickly remove all namespaces from the node by manually marking it as lost. This is a safe operation because the lost node will re-register itself as a new node, and the Zoekt architecture will handle allocating all namespaces and projects.

Warning: The new UUID must not exist in the table.

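node_name = 'gitlab-gitlab-zoekt-29'
# Generate a fresh UUID (it must not already exist in the table).
uuid = SecureRandom.uuid
# Assign the new UUID and backdate last_seen_at to mark the node as lost.
Search::Zoekt::Node.by_name(node_name).update_all(uuid: uuid, last_seen_at: 24.hours.ago)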

If all Zoekt nodes are above 65% disk utilization, increase the number of Zoekt replicas (nodes) by 20% of total capacity, rounding down. For example, with 22 nodes, 20% is 4.4, so add 4 nodes.
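The scaling rule above can be sketched in plain Ruby. This is only an illustration, not GitLab code: the method name nodes_to_add and its parameters are hypothetical, and rounding down is inferred from the 22-node example.

```ruby
# Hypothetical helper: how many Zoekt nodes to add under the 20% rule.
# disk_usages is a list of per-node disk utilization percentages.
def nodes_to_add(disk_usages, threshold: 65, growth_percent: 20)
  return 0 if disk_usages.empty?
  # Scale only when every node is above the utilization threshold.
  return 0 unless disk_usages.all? { |usage| usage > threshold }

  # Integer division rounds down, matching the example (22 nodes -> 4).
  disk_usages.size * growth_percent / 100
end
```

With this rounding, a fleet of fewer than five nodes yields zero, so a small deployment would need a manual decision.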

Zoekt indexing can be paused by checking the Pause indexing checkbox in the Exact code search section of the admin settings (Settings -> Search, accessible to admins only). This is useful, for example, during an incident when a large number of indexing Sidekiq jobs are failing.

Zoekt indexing can be completely disabled by unchecking the Enable indexing checkbox in the Exact code search section of the admin settings (Settings -> Search, accessible to admins only). Pausing indexing is the preferred method to halt Zoekt indexing.

WARNING: Indexed data will be stale after indexing is re-enabled. Reindexing from scratch may be necessary to ensure up-to-date search results.

  1. Multiple shards and replication are not supported yet. You can follow the progress in https://gitlab.com/groups/gitlab-org/-/epics/11382.

A significant improvement in the implementation is the introduction of a unified binary called gitlab-zoekt, which replaces the previously separate binaries (gitlab-zoekt-indexer and gitlab-zoekt-webserver). This unified binary can operate in two distinct modes:

  • Indexer mode: Responsible for indexing repositories
  • Webserver mode: Responsible for serving search requests

Having a unified binary simplifies deployment, operation, and maintenance of the Zoekt infrastructure. The key advantages of this approach include:

  1. Simplified deployment: Only one binary needs to be built, deployed, and maintained
  2. Consistent codebase: Shared code between indexer and webserver is maintained in one place
  3. Operational flexibility: The same binary can run in different modes based on configuration
  4. Testing mode: The unified binary can run both services simultaneously for testing purposes

GitLab uses several database models to manage Zoekt:

  • Search::Zoekt::EnabledNamespace: Tracks which namespaces have Zoekt enabled
  • Search::Zoekt::Node: Represents a Zoekt server node with information about its capacity, address, and online status
  • Search::Zoekt::Replica: Manages replica relationships for high availability
  • Search::Zoekt::Index: Manages the index state for a namespace, including storage allocation and watermark levels
  • Search::Zoekt::Repository: Represents a project repository in Zoekt with indexing state
  • Search::Zoekt::Task: Tracks indexing tasks (index, force_index, delete) that need to be processed by Zoekt nodes

Indexing works as follows:

  1. GitLab detects repository changes and creates zoekt_tasks
  2. Zoekt nodes periodically pull tasks via HTTP requests to GitLab’s Internal API
  3. Zoekt nodes process the tasks (indexing repositories)
  4. Zoekt nodes send callbacks to GitLab to update task status
  5. Appropriate database records are updated (zoekt_task, zoekt_repository, zoekt_index)
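
The pull-based indexing steps above can be modelled with a small in-memory queue. This is a sketch for intuition only, not GitLab's implementation: the TaskQueue class, its method names, and the state names (:pending, :processing, :done, :failed) are all hypothetical.

```ruby
# In-memory sketch of the pull-based indexing flow; not GitLab code.
Task = Struct.new(:id, :project_id, :task_type, :state)

class TaskQueue
  def initialize
    @tasks = []
  end

  # Step 1: a repository change creates a pending task.
  def enqueue(project_id, task_type: :index)
    task = Task.new(@tasks.size + 1, project_id, task_type, :pending)
    @tasks << task
    task
  end

  # Step 2: a node pulls up to `limit` pending tasks.
  def pull(limit: 10)
    batch = @tasks.select { |t| t.state == :pending }.first(limit)
    batch.each { |t| t.state = :processing }
    batch
  end

  # Steps 4-5: the node's callback records the outcome of a task.
  def callback(task_id, success:)
    task = @tasks.find { |t| t.id == task_id }
    return nil unless task # Ignore callbacks for unknown tasks.

    task.state = success ? :done : :failed
    task
  end
end
```

Because nodes pull work rather than having it pushed to them, GitLab never needs to open a connection to a node to schedule indexing.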

Searching works as follows:

  1. User performs a search in GitLab UI
  2. GitLab determines if the search should use Zoekt
  3. If Zoekt is appropriate, GitLab forwards the search to a Zoekt node
  4. Zoekt processes the search and returns results
  5. GitLab formats and presents the results to the user

Node registration works as follows:

  • Nodes register themselves with GitLab through the task retrieval API
  • Each node provides information about its address, name, disk usage, etc.
  • GitLab maintains a registry of nodes with their status and capacity
  • Nodes that don’t check in for a period can be automatically removed

This architecture makes the system self-configuring and facilitates easy scaling.
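
A self-registering node registry along these lines could look like the following sketch. It is not GitLab code: the NodeRegistry class, its fields, and the lost_after_seconds window are assumptions made for illustration.

```ruby
# Sketch of a self-registering node registry; not GitLab code.
class NodeRegistry
  Node = Struct.new(:uuid, :name, :used_bytes, :total_bytes, :last_seen_at)

  def initialize(lost_after_seconds:)
    @lost_after_seconds = lost_after_seconds
    @nodes = {}
  end

  # Called whenever a node asks for tasks: create or refresh its record.
  def check_in(uuid:, name:, used_bytes:, total_bytes:, now: Time.now)
    @nodes[uuid] = Node.new(uuid, name, used_bytes, total_bytes, now)
  end

  # Nodes that have not checked in within the window are considered lost.
  def lost_nodes(now: Time.now)
    @nodes.values.select { |n| now - n.last_seen_at > @lost_after_seconds }
  end

  # Drop lost nodes from the registry and return what was removed.
  def remove_lost(now: Time.now)
    lost = lost_nodes(now: now)
    lost.each { |n| @nodes.delete(n.uuid) }
    lost
  end

  def size
    @nodes.size
  end
end
```

In this model a node that checks in under a UUID the registry has not seen is simply added as a new node, which is why no manual configuration step is needed when scaling out.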

  • Groups/namespaces are assigned to specific Zoekt nodes for indexing and searching
  • GitLab manages the shard assignments internally based on node capacity and load
  • When new nodes are added, they can automatically take on new workloads
  • If nodes go offline, their work can be reassigned to other nodes
  • A primary-replica model is used for high availability
  • Primary nodes handle both indexing and search
  • Replica nodes are used for search only
  • Each replica has its own independent index (no complex index file synchronization)
  • If a primary goes down, a replica can be promoted to primary

Zoekt nodes call this endpoint to get tasks to process:

GET /internal/search/zoekt/:uuid/tasks

This provides node information (UUID, URL, disk space, etc.) and returns tasks that need to be processed.

Zoekt nodes send callbacks to this endpoint after processing tasks:

POST /internal/search/zoekt/:uuid/callback

This updates task status (success/failure) and can include additional information like repository size.
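
As a small illustration, the two internal routes above can be expanded for a given node UUID as follows. The route templates are copied from the text; the helper name internal_api_uri, the example host, and any API prefix you put in the base URL are assumptions.

```ruby
require "uri"

# Route templates copied from the text above.
ROUTES = {
  tasks: "/internal/search/zoekt/:uuid/tasks",
  callback: "/internal/search/zoekt/:uuid/callback"
}.freeze

# Hypothetical helper: build the full URI for a route and node UUID.
# base_url is a placeholder; include whatever API prefix your instance uses.
def internal_api_uri(base_url, route, uuid)
  template = ROUTES.fetch(route) # Raises KeyError for unknown routes.
  path = template.sub(":uuid", URI.encode_www_form_component(uuid))
  URI.parse(base_url.chomp("/") + path)
end
```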

GitLab calls this endpoint on Zoekt to execute searches:

GET /api/search

This includes query parameters and filters and returns search results to be displayed to the user.

  • GitLab provides a Helm chart (gitlab-zoekt) for Kubernetes deployments
  • The chart deploys Zoekt in a StatefulSet with a persistent volume for index storage
  • The chart includes configurations for resource allocation, scaling, and networking
  • A gateway component (NGINX) is deployed for load balancing
  • Containers are built from the CNG repository
  • The Dockerfile builds on top of gitlab-base and includes:
    • The gitlab-zoekt unified binary
    • Universal ctags for symbol extraction
    • Scripts for process management and healthchecks
  • The container can be configured via environment variables to run in either indexer or webserver mode

In the worst case, a Zoekt index takes about 2.8 times the size of the source code in the indexed branch (excluding binary files). In practice we don’t observe that; the ratio is usually about 0.4.
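
For capacity planning, the two ratios above translate into a simple estimate. The helper below is a hypothetical sketch; only the 2.8 and 0.4 ratios come from the text, and exact rationals are used to avoid floating-point rounding.

```ruby
# Ratios quoted in the text: worst case 2.8x, typically about 0.4x.
WORST_CASE_RATIO = Rational(14, 5) # 2.8
TYPICAL_RATIO = Rational(2, 5)     # 0.4

# Hypothetical helper: estimate index size in bytes from the size of the
# non-binary source code in the indexed branch.
def estimated_index_bytes(source_bytes, ratio: TYPICAL_RATIO)
  (source_bytes * ratio).ceil
end

gib = 1024**3
typical = estimated_index_bytes(10 * gib)                        # about 4 GiB
worst = estimated_index_bytes(10 * gib, ratio: WORST_CASE_RATIO) # about 28 GiB
```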

The Zoekt integration includes a sophisticated watermark management system to ensure efficient use of storage:

  1. Low Watermark (60-70%): Triggers rebalancing to avoid reaching higher levels
  2. High Watermark (70-75%): Signals potential storage pressure and prioritizes rebalancing
  3. Critical Watermark (85%+): May pause indexing to prevent node overload while performing evictions

This system ensures that storage is used efficiently while preventing nodes from running out of space.
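
The watermark bands can be expressed as a small classifier. This is a sketch under stated assumptions, not the actual implementation: the method name and the :healthy label are hypothetical, the listed ranges leave 75-85% unassigned (treated here as high), and exactly 70% is treated as high rather than low.

```ruby
# Hypothetical classifier for the watermark bands listed above.
def watermark_level(used_bytes, total_bytes)
  raise ArgumentError, "total_bytes must be positive" unless total_bytes.positive?

  percent = used_bytes * 100.0 / total_bytes
  if percent >= 85
    :critical # May pause indexing while evictions run.
  elsif percent >= 70
    :high     # Assumption: the unassigned 75-85% range counts as high.
  elsif percent >= 60
    :low      # Triggers rebalancing.
  else
    :healthy  # Below every watermark.
  end
end
```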

There are a few dashboards to monitor Zoekt health:

The GitLab application has a dedicated zoekt.log file for Zoekt-related log entries, which is handled by the standard logging infrastructure. You may also find indexing-related errors in sidekiq.log and search-related errors in production_json.log.

The gitlab-zoekt binary (in both indexer and webserver modes) writes logs to stdout.

The Zoekt architecture has logic that detects when a node’s disk usage is over the limit. Projects are removed from the node until its disk usage is under the limit. If the disk usage is not coming down quickly enough, follow these steps in order:

  1. Remove namespaces manually
  2. As a last resort, mark the node as lost

WARNING: The PVC disk size must not be increased manually. Zoekt nodes are sized with a specific PVC size, and it must remain consistent across all nodes.