Global Code Search Service
- Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22zoekt%22%2C%20tier%3D%22inf%22%7D
- Label: gitlab-com/gl-infra/production~“Service::Zoekt”
Logging
Summary
Quick start
GitLab uses Zoekt, an open-source search engine specifically designed for precise code search. This integration powers GitLab’s “exact code search” feature which offers significant improvements over the Elasticsearch-based search, including exact match and regular expression modes.
Our Zoekt integration is supported by:
- gitlab-zoekt-indexer: a service (written in Go) which manages the underlying Zoekt indexes and provides gRPC (and legacy HTTP) APIs for integrating with GitLab
- gitlab-zoekt: Helm chart to deploy the above Go service
Unlike Elasticsearch, which was not ideally suited for code search, Zoekt provides:
- Exact match mode: Returns results that precisely match the search query
- Regular expression mode: Supports regex patterns and boolean expressions
- Multiple line matches: Shows multiple matching lines from the same file
- Advanced filters: Language, file path, symbol, etc.
This feature is part of the epic to improve code search capabilities in GitLab.
How-to guides
Monitoring Zoekt system state
To get comprehensive information about the current state of the Zoekt system in the production Rails console, use:
Search::RakeTask::Zoekt.info(name: "gitlab:zoekt:info", watch_interval: 60)
The watch_interval parameter refreshes the data every N seconds (in this example, every 60 seconds). If not set, the command will only run once.
This command provides valuable insights into node status, indexing progress, and system health, making it useful for diagnostics and monitoring.
You can also run this command as a rake task: rake "gitlab:zoekt:info[60]", or rake gitlab:zoekt:info to run it once.
Enabling/Disabling Zoekt search
You can prevent GitLab from using the Zoekt integration for searching, while leaving the indexing integration itself enabled, by unchecking the Enable searching checkbox under the Exact code search section of the admin settings (Settings->Search, accessible to admins only).
An example of when this is useful is during an incident where users are experiencing slow searches or Zoekt is unresponsive.
Enabling/Disabling Zoekt search for specific namespaces
When we roll out Zoekt search for SaaS customers, it is enabled by default. If a customer wishes to have it disabled, we can run the following chatops command to disable Zoekt search for a specific namespace:
/chatops run feature set --group=root-group-path disable_zoekt_search_for_saas true --production
To re-enable it, run the following chatops command:
/chatops run feature set --group=root-group-path disable_zoekt_search_for_saas false --production
Evicting namespaces from a Zoekt node
To evict a namespace manually, delete the Search::Zoekt::Replica records associated with the namespace:
namespace = Namespace.find_by_full_path('gitlab-org')
enabled_namespace = Search::Zoekt::EnabledNamespace.where(root_namespace_id: namespace.id).first
enabled_namespace.replicas.delete_all
Marking a Zoekt node as lost
When a Zoekt node PVC is over 80% usage and evicting or removing namespaces doesn’t reduce the usage, you can quickly remove all namespaces from the node by manually marking it as lost. This is a safe operation because the lost node will re-register itself as a new node and the Zoekt architecture will handle allocating all namespaces and projects.
Warning: The new UUID must not exist in the table.
node_name = 'gitlab-gitlab-zoekt-29'
uuid = SecureRandom.uuid
Search::Zoekt::Node.by_name(node_name).update_all(uuid: uuid, last_seen_at: 24.hours.ago)
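One way to honor the warning above is to check the candidate UUID against the UUIDs already in use before assigning it. The following is a minimal plain-Ruby sketch, assuming the existing UUIDs have already been collected (for example, from the nodes table in the Rails console). The names fresh_uuid and existing_uuids are hypothetical and not part of GitLab.

```ruby
require 'securerandom'
require 'set'

# Hypothetical helper: returns a random UUID that is not in `existing_uuids`.
# In a Rails console the existing UUIDs would come from the Zoekt nodes table;
# here they are a plain collection so the sketch is self-contained.
def fresh_uuid(existing_uuids)
  taken = existing_uuids.to_set
  loop do
    candidate = SecureRandom.uuid
    return candidate unless taken.include?(candidate)
  end
end

# Stand-in data for the UUIDs that are already assigned to nodes.
existing_uuids = [SecureRandom.uuid, SecureRandom.uuid]
uuid = fresh_uuid(existing_uuids)
```

A collision between random UUIDs is extremely unlikely, so the loop will almost always return on the first iteration. The check simply makes the requirement explicit.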
When to add a Zoekt node
Increase the number of Zoekt replicas (nodes) by 20% of total capacity if all Zoekt nodes are above 65% disk utilization. For example, with 22 nodes, 20% is 4.4, so add 4 nodes.
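The scaling rule can be expressed as a small calculation. This sketch assumes the result is rounded down, which matches the 22-node example but is otherwise an assumption. The helper names scale_up? and nodes_to_add are hypothetical.

```ruby
# Hypothetical helper: true when every node is above the utilization threshold.
# An empty fleet never triggers a scale-up.
def scale_up?(disk_utilization_percents, threshold_percent: 65)
  !disk_utilization_percents.empty? &&
    disk_utilization_percents.all? { |used| used > threshold_percent }
end

# Hypothetical helper: number of nodes to add, as a percentage of the current
# node count. Integer division rounds down (22 nodes at 20% gives 4).
def nodes_to_add(current_nodes, growth_percent: 20)
  (current_nodes * growth_percent) / 100
end

utilization = [70, 66, 90] # percent of disk used per node (example values)
puts nodes_to_add(22) if scale_up?(utilization)
```

Note that with rounding down, a fleet of fewer than five nodes yields zero, so a minimum of one node may be a sensible floor in practice.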
Pausing Zoekt indexing
Zoekt indexing can be paused by checking the Pause indexing checkbox under the Exact code search section of the admin settings (Settings->Search, accessible to admins only). An example of when this is useful is during an incident in which a large number of indexing Sidekiq jobs are failing.
Disabling Zoekt indexing
Zoekt indexing can be completely disabled by unchecking the Enable indexing checkbox under the Exact code search section of the admin settings (Settings->Search, accessible to admins only). Pausing indexing is the preferred method to halt Zoekt indexing.
WARNING: Indexed data will be stale after indexing is re-enabled. Reindexing from scratch may be necessary to ensure up-to-date search results.
Limitations
- Multiple shards and replication are not supported yet. You can follow the progress in https://gitlab.com/groups/gitlab-org/-/epics/11382.
Architecture
- Design document: https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/code_search_with_zoekt/
Key Components
Unified Binary: gitlab-zoekt
A significant improvement in the implementation is the introduction of a unified binary called gitlab-zoekt, which replaces the previously separate binaries (gitlab-zoekt-indexer and gitlab-zoekt-webserver). This unified binary can operate in two distinct modes:
- Indexer mode: Responsible for indexing repositories
- Webserver mode: Responsible for serving search requests
Having a unified binary simplifies deployment, operation, and maintenance of the Zoekt infrastructure. The key advantages of this approach include:
- Simplified deployment: Only one binary needs to be built, deployed, and maintained
- Consistent codebase: Shared code between indexer and webserver is maintained in one place
- Operational flexibility: The same binary can run in different modes based on configuration
- Testing mode: The unified binary can run both services simultaneously for testing purposes
Database Models
GitLab uses several database models to manage Zoekt:
- Search::Zoekt::EnabledNamespace: Tracks which namespaces have Zoekt enabled
- Search::Zoekt::Node: Represents a Zoekt server node with information about its capacity, address, and online status
- Search::Zoekt::Replica: Manages replica relationships for high availability
- Search::Zoekt::Index: Manages the index state for a namespace, including storage allocation and watermark levels
- Search::Zoekt::Repository: Represents a project repository in Zoekt with indexing state
- Search::Zoekt::Task: Tracks indexing tasks (index, force_index, delete) that need to be processed by Zoekt nodes
Communication Flow
Indexing Flow
- GitLab detects repository changes and creates zoekt_tasks
- Zoekt nodes periodically pull tasks via HTTP requests to GitLab’s Internal API
- Zoekt nodes process the tasks (indexing repositories)
- Zoekt nodes send callbacks to GitLab to update task status
- Appropriate database records are updated (zoekt_task, zoekt_repository, zoekt_index)
Search Flow
- User performs a search in GitLab UI
- GitLab determines if the search should use Zoekt
- If Zoekt is appropriate, GitLab forwards the search to a Zoekt node
- Zoekt processes the search and returns results
- GitLab formats and presents the results to the user
Scaling and High Availability
Self-Registering Node Architecture
- Nodes register themselves with GitLab through the task retrieval API
- Each node provides information about its address, name, disk usage, etc.
- GitLab maintains a registry of nodes with their status and capacity
- Nodes that don’t check in for a period can be automatically removed
This architecture makes the system self-configuring and facilitates easy scaling.
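The check-in rule in the list above can be illustrated with a short sketch that finds nodes whose last check-in is older than a cutoff. This is a simplified plain-Ruby illustration, not GitLab's implementation. The RegisteredNode struct, the stale_nodes helper, and the 12-hour cutoff are all assumptions made for the example.

```ruby
# Hypothetical in-memory stand-in for a registered node record.
RegisteredNode = Struct.new(:name, :last_seen_at)

# Hypothetical helper: nodes whose last check-in is older than `max_age` seconds.
def stale_nodes(nodes, now:, max_age:)
  nodes.select { |node| now - node.last_seen_at > max_age }
end

now = Time.utc(2024, 1, 1, 12, 0, 0)
nodes = [
  RegisteredNode.new('zoekt-0', now - 30),           # checked in 30 seconds ago
  RegisteredNode.new('zoekt-1', now - 24 * 60 * 60)  # last seen 24 hours ago
]

# The 12-hour cutoff is an assumed value chosen only for illustration.
stale = stale_nodes(nodes, now: now, max_age: 12 * 60 * 60)
```

Here only zoekt-1 is selected, because its last check-in is older than the assumed cutoff.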
Sharding Strategy
- Groups/namespaces are assigned to specific Zoekt nodes for indexing and searching
- GitLab manages the shard assignments internally based on node capacity and load
- When new nodes are added, they can automatically take on new workloads
- If nodes go offline, their work can be reassigned to other nodes
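To make the idea of capacity-based assignment concrete, the sketch below picks the node with the most free space that can still fit a namespace. It is a deliberately simplified illustration and not GitLab's actual allocation logic. The CandidateNode struct and pick_node helper are hypothetical.

```ruby
# Hypothetical in-memory stand-in for a node and its storage figures.
CandidateNode = Struct.new(:name, :total_bytes, :used_bytes) do
  def free_bytes
    total_bytes - used_bytes
  end
end

# Hypothetical helper: the node with the most free space that can hold
# `required_bytes`, or nil when no node has enough room.
def pick_node(nodes, required_bytes)
  nodes.select { |node| node.free_bytes >= required_bytes }
       .max_by(&:free_bytes)
end

gib = 1024**3
nodes = [
  CandidateNode.new('zoekt-0', 100 * gib, 90 * gib), # 10 GiB free
  CandidateNode.new('zoekt-1', 100 * gib, 40 * gib)  # 60 GiB free
]

chosen = pick_node(nodes, 20 * gib)
```

When a new, empty node registers, it has the most free space and is therefore preferred by this rule, which mirrors the point that new nodes can take on new workloads.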
Replication Strategy
- A primary-replica model is used for high availability
- Primary nodes handle both indexing and search
- Replica nodes are used for search only
- Each replica has its own independent index (no complex index file synchronization)
- If a primary goes down, a replica can be promoted to primary
Zoekt API
Task Retrieval API
Zoekt nodes call this endpoint to get tasks to process:
GET /internal/search/zoekt/:uuid/tasks
This provides node information (UUID, URL, disk space, etc.) and returns tasks that need to be processed.
Callback API
Zoekt nodes send callbacks to this endpoint after processing tasks:
POST /internal/search/zoekt/:uuid/callback
This updates task status (success/failure) and can include additional information like repository size.
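Both node-facing endpoints share the same :uuid path segment, which identifies the calling node. The sketch below only shows how the two path templates expand for a given node UUID. The expand_path helper and constant names are hypothetical, and no request is sent.

```ruby
require 'securerandom'

# Path templates as listed in this section; :uuid is the node's UUID.
TASKS_PATH    = '/internal/search/zoekt/:uuid/tasks'
CALLBACK_PATH = '/internal/search/zoekt/:uuid/callback'

# Hypothetical helper: substitutes the node UUID into a path template.
def expand_path(template, uuid)
  template.sub(':uuid', uuid)
end

node_uuid = SecureRandom.uuid # stand-in for a real node UUID
tasks_path = expand_path(TASKS_PATH, node_uuid)       # used with GET
callback_path = expand_path(CALLBACK_PATH, node_uuid) # used with POST
```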
Search API
GitLab calls this endpoint on Zoekt to execute searches:
GET /api/search
This includes query parameters and filters and returns search results to be displayed to the user.
Deployment
Kubernetes/Helm
- GitLab provides a Helm chart (gitlab-zoekt) for Kubernetes deployments
- The chart deploys Zoekt in a StatefulSet with a persistent volume for index storage
- The chart includes configurations for resource allocation, scaling, and networking
- A gateway component (NGINX) is deployed for load balancing
Docker/Container
- Containers are built from the CNG repository
- The Dockerfile builds on top of gitlab-base and includes:
  - The gitlab-zoekt unified binary
  - Universal ctags for symbol extraction
  - Scripts for process management and healthchecks
- The container can be configured via environment variables to run in either indexer or webserver mode
Scalability
How much Zoekt storage do we need
In the worst case, a Zoekt index takes about 2.8 times the size of the source code in the indexed branch (excluding binary files). We don’t observe that in practice; the ratio is usually about 0.4.
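These two ratios give a quick range for sizing. The sketch below applies them to an example source size. The helper name estimated_index_bytes and the 10 GiB input are assumptions made for illustration. Rational is used so the arithmetic is exact.

```ruby
# Index-to-source size ratios quoted above, expressed as exact rationals.
WORST_CASE_RATIO = Rational(28, 10) # 2.8
TYPICAL_RATIO    = Rational(4, 10)  # 0.4

# Hypothetical helper: estimated index size in bytes, rounded up.
def estimated_index_bytes(source_bytes, ratio)
  (source_bytes * ratio).ceil
end

gib = 1024**3
source_bytes = 10 * gib # example: 10 GiB of non-binary source in the indexed branch

typical_bytes = estimated_index_bytes(source_bytes, TYPICAL_RATIO)
worst_bytes   = estimated_index_bytes(source_bytes, WORST_CASE_RATIO)

puts "typical: #{typical_bytes / gib} GiB, worst case: #{worst_bytes / gib} GiB"
```

For 10 GiB of source this gives roughly 4 GiB in the typical case and 28 GiB in the worst case.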
Watermark Management
The Zoekt integration includes a sophisticated watermark management system to ensure efficient use of storage:
- Low Watermark (60-70%): Triggers rebalancing to avoid reaching higher levels
- High Watermark (70-75%): Signals potential storage pressure and prioritizes rebalancing
- Critical Watermark (85%+): May pause indexing to prevent node overload while performing evictions
This system ensures that storage is used efficiently while preventing nodes from running out of space.
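The watermark levels can be read as a simple classification of a node's disk usage. The sketch below treats the lower bound of each listed range as the trigger point, which is an assumption: the ranges above overlap at 70% and leave 75% to 85% unspecified, so usage in that gap is reported here as high. The watermark_level helper is hypothetical and does not reflect GitLab's implementation.

```ruby
# Lower bounds (percent of disk used) taken from the ranges listed above,
# ordered from most to least severe. Using the lower bound of each range as
# the trigger point is an assumption made for this sketch.
WATERMARKS = { critical: 85, high: 70, low: 60 }.freeze

# Hypothetical helper: the most severe watermark reached, or :normal.
def watermark_level(used_percent)
  WATERMARKS.each do |level, lower_bound|
    return level if used_percent >= lower_bound
  end
  :normal
end

[50, 65, 72, 90].each do |used|
  puts "#{used}% used: #{watermark_level(used)}"
end
```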
Monitoring
Dashboards
There are a few dashboards to monitor Zoekt health:
- Zoekt Health Dashboard: Monitor search and indexing operations
- Zoekt memory usage: View memory utilization for Zoekt containers
- Zoekt OOM errors: View any Out Of Memory exceptions for Zoekt containers
- Zoekt pvc usage: View PVC volume capacity for Zoekt nodes
- Zoekt indexing locks in progress: View number of indexing locks (locks are per project)
- Zoekt Info Dashboard
Kibana logs
The GitLab application has a dedicated zoekt.log file for Zoekt-related log entries, which is handled by the standard logging infrastructure. You may also find indexing-related errors in sidekiq.log and search-related errors in production_json.log.
The gitlab-zoekt binary (in both indexer and webserver modes) writes logs to stdout.
Alerts
kube_persistent_volume_claim_disk_space
The Zoekt architecture has logic which detects when a node’s disk usage is over the limit. Projects are removed from that node until its disk usage is back under the limit. If disk usage is not coming down quickly enough, follow these steps in order:
- Remove namespaces manually
- As a last resort, mark the node as lost
WARNING: The PVC disk size must not be increased manually. Zoekt nodes are sized with a specific PVC size and it must remain consistent across all nodes.