Skip to content

Zlonk Service

Zlonk creates, prepares for use, and destroys ZFS clones.

In its first iteration, it is a simple shell script that is customized for cloned Postgres replicas. It was initially written to address the long data extraction outage we experienced around March - May 2021. Other uses uses include the generation of ad-hoc replicas for testing and clean-up purposes.

It currently only runs from cron on patroni-v12-zfs-01-db-grpd.

Zlonk is a simple wrapper to create and destroy clones, and, as such, have very little in the way of “Architecture”. However, there are a few concepts useful in understanding what it does and how it works.

Zlonk is called with two arguments: a project and an instance. A project is simply a grouping mechanism for instances. Instances map directly to a ZFS clone. For instance, the Data team uses the dailyx instance in the datalytics project.

The current version of Zlonks acts like a switch: when a clone is not present, it creates it (and confifgures a Postgres instance to use it); when it it present, it destroys it (after stopping the cloned Postgres replica).

The zpool0/pg_datasets/data12 dataset is under Zlonk control. This dataset is mounted as a file system for use by the cascaded Postgres replica under /var/opt/gitlab/postgresql/data12 much like any other database replica. Clones are created in zpool0/pg_datasets and are named with the project and instance.

zpool0/pg_datasets/data12
zpool0/pg_datasets/data12:datalytics.dailyx

Zlonk is invoked as follows (see crontab for examples)

bin/zlonk.sh <project> <instance>

Once invoked, it will behave in one of two ways:

  • If the project:instance clone does not exist, Zlonk will checkpoint postgres, snapshot the file system, create and mount a clone, and start Postgres, which will attempt recovery. Clones are mounted under /var/opt/gitlab/postgresql/zlonk/. At that point, the cloned replica, which is completely indepdendent of the cascaded replica, will be available on an alternate port.
  • If the project:instance clone does exist, Zlonk will fast-stop the cloned Postgres instance and destroy the clone.

There are no performance constraints relevant to Zlonk. However, we do have some timeout monitoring in place to ensure the clone and destruction tasks are executed within reason or fail to complete for some reason.

Zlonk implements job completion monitoring. Alerts are funneled to the #alerts Slack channel with a S4 severity.

Clone timeouts will generally occur when Postgres is unable to recover the database, which we have observed once since ZLonk went into operation two months ago.

Alert generated:

  • GitLab job has failed: The GitLab job “clone” resource “zlonk..” has failed.
  • Tier: db, Type: zlonk.postgres

We have not found a root cause, but the solution is to destroy and regenerate the clone, so run zlonk twice (see crontab for invocation).