Table of contents (generated with DocToc):
- Why partition?
- Timing
- Determining what to split off
- Implementation
- Application changes
- Infrastructure changes
- Observability
- Migration process
- Verification
Why partition?

Redis is a (mostly) single-threaded application and thus can only use one CPU. A small amount of CPU can be offloaded to IO threads, but only a small percentage. This means that we have a hard upper limit on CPU capacity for any specific redis instance. Partitioning allows us to get around this by splitting off part of the workload of an existing redis instance onto a new, separate redis instance. We are tracking this in capacity planning issues, and should have a reasonable amount of warning if this needs to be done.
In the longer term, we’re working on redis cluster to allow for horizontal scaling as well.
The other reason to partition redis is workload isolation, which will likely become the more common scenario once we have a functional horizontal scaling system.
Timing

The last two redis partitions that we have done took between 3 and 6 weeks to complete. The main factor that slows down or speeds up the work is the availability of reviewers for MRs, particularly for the application changes. In general, tamland forecasts capacity out approximately three months, which should give us adequate time to partition if required.
Determining what to split off

This is more of an art than a science, as it requires some knowledge of the usage patterns of the redis instance in question. We have documented how to analyze memory usage as well as keyspace usage and network traffic, all of which can be combined to help determine the largest chunk of the workload that makes logical sense to move.
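For a rough sense of how keys break down by namespace, something like the following Rails console sketch can help; it samples keys and aggregates estimated memory per top-level namespace. The Gitlab::Redis::SharedState store, the sample size, and the namespace split are all assumptions for illustration, not part of the documented analysis procedures linked above.

```ruby
# Rough, read-only sampling sketch (Rails console). Adjust the store, sample
# size, and namespace split to match the instance you are analyzing.
usage = Hash.new { |hash, namespace| hash[namespace] = { keys: 0, bytes: 0 } }

Gitlab::Redis::SharedState.with do |redis|
  redis.scan_each(count: 1_000).first(50_000).each do |key|
    namespace = key.split(':').first
    usage[namespace][:keys]  += 1
    usage[namespace][:bytes] += redis.call('MEMORY', 'USAGE', key).to_i
  end
end

usage.sort_by { |_, stats| -stats[:bytes] }.first(10).each do |namespace, stats|
  puts format('%-40s %8d keys  ~%d bytes (sampled)', namespace, stats[:keys], stats[:bytes])
end
```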
Implementation

A lot of these processes are still a work in progress and will be modified based on the results of the epic to lessen sharding toil.
Application changes

- Create a new redis store class. It should inherit from Gitlab::Redis::Wrapper. Configure its config_fallback to be the current store from which you're sharding off this workload (see the sketch after this list).
- Add use_primary_and_secondary_stores_for_<store_name> and use_primary_store_as_default_for_<store_name> feature flags, matching the name of the new store.
- Update the relevant client code to use the new store.
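As a rough illustration, a new store class typically looks like the sketch below, modelled on existing Gitlab::Redis classes. The FooBar name, the file path, and the SharedState fallback are placeholders; use whatever store your workload currently lives on.

```ruby
# lib/gitlab/redis/foo_bar.rb -- hypothetical new store class.
# "FooBar" and the SharedState fallback are placeholders for this sketch.
module Gitlab
  module Redis
    class FooBar < ::Gitlab::Redis::Wrapper
      class << self
        # Until the new instance is configured everywhere, fall back to the
        # store that currently holds this workload.
        def config_fallback
          SharedState
        end
      end
    end
  end
end
```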
Infrastructure changes

With the current k8s overhead due to the networking stack, we can only migrate workloads that are expected to remain well under the saturation threshold to Kubernetes (see discussion points here). This means that in most cases we'll likely be functionally sharding off onto VMs, but I've included the k8s documentation for when we can shift to using that more.
For VMs: steps 1 and 2 will leave the cluster in a strange state, with each redis host believing it is the primary and a somewhat confused sentinel quorum. To get things back in order, you'll need to run chef-client, run gitlab-ctl reconfigure, and set each secondary to be a replica of the primary. The redis-reconfigure.sh script does this for you:
./scripts/redis-reconfigure.sh $ENVIRONMENT $INSTANCE_NAME bootstrap
For Kubernetes:

- Create a new nodepool in terraform
- Update the tanka clusters
- Add secret to vault
Redis-cluster

We presently do not have a well-known method of creating a redis cluster shard, but that is also a work in progress.

Observability

The new redis type will require a new set of dashboards created from the redis archetype. Example MR.
Migration process

- Create MRs to configure gitlab-rails for the new instance. This needs to be done in both chef and k8s. Example MRs.
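Once those MRs are deployed, a quick Rails console sanity check can confirm that the new store resolves its own connection rather than the fallback. FooBar is again a placeholder class name from the application changes step:

```ruby
# Hypothetical sanity check: the new store should answer PING and report the
# connection details of the new instance, not of the fallback store.
Gitlab::Redis::FooBar.with do |redis|
  puts redis.ping       # => "PONG"
  p redis.connection    # host/port/id of the instance this store connected to
end
```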
- Update chef-repo vault with the correct key.
In the chef-repo repository, run the following command, and then update the section with the correct information.
$ bin/gkms-vault-edit gitlab-omnibus-secrets gprd
"gitlab-rails": {... "redis_yml_override": { "db_load_balancing": { "url": "redis://<IAMAPASSWORD>@gprd-redis-db-load-balancing" } },...
- Use feature flags to turn on and off the dual writes
Use feature flags to transition to the new store. Between each step, check error metrics and error logs.
You should also let enough time elapse between feature toggles to "warm up" the new store. The amount of time required to warm up a new instance depends on the usage pattern. Often, taking the avg_ttl from info keyspace (which is in milliseconds) and multiplying it by two gives a pretty good estimate. For some usage patterns, we do not have a TTL set, and those will require a different method of rollout. See scalability #2193 for more information.
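A back-of-the-envelope version of that calculation from a Rails console, using the hypothetical Gitlab::Redis::FooBar store from earlier (substitute the store whose workload you are migrating):

```ruby
# Rough warm-up estimate based on the avg_ttl reported by INFO keyspace.
# redis-rb returns the keyspace section as e.g.
#   { "db0" => "keys=8319,expires=8319,avg_ttl=14836" }
keyspace = Gitlab::Redis::FooBar.with { |redis| redis.info('keyspace') }
avg_ttl_ms = keyspace.fetch('db0', '')[/avg_ttl=(\d+)/, 1].to_i
puts "estimated warm-up: ~#{(avg_ttl_ms * 2) / 1000.0} seconds"
```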
The sequence of feature flag toggles you want to follow is (a Rails console sketch for toggling them follows this list):

- Turn on use_primary_and_secondary_stores_for_<store_name>
- Turn on use_primary_store_as_default_for_<store_name>
- Turn off use_primary_and_secondary_stores_for_<store_name>
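These are ordinary GitLab feature flags, so they can be toggled via ChatOps or from a Rails console. A minimal console sketch, assuming a placeholder store named foo_bar:

```ruby
# Step 1: start dual writes to both the old and the new store.
Feature.enable(:use_primary_and_secondary_stores_for_foo_bar)

# Step 2: once the new store has warmed up, make it the default.
Feature.enable(:use_primary_store_as_default_for_foo_bar)

# Step 3: after checking error metrics and logs, stop writing to the old store.
Feature.disable(:use_primary_and_secondary_stores_for_foo_bar)
```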
After the first feature flag is toggled, you should begin to see activity on your new instance. Look for overall RPS, primary RPS and connected clients on the appropriate Grafana dashboard. Another good command to use is info keyspace on the new redis instance.
Before dual writes:
sudo gitlab-redis-cli
0.0.0.0:6379> info keyspace
0.0.0.0:6379>
After dual writes:
0.0.0.0:6379> info keyspace
db0:keys=8319,expires=8319,avg_ttl=14836
Double check the state of the appropriate feature flags, as there has been a lot of discussion around how MultiStore works.
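If in doubt, a quick Rails console check of the two flags (again using the placeholder store name foo_bar) shows which MultiStore mode is currently active:

```ruby
# The first flag controls dual writes to the old and new stores; the second
# controls which store is treated as the default.
%i[
  use_primary_and_secondary_stores_for_foo_bar
  use_primary_store_as_default_for_foo_bar
].each do |flag|
  puts "#{flag}: #{Feature.enabled?(flag)}"
end
```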
Verification

Dashboards for both the new and old redis stores will provide some good insights into the before and after usage.
This thanos query, which compares week-over-week primary_cpu usage, is also a good one to use.