Provisioning Redis Cluster

This document outlines the steps for provisioning a Redis Cluster. Former attempts are documented here.

First, this guide defines a few variables in <>, namely:

  • ENV: pre, gstg, gprd
  • GCP_PROJECT: gitlab-production or gitlab-staging-1
  • INSTANCE_TYPE: feature-flag
  • RAILS_INSTANCE_NAME: The name that GitLab Rails recognises. This matches redis.xxx.yml or the second-level key in redis.yml (the top-level key being production)
  • RAILS_INSTANCE_NAME_OMNIBUS: RAILS_INSTANCE_NAME but using underscore, i.e. feature_flag instead of feature-flag
  • RAILS_CLASS_NAME: The class which would connect to the new Redis Cluster. e.g. Gitlab::Redis::FeatureFlag.
  • REPLICA_REDACTED: Generated in Generate Redis passwords step
  • RAILS_REDACTED: Generated in Generate Redis passwords step
  • EXPORTER_REDACTED: Generated in Generate Redis passwords step
  • CONSOLE_REDACTED: Generated in Generate Redis passwords step
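
For illustration, a prepared session for a hypothetical feature-flag instance in gstg might substitute the variables as shell exports like this (example values only; the four passwords come from the Generate Redis passwords step below):

export ENV=gstg
export GCP_PROJECT=gitlab-staging-1
export INSTANCE_TYPE=feature-flag
export RAILS_INSTANCE_NAME=feature-flag
export RAILS_INSTANCE_NAME_OMNIBUS=feature_flag
export RAILS_CLASS_NAME=Gitlab::Redis::FeatureFlag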

When configuring the application, note that the name of the instance must match the class name in lowercase kebab-case/snake-case. For example, the service is redis-cluster-chat-cache, but in GitLab Rails the class is Gitlab::Redis::Chat; hence chat should be used when configuring the secret for the application in the console and in Kubernetes.

Note: To avoid mistakes in manually copy-pasting the variables in <> above during a provisioning session, it is recommended to prepare this doc with all the variables replaced beforehand.

Generate four passwords, REPLICA_REDACTED, RAILS_REDACTED, EXPORTER_REDACTED, and CONSOLE_REDACTED using:

for I in REPLICA_REDACTED RAILS_REDACTED EXPORTER_REDACTED CONSOLE_REDACTED; do echo $I; openssl rand -hex 32; done

Update both the redis-cluster and redis-exporter GKMS vault secrets using these commands in the chef-repo:

./bin/gkms-vault-edit redis-cluster <ENV>

Update the JSON payload to include the new instance details:

{
  ...,
  "redis-cluster-<INSTANCE_TYPE>": {
    "redis_conf": {
      "masteruser": "replica",
      "masterauth": "<REPLICA_REDACTED>",
      "user": [
        "default off",
        "replica on ~* &* +@all ><REPLICA_REDACTED>",
        "console on ~* &* +@all ><CONSOLE_REDACTED>",
        "redis_exporter on +client +ping +info +config|get +cluster|info +slowlog +latency +memory +select +get +scan +xinfo +type +pfcount +strlen +llen +scard +zcard +hlen +xlen +eval allkeys ><EXPORTER_REDACTED>",
        "rails on ~* &* +@all -debug ><RAILS_REDACTED>"
      ]
    }
  }
}
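
To double-check the edit, you can read the secret back; this assumes the chef-repo also ships a read-only ./bin/gkms-vault-show helper alongside gkms-vault-edit:

./bin/gkms-vault-show redis-cluster <ENV>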

Do the same for the redis-exporter secret:

./bin/gkms-vault-edit redis-exporter <ENV>

Modify the existing JSON:

{
  "redis_exporter": {
    "redis-cluster-<INSTANCE_TYPE>": {
      "env": {
        "REDIS_PASSWORD": "<EXPORTER_REDACTED>"
      }
    }
  }
}

Create the new Chef roles and add the new role to the list of gitlab-redis roles in the <ENV>-infra-prometheus-server role.

An example MR can be found here.

Provision the VMs via the generic-stor/google Terraform module. This is done in the config-mgmt project in the ops environment. An example MR can be found here.

After the MR is merged and applied, check the VM state via:

gcloud compute instances list --project <GCP_PROJECT> | grep 'redis-cluster-<INSTANCE_TYPE>'

You need to wait for the initial chef-client run to complete.

One way is to tail the serial port output and watch for the initial run to complete. For example:

gcloud compute --project=<GCP_PROJECT> instances tail-serial-port-output redis-cluster-<INSTANCE_TYPE>-shard-01-01-db-<ENV> --zone us-east1-{c/b/d}
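
Alternatively, as a one-shot check instead of tailing, you can grep the full serial output for the Chef run summary (the exact log line varies by Chef version; "Chef Client finished" is typical):

gcloud compute --project=<GCP_PROJECT> instances get-serial-port-output \
  redis-cluster-<INSTANCE_TYPE>-shard-01-01-db-<ENV> --zone us-east1-c | grep -i 'chef client finished'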

a. SSH into one of the instances:

ssh redis-cluster-<INSTANCE_TYPE>-shard-01-01-db-<ENV>.c.<GCP_PROJECT>.internal

b. Run the following:

export ENV=<ENV>
export PROJECT=<GCP_PROJECT>
export DEPLOYMENT=redis-cluster-<INSTANCE_TYPE>

c. Use the following command to connect the master nodes and initialise a working cluster. Add more shard FQDNs where necessary, e.g. $DEPLOYMENT-shard-<shard_number>-01-db-$ENV.c.$PROJECT.internal:6379.

sudo gitlab-redis-cli --cluster create \
  $DEPLOYMENT-shard-01-01-db-$ENV.c.$PROJECT.internal:6379 \
  $DEPLOYMENT-shard-02-01-db-$ENV.c.$PROJECT.internal:6379 \
  $DEPLOYMENT-shard-03-01-db-$ENV.c.$PROJECT.internal:6379

Use the following command to connect the remaining nodes to the cluster. Update {01, 02, 03, ... n}-{02,03,..m} where n is the number of shards and m is the number of instances per shard.

for i in {01,02,03}-{02,03}; do
  sudo gitlab-redis-cli --cluster add-node \
    $DEPLOYMENT-shard-$i-db-$ENV.c.$PROJECT.internal:6379 \
    $DEPLOYMENT-shard-01-01-db-$ENV.c.$PROJECT.internal:6379
  sleep 2
done
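
As a quick sanity check, the cluster should now know about all n × m nodes; for the 3-shard, 3-nodes-per-shard example above, this should print 9 lines:

sudo gitlab-redis-cli cluster nodes | wc -l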

Use the following command to assign the replicas within each shard. Update {01,02,03,...n}-{02,03,...m} where necessary depending on the cluster size.

for i in {01,02,03}; do
  for j in {02,03}; do
    node_id="$(sudo gitlab-redis-cli cluster nodes | grep $DEPLOYMENT-shard-$i-01-db-$ENV.c.$PROJECT.internal | awk '{ print $1 }')"
    sudo gitlab-redis-cli -h $DEPLOYMENT-shard-$i-$j-db-$ENV.c.$PROJECT.internal \
      cluster replicate $node_id
  done
done

Wait for a few seconds as the nodes need time to gossip. Check the status via:

$ sudo gitlab-redis-cli --cluster info $DEPLOYMENT-shard-01-01-db-$ENV.c.$PROJECT.internal:6379
redis-cluster-ratelimiting-shard-01-01-db-gprd.c.gitlab-production.internal:6379 (9b0828e3...) -> 0 keys | 5461 slots | 2 slaves.
10.217.21.3:6379 (ac03fcee...) -> 0 keys | 5461 slots | 2 slaves.
10.217.21.4:6379 (f8341afd...) -> 0 keys | 5462 slots | 2 slaves.
[OK] 0 keys in 3 masters.
0.00 keys per slot on average.
$ sudo gitlab-redis-cli cluster info | head -n7
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:9
cluster_size:3
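
Optionally, sweep every node to confirm each one reports cluster_state:ok. A minimal sketch, reusing the hostname pattern from the previous steps:

for i in {01,02,03}-{01,02,03}; do
  sudo gitlab-redis-cli -h $DEPLOYMENT-shard-$i-db-$ENV.c.$PROJECT.internal cluster info | grep cluster_state
done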

a. Proxy and authenticate to HashiCorp Vault:

glsh vault proxy
export VAULT_PROXY_ADDR="socks5://localhost:18200"
glsh vault login
vault kv get -format=json chef/env/<ENV>/shared/gitlab-omnibus-secrets | jq '.data.data' > data.json
jq --arg PASSWORD '<RAILS_REDACTED>' '."omnibus-gitlab".gitlab_rb."gitlab-rails".redis_yml_override.<RAILS_INSTANCE_NAME_OMNIBUS>.password = $PASSWORD' data.json > data.json.tmp
diff -u data.json data.json.tmp
mv data.json.tmp data.json
vault kv patch chef/env/<ENV>/shared/gitlab-omnibus-secrets @data.json
rm data.json

Alternatively, edit the secret interactively:

glsh vault edit-secret chef env/<ENV>/shared/gitlab-omnibus-secrets
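
Either way, you can verify that the override landed by reading the key back:

vault kv get -format=json chef/env/<ENV>/shared/gitlab-omnibus-secrets \
  | jq '.data.data."omnibus-gitlab".gitlab_rb."gitlab-rails".redis_yml_override.<RAILS_INSTANCE_NAME_OMNIBUS>'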

Update roles/<ENV>-base.json with the relevant connection details. An example MR can be found here.

Confirm the connection details using gitlab-rails console inside a console instance. You may need to run chef-client first to update the node and render the updated configuration files. This is important because the pipeline does not check the correctness of the config files, and a bad configuration can impact the deploy node, since GitLab Rails connects to Redis instances on start-up. There was a past incident of such an issue for reference.

[ gstg ] production> Gitlab::Redis::FeatureFlag.with{|c| c.ping} # replace Gitlab::Redis::FeatureFlag with <RAILS_CLASS_NAME>
=> "PONG"
[ gstg ] production>
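
To also confirm which hosts the class resolved, you can inspect its parsed configuration. This assumes <RAILS_CLASS_NAME> inherits the params helper from Gitlab::Redis::Wrapper, as the existing Gitlab::Redis classes do:

[ gstg ] production> Gitlab::Redis::FeatureFlag.params # replace Gitlab::Redis::FeatureFlag with <RAILS_CLASS_NAME>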

a. Update the secret:

vault kv put k8s/env/<ENV>/ns/gitlab/redis-cluster-<INSTANCE_TYPE>-rails password=<RAILS_REDACTED>

For example,

vault kv get k8s/env/<ENV>/ns/gitlab/redis-cluster-<INSTANCE_TYPE>-rails
======================== Secret Path ========================
k8s/data/env/gprd/ns/gitlab/redis-cluster-ratelimiting-rails

======= Metadata =======
Key                Value
---                -----
created_time       2023-03-18T00:33:29.790293426Z
custom_metadata    <nil>
deletion_time      n/a
destroyed          false
version            1

====== Data ======
Key         Value
---         -----
password    <RAILS_REDACTED>

Note the version of the password in vault kv get and make sure it tallies with the external secret definition in k8s-workload:

gitlab-redis-cluster-<INSTANCE_TYPE>-rails-credential-v1:
  refreshInterval: 0
  secretStoreName: gitlab-secrets
  target:
    creationPolicy: Owner
    deletionPolicy: Delete
  data:
    - remoteRef:
        key: env/{{ $env }}/ns/gitlab/redis-cluster-<INSTANCE_TYPE>-rails
        property: password
        version: "1"
      secretKey: password
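
Once the k8s-workload change is deployed, a quick way to confirm the secret synced (assuming External Secrets Operator and kubectl access to the gitlab namespace):

kubectl -n gitlab get externalsecret gitlab-redis-cluster-<INSTANCE_TYPE>-rails-credential-v1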

Note that when rotating secrets (e.g. having v1 in gstg and v2 in gprd), follow a safe and controlled rollout as described here.

b. Update GitLab Rails .Values.global.redis accordingly.

Either add a new key to .Values.global.redis.<RAILS_INSTANCE_NAME> or .Values.global.redis.redisYmlOverride.<RAILS_INSTANCE_NAME>. An example MR can be found here.
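
As a rough sketch of the redisYmlOverride variant (the key names below are illustrative; confirm them against the chart and the example MR before use):

global:
  redis:
    redisYmlOverride:
      <RAILS_INSTANCE_NAME>:
        cluster:
          - host: redis-cluster-<INSTANCE_TYPE>-shard-01-01-db-<ENV>.c.<GCP_PROJECT>.internal
            port: 6379
        password:
          enabled: true
          secret: gitlab-redis-cluster-<INSTANCE_TYPE>-rails-credential-v1
          key: password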

Known issue: new nodes may be missing from Prometheus until the Prometheus server picks up the new role. This was encountered when provisioning the production instance of redis-cluster-chat-cache. To resolve it, run chef-client on the Prometheus server from the chef-repo with:

knife ssh roles:gprd-infra-prometheus-server "sudo chef-client"