Troubleshooting

Is redis running?

Grafana

Grafana dashboards:

For example, there is a Grafana chart showing number of slowlog events in redis-sidekiq (not linking it here because the panel ID changes when Grafana dashboards are deployed).

Prometheus/Thanos

https://thanos.gitlab.net/graph?g0.range_input=1w&g0.expr=redis_up%20%3C%201&g0.tab=0

directly on a redis host

is redis up?
- gitlab-ctl status
can we dial redis?
- telnet localhost 6379
can we talk to redis via redis-cli?

REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2)
/opt/gitlab/embedded/bin/redis-cli info

How to get redis stats?

Grafana

see above

Prometheus/Thanos

Redis hosts are running the redis_exporter . It is scraped by Prometheus. See the exporter documentation for more details.

Example redis_exporter metric in Thanos.

redis-cli

Run:

NOTE: DO NOT USE [MONITOR] COMMAND! It streams back every command processed by the Redis server which can double the use of resources. It can overload a production machine making it unresponsive or causing an OOM kill.

NOTE: DO NOT USE [KEYS] COMMAND! It will overload a production machine.

What is causing Redis to slowdown?

The application relies on Redis throughput to be very high, latency spikes can be detrimental to the operation of the entire application.

Prometheus/Thanos/Grafana

Explore Prometheus/Thanos/Grafana. Historical metrics might suggest a sudden change in the application behavior or traffic, for example:

operations rate, 7d: https://dashboards.gitlab.net/d/redis-cache-main/redis-cache-overview?orgId=1&from=now-7d&to=now&fullscreen&panelId=54
keys rate of change, 7d: https://dashboards.gitlab.net/d/redis-cache-main/redis-cache-overview?orgId=1&from=now-7d&to=now&fullscreen&panelId=64

For Grafana links see above.

slowlog

The slowlog records slow Redis queries.

Redis SLOWLOG command documentation.

Get top 10 Redis slowlog entries:

> slowlog get 10
1) 1) (integer) 5100            # A unique progressive identifier for every slow log entry.
   2) (integer) 1561019091      # The unix timestamp at which the logged command was processed.
   3) (integer) 21390           # The amount of time needed for its execution, in microseconds.
   4) 1) "del"                  # The array composing the arguments of the command.
      2) "cache:gitlab:242234:8213877:Ci::CompareTestReportsService"

To convert the timestamp, use date -d @1561019091.

Get the command execution time threshold at which commands are logged (in microseconds):

> config get slowlog-log-slower-than

Get size of slowlog (entries are discarded like in a FIFO queue):

> config get slowlog-max-len

Monitoring number of slowlog entries

The number of entries added to the slowlog is exposed as a Prometheus metric and there is a Grafana chart for it.

Monitoring the rate of change in the slowlog

A useful metric for monitoring potential slow-downs in Redis is measuring the rate of change in the redis_slowlog_last_id.

This can be done by plotting (changes(redis_slowlog_last_id[1h])](../https://prometheus.gprd.gitlab.net/graph?g0.range_input=1d&g0.expr=changes(redis_slowlog_last_id%5B1h%5D)&g0.tab=0).

Redis latency monitoring framework

Redis provides a latency diagnostic tool: https://redis.io/topics/latency-monitor

You may need to enable it with CONFIG SET latency-monitor-threshold 100.

From https://redis.io/topics/latency-monitor :

By default monitoring is disabled (threshold set to 0), even if the actual cost of latency monitoring is near zero. However while the memory requirements of latency monitoring are very small, there is no good reason to raise the baseline memory usage of a Redis instance that is working well.

LATENCY DOCTOR

> CONFIG SET latency-monitor-threshold 100
> LATENCY DOCTOR
Dave, I have observed latency spikes in this Redis instance.
You don't mind talking about it, do you Dave?

1. command: 5 latency spikes (average 300ms, mean deviation 120ms,
   period 73.40 sec). Worst all time event 500ms.

I have a few advices for you:

- Your current Slow Log configuration only logs events that are
  slower than your configured latency monitor threshold. Please
  use 'CONFIG SET slowlog-log-slower-than 1000'.
- Check your Slow Log to understand what are the commands you are
  running which are too slow to execute. Please check
  http://redis.io/commands/slowlog for more information.
- Deleting, expiring or evicting (because of maxmemory policy)
  large objects is a blocking operation. If you have very large
  objects that are often deleted, expired, or evicted, try to
  fragment those objects into multiple smaller objects.

 > CONFIG SET latency-monitor-threshold 0

generic debugging/troubleshooting tools

gdb

https://redis.io/topics/debugging

redis-cli sub-commands

—latency

—latency-history

redis-cli has a useful command-line argument --latency-history that issues PING commands to a Redis server to measure its responsiveness. For example:

$ /opt/gitlab/embedded/bin/redis-cli --latency-history  -h 10.217.5.102
min: 0, max: 67, avg: 8.65 (799 samples) -- 15.00 seconds range
min: 0, max: 62, avg: 9.03 (783 samples) -- 15.01 seconds range
min: 0, max: 50, avg: 8.53 (802 samples) -- 15.00 seconds range
min: 0, max: 61, avg: 7.96 (830 samples) -- 15.02 seconds range
min: 0, max: 110, avg: 7.32 (860 samples) -- 15.01 seconds range
min: 0, max: 30, avg: 2.28 (1206 samples) -- 15.00 seconds range
min: 0, max: 82, avg: 5.39 (966 samples) -- 15.01 seconds range
min: 0, max: 108, avg: 19.62 (504 samples) -- 15.00 seconds range
min: 0, max: 57, avg: 13.87 (625 samples) -- 15.01 seconds range
min: 0, max: 57, avg: 7.82 (836 samples) -- 15.03 seconds range
min: 0, max: 45, avg: 5.28 (972 samples) -- 15.00 seconds range

This test will run indefinitely until you kill it, but the avg time here is important. The first line shows that on average, a single Redis command took 8 ms to respond—too slow! A healthy looking run returns averages well under a millisecond:

$ /opt/gitlab/embedded/bin/redis-cli --latency-history  -h  10.217.5.101
min: 0, max: 1, avg: 0.10 (1472 samples) -- 15.01 seconds range
min: 0, max: 1, avg: 0.10 (1470 samples) -- 15.00 seconds range
min: 0, max: 2, avg: 0.10 (1470 samples) -- 15.00 seconds range
min: 0, max: 2, avg: 0.11 (1470 samples) -- 15.01 seconds range
min: 0, max: 2, avg: 0.11 (1471 samples) -- 15.01 seconds range

There may be a number of causes for the latency:

Number of client connections: check the number of active TCP connections on the Redis host.
Slow background saves
Key evictions

See https://tech.trivago.com/2017/01/25/learn-redis-the-hard-way-in-production/ more information.

—latency-dist

—bigkeys (find biggest keys)

(uses SCAN command)

—scan (get a list of keys matching a pattern)

redis-cli -a $REDIS_PASS --scan --patern "resque:*"

Get the number of connections per Redis client IP

On a redis host:

sudo lsof -i tcp:6379 | grep ESTABLISHED | sed -E "s/.*6379->(.*):.* \(ESTABLISHED\)/\1/g" | sort | uniq -c | sort -nr

redis-memory-analyzer

https://github.com/gamenet/redis-memory-analyzer

Ruby script to sample Redis keys

Here’s a quick and dirty script to sample Redis key with a REDIS_URL defined:

require 'redis'
require 'uri'

if ENV['REDIS_URL'].nil? || ENV['REDIS_URL'].empty?
  abort "Error: REDIS_URL environment variable is not set.\nUsage: export REDIS_URL='redis://localhost:6379/0'"
end

redis = Redis.new(url: ENV['REDIS_URL'])

# Hash to store grouped keys
key_groups = Hash.new(0)

# Sample counter
sampled_count = 0
max_samples = 1_000_000

puts "Sampling up to #{max_samples} Redis keys..."
puts "Connected to: #{ENV['REDIS_URL']}"
puts

# Use SCAN to iterate through keys efficiently
cursor = "0"
loop do
  # SCAN returns [next_cursor, keys_array]
  cursor, keys = redis.scan(cursor, count: 1000)

  keys.each do |key|
    break if sampled_count >= max_samples

    # Extract first 3 colon-separated parts
    parts = key.split(':', 4) # Split into max 4 parts (we only need first 3)
    prefix = parts[0..2].join(':')

    key_groups[prefix] += 1
    sampled_count += 1
  end

  break if sampled_count >= max_samples || cursor == "0"
end

puts "Sampled #{sampled_count} keys"
puts "\nKey groups (sorted by count):\n\n"

# Sort by count (descending) and display
key_groups.sort_by { |_prefix, count| -count }.each do |prefix, count|
  percentage = (count.to_f / sampled_count * 100).round(2)
  puts "#{prefix.ljust(50)} #{count.to_s.rjust(10)} (#{percentage}%)"
end

puts "\nTotal unique prefixes: #{key_groups.size}"

redis.close

analyze TTL of keys on Redis

If you need to analyze the TTL of keys for a specific pattern, you can use this script:

require 'redis'

if ENV['REDIS_URL'].nil? || ENV['REDIS_URL'].empty?
  abort "Error: REDIS_URL environment variable is not set.\nUsage: export REDIS_URL='redis://localhost:6379/0'"
end

# Get key pattern from command line argument or use default
if ARGV.empty?
  puts "Usage: ruby script.rb <key_pattern> [max_samples]"
  puts "Example: ruby script.rb 'session:gitlab:2:*' 10000"
  puts "\nNo pattern provided. Using default pattern."
  key_pattern = '*'
else
  key_pattern = ARGV[0]
end

# Get max samples from second argument or use default
max_samples = ARGV[1] ? ARGV[1].to_i : 10_000

# Connect to Redis
redis = Redis.new(url: ENV['REDIS_URL'])

# Configuration
puts "Sampling TTLs for keys matching: #{key_pattern}"
puts "Connected to: #{ENV['REDIS_URL']}"
puts "Max samples: #{max_samples}"
puts

# Store TTL values
ttls = []
no_expiry_count = 0
not_found_count = 0
sampled_count = 0

# Use SCAN with MATCH to find matching keys
cursor = "0"
loop do
  cursor, keys = redis.scan(cursor, match: key_pattern, count: 1000)

  keys.each do |key|
    break if sampled_count >= max_samples

    ttl = redis.ttl(key)
    case ttl
    when -1
      no_expiry_count += 1
    when -2
      not_found_count += 1
    else
      ttls << ttl
    end

    sampled_count += 1
  end

  break if sampled_count >= max_samples || cursor == "0"
end

puts "Sampled #{sampled_count} keys\n"

if ttls.empty?
  puts "No keys with TTL found."
else
  # Calculate statistics
  sorted_ttls = ttls.sort
  min_ttl = sorted_ttls.first
  max_ttl = sorted_ttls.last
  avg_ttl = ttls.sum.to_f / ttls.size
  median_ttl = sorted_ttls[ttls.size / 2]

  puts "TTL Statistics (in seconds):"
  puts "-" * 60
  puts "Keys with TTL: #{ttls.size}"
  puts "Keys without expiry: #{no_expiry_count}"
  puts "Keys not found: #{not_found_count}"
  puts
  puts "Min TTL: #{min_ttl}s (#{(min_ttl / 60.0).round(1)} minutes)"
  puts "Max TTL: #{max_ttl}s (#{(max_ttl / 3600.0).round(1)} hours)"
  puts "Average TTL: #{avg_ttl.round(2)}s (#{(avg_ttl / 60.0).round(1)} minutes)"
  puts "Median TTL: #{median_ttl}s (#{(median_ttl / 60.0).round(1)} minutes)"
  puts

  # TTL distribution buckets
  puts "TTL Distribution:"
  puts "-" * 60

  buckets = {
    "< 1 minute" => 0,
    "1-5 minutes" => 0,
    "5-30 minutes" => 0,
    "30-60 minutes" => 0,
    "1-6 hours" => 0,
    "6-24 hours" => 0,
    "> 24 hours" => 0
  }

  ttls.each do |ttl|
    case ttl
    when 0...60
      buckets["< 1 minute"] += 1
    when 60...300
      buckets["1-5 minutes"] += 1
    when 300...1800
      buckets["5-30 minutes"] += 1
    when 1800...3600
      buckets["30-60 minutes"] += 1
    when 3600...21600
      buckets["1-6 hours"] += 1
    when 21600...86400
      buckets["6-24 hours"] += 1
    else
      buckets["> 24 hours"] += 1
    end
  end

  buckets.each do |range, count|
    percentage = (count.to_f / ttls.size * 100).round(2)
    bar = "█" * (percentage / 2).to_i
    puts "#{range.ljust(20)} #{count.to_s.rjust(8)} (#{percentage.to_s.rjust(6)}%) #{bar}"
  end
end

redis.close

analyze memory usage on redis

$ rdb -c memory dump.rdb | ruby redis-analysis-tool.rb
$ cat redis-analysis-tool.rb

  count = 0

  sizes = { }

  ARGF.each do |line|

    next unless line =~ /^\d+,string,(\w+):.*?,(\d+)/

    sizes[$1] = (sizes[$1] || 0) + $2.to_i

    count = count + 1

    if (sizes.keys.size > 10000) || (count % 100000 == 0) then

      sizes.each do |key, size|

        puts "#{key}:#{size}\n"

        sizes = { }

      end

    end

  end

  sizes.each do |key, size|

    puts "#{key}:#{size}\n"

    sizes = { }

  end

Keyspace pattern analysis

https://gitlab.com/gitlab-com/gl-infra/redis-keyspace-analyzer

This tool can perform an offline analysis of patterns in the keyspace. Its input is a dump of keys and key sizes, generated with a tool you can find in the repo. With human supervision it can detect key patterns, show their frequency, and show aggregate memory use per pattern.

Analyze network traffic on a Redis host

This guide describes a technique that will not have a major performance impact on a Redis host. It consists of the following:

Capture Redis traffic using tcpdump.
Split the packet capture into separate flows using tcp-flow.
Run a custom script to aggregate the results.

Capture traffic and download it to your local machine

On the master Redis server, capture TCP packets and compress them with the following commands:

$ df -Th /var/log # confirm there's enough disk space

$ sudo mkdir -p /var/log/pcap-$USER
$ cd /var/log/pcap-$USER
$ sudo chown $USER:$USER .

$ sudo tcpdump -G 30 -W 1 -s 65535 tcp port 6379 -w redis.pcap -i ens4
tcpdump: listening on ens4, link-type EN10MB (Ethernet), capture size 65535 bytes
676 packets captured
718 packets received by filter
0 packets dropped by kernel

It may be cheaper to capture only incoming traffic:

sudo tcpdump -G 30 -W 1 -s 65535 tcp dst port 6379 -w redis.pcap -i ens4

Compression:

gzip redis.pcap

now download the capture with:

scp redis-cache-01-db-gstg.c.gitlab-staging-1.internal:redis.pcap.gz .

remember to remove the pcap file once you’re done!

Split the packet capture using tcpflow

install tcpflow (on MacOS: brew install tcpflow)
split the packet capture into separate tcpflows:

tcpflow -I -s -o redis-analysis -r redis.pcap.gz
cd redis-analysis

Analyze Redis traffic

count redis commands

Get the number of commands send to redis:

$ find . -name '*.06379'|xargs -n 1 perl -0777  -pe 's/\*\d+\r\n\$\d+\r\n(\w+)\r\n\$\d+\r\n([\w\d:]+)/command: $1 $2/gsx;'|grep -a '^command'|grep -v "command: auth "|sort|uniq -c|sort -nr > ./script_report
$ less ./script_report
70334 command: setex peek:requests:
69205 command: get cache:gitlab:geo:current_node:12.0.0-pre:5.1.7
69178 command: get cache:gitlab:geo:node_enabled:12.0.0-pre:5.1.7
65642 command: get cache:gitlab:flipper/v1/feature/enforced_sso_requires_session
(...)

keyspace analysis

The redis trace script parses out flows into a timeline of commands, one line per key. The fields are: timestamp, second offset, command, src host, key pattern, key.

It has some pre-canned key pattern extractions that can be enabled via GITLAB_REDIS_CLUSTER. Supported values are: persistent, cache.

The script can be tweaked or its output further processed with awk and friends.

find redis-analysis -name '*.06379.findx' | GITLAB_REDIS_CLUSTER=cache parallel -j0 -n100 ruby runbooks/scripts/redis_trace_cmd.rb | sed '/^$/d' > trace.txt
gsort --parallel=8 trace.txt -o trace.txt

For example, count per key pattern:

cat trace.txt | awk '{ print $5 }' | sort -n | uniq -c | sort -nr

It is also possible to output in JSON format for processing via jq:

find redis-analysis -name '*.06379.findx' | GITLAB_REDIS_CLUSTER=cache OUTPUT_FORMAT=json parallel -j0 -n100 ruby runbooks/scripts/redis_trace_cmd.rb | sed '/^$/d' > trace.json

This allows for a similar count per command and key pattern:

cat trace.json | jq -c '[.cmd, .patterns]' | sort | uniq -c | sort -rn | head

key size estimation

The following commands are meant to be run on a replica instance, for example redis-cache-01-db-gprd.

In this example, we’re filtering the dump output for Class:merge_requests, replace this with your keyname.

$ sudo gitlab-redis-cli bgsave

# Monitor the file on disk, once it stops increasing in size, it's ready to be used!
$ sudo ls -lta /var/opt/gitlab/redis/dump.rdb

# Once the file is ready, move it to a safe-er location, for example
$ sudo mv /var/opt/gitlab/redis/dump.rdb /var/log/redis-data/
$ RDB_FILE_PATH=/var/log/redis-data

# build the `dump` binary in your local machine (https://github.com/igorwwwwwwwwwwwwwwwwwwww/rdb/tree/version-9)
###  $ git clone https://github.com/igorwwwwwwwwwwwwwwwwwwww/rdb
###  $ cd rdb
###  $ git checkout version-9
###  $ GOOS=linux GOARCH=amd64 go build ./cmd/dump
###  $ scp dump redis-cache-01-db-gprd:

# Now we'll use the `dump` binary to analyze `dump.rdb`
$ sudo ./dump $RDB_FILE_PATH/dump.rdb | awk -F'\t' '$1 ~ /Class:merge_requests/ { sum1 += $3; sum2 += $4 } END { print sum1, sum2 }'
# The two values you get from this represent estimates in bytes used for values and keys+values respectively.
6039116565 6549803309

# Convert to GiB
$ echo $((6039116565.0/(1024.0**3.0)))
5.6243655877187848

The values presented are an optimistic estimate, as redis will require some more memory for its datastructures. Generally, the key size will be on that order of magnitude.

The current maxmemory in Redis-cache is set to 60 GiB. Depending on the numbers you get, the ratio of each compared to the maxmemory can give you an idea of how significant of an impact your change might introduce.

Please remember to delete the RDB file once you’re done!

rm $RDB_FILE_PATH/dump.rdb

Please remember to delete the `pcap` file immediately after performing the analysis

CPU profiling

CPU profiles are useful for diagnosing CPU saturation. Especially since redis is (mostly) single-threaded, CPU can become a bottleneck.

A profile can be captured via perf:

sudo mkdir -p /var/log/perf-$USER
cd /var/log/perf-$USER
sudo chown $USER:$USER .

sudo perf record -p $(pidof redis-server) -F 497 --call-graph dwarf --no-inherit -- sleep 300
sudo perf script --header | gzip > stacks.$(hostname).$(date --iso-8601=seconds).gz
sudo rm perf.data

This will sample stacks at ~500hz.

Those stack traces can then be downloaded and analyzed with flamescope or flamegraph.

scp $host:/var/log/perf-\*/stacks.\*.gz .
cat stacks.$host.$time.gz | gunzip - | ~/code/FlameGraph/stackcollapse-perf.pl | ~/code/FlameGraph/flamegraph.pl > flamegraph.svg

Accessing Redis via the Rails console

Sometimes you may wish to query a production Redis server from a Rails console. Either because you don’t have sufficient access to run redis-cli, or because you are running a query that is easier expressed in Ruby than with redis-cli.

You probably want to use a Redis secondary to do this. This is how you instantiate a Ruby Redis client for a secondary:

redis = Redis.new(Gitlab::Redis::SharedState.params.merge(role: :slave))

Substitute Cache,Queues, TraceChunks, RateLimiting, or Sessions for ‘SharedState’ to get a client for the respective Redis instance.

packetbeat

TODO https://github.com/elastic/beats/tree/master/packetbeat

Profiling the application

TODO e.g. rbspy, will be partially covered by https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/6940

Failover and Recovery procedures

Accessing the Redis console

Be extremely careful with Redis! There are commands such as KEYS or MONITOR that can lock Redis entirely without any warning. The application relies heavily on cache so locking Redis will result in an immediate downtime.

Redis admin password is stored in the omnibus cookbook secrets in GKMS, and it’s deployed to gitlab config file: /etc/gitlab/gitlab.rb (this file then gets translated into multiple other config files, including redis.conf)

interactive:

REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli

or oneliners:

REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli slowlog get 10

Building a new Redis server and starting replication

NOTE: These instructions are for setting up Redis Sentinel: https://redis.io/topics/sentinel . NOT for setting up Redis Cluster: https://redis.io/topics/cluster-tutorial

From time to time you may have to build (or rebuild) a redis cluster. While the omnibus documentation (https://docs.gitlab.com/ee/administration/high_availability/redis.html) says everything should start replicating by magic, it doesn’t in our builds because we touch /etc/gitlab/skip-autoreconfigure on redis nodes, so that restarts during upgrades can be done in a more controlled fashion across multiple nodes.

So, after building the nodes, there are some manual steps to take:

On all nodes, sudo gitlab-ctl reconfigure
- This will reconfigure/start up redis, but not sentinel
On all nodes, sudo gitlab-ctl start sentinel
- Not sure why, but it’s minor
On the replicas, start replicating from the master:
1. REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\” -f2)
2. /opt/gitlab/embedded/bin/redis-cli
3. 127.0.0.1:6379> slaveof MASTER_IP 6379
4. 127.0.0.1:6379> info replication

You’re now expecting the replica to report something like:

role:slave
master_host:MASTER_IP
master_port:6379

If you run info replication on the master, you expect to see role:master and connected_slaves:2

Discussion

Sentinel is supposed to control the replication configuration in redis.conf (the ‘slaveof’ configuration line); therefore, when omnibus creates redis.conf it really shouldn’t add that configuration line, otherwise it and sentinel would end up fighting. So new redis nodes created with omnibus installed will all think they’re master, until they’re told otherwise. We do this above, and at that point, sentinel (connected to the master) becomes aware of the replicas, and starts managing their replication status.

It’s a little chicken-and-egg, and humans need to be involved. It should, however, be one-off at cluster build time.

Ban an IP with Rails Rack Attack (which uses redis)

see: https://gitlab.com/gitlab-com/runbooks/blob/master/docs/redis/ban-an-IP-with-redis.md

Replication issues

Possible checks

client-output-buffer-limit

check Redis docs for more information: https://raw.githubusercontent.com/antirez/redis/5.0/redis.conf

> config get client-output-buffer-limit

Checks Using Prometheus

Redis Primaries

List the Redis primaries using:

redis_master_repl_offset > 0

Redis Secondaries

List the Redis secondaries using:

redis_slave_info

Redis Replication Lag

Replication lag indicates that the Redis secondaries are struggling to keep up with the changes on the primary. This may be due to the rate of changes on the primary being too high, or the secondaries being under too much load to keep up.

Replication lag is measured in bytes in the replication stream.

https://dashboards.gitlab.net/dashboard/db/andrew-redis?panelId=13&fullscreen&orgId=1

Redis Replication Events

Check the Redis Replication Events dashboard to see if Redis is frequently failing over. This may indicate replication issues. https://dashboards.gitlab.net/dashboard/db/andrew-redis?panelId=14&fullscreen&orgId=1
Master switch events are logged in the redis log, for example:

$ zcat /var/log/gitlab/redis/@400000005e58927932f8744c.s | grep -i master
2020-02-27_11:35:39.68552 26796:M 27 Feb 2020 11:35:39.685 * MASTER MODE enabled (user request from 'id=267 addr=10.224.8.122:51379 fd=17 name= age=58518 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=140 qbuf-free=32628 obl=36 oll=0 omem=0 events=r cmd=exec')

Redis Sentinel

NOTE: At the moment of writing, Redis Cluster is not used anywhere in the gitlab.com infrastructure, we only utilize Redis Sentinel.

Redis Sentinel provides compatible clients with a pointer to the current Redis primary. Clients will query Sentinel and then connect directly to the primary Redis (in other words, Sentinel does not proxy requests).

Additionally, Sentinel will reconfigure Redis instances as primary or secondaries, depending on the Sentinel clusters quorum.

For more information see Sentinel documentation

Sentinel is configured via gitlab.rb:

$ sudo grep redis_sentinels /etc/gitlab/gitlab.rb
gitlab_rails['redis_sentinels'] = [{"host"=>"10.66.2.101", "port"=>26379}, {"host"=>"10.66.2.102", "port"=>26379}, {"host"=>"10.66.2.103", "port"=>26379}]

which gets translated into /var/opt/gitlab/sentinel/sentinel.conf.

Get Redis master

Once you have the IP of a sentinel, use redis-cli to access sentinel. Sentinel usually runs on port 26379 (ie, Redis port (6379) + 20000). The sentinel masters command will return a list of Redis primaries managed by this sentinel cluster:

$ /opt/gitlab/embedded/bin/redis-cli -h 10.66.2.101 -p 26379 sentinel masters
6379 sentinel masters
1)  1) "name"
    2) "gitlab-redis"
    3) "ip"
    4) "10.66.2.103"
    5) "port"
    6) "6379"
    7) "runid"
    8) "6f24caa796eb53afcf3b6a883ca02037892c812e"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "125"
   19) "last-ping-reply"
   20) "125"
   21) "down-after-milliseconds"
   22) "10000"
   23) "info-refresh"
   24) "2505"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "1540240114"
   29) "config-epoch"
   30) "208"
   31) "num-slaves"
   32) "2"
   33) "num-other-sentinels"
   34) "2"
   35) "quorum"
   36) "2"
   37) "failover-timeout"
   38) "60000"
   39) "parallel-syncs"
   40) "1"

A few important details to keep an eye on:

name: the name of the Redis primary/secondaries set. Remember a single Sentinel cluster can manage multiple Redis sets.
ip: the IP of the primary
port: the port of the primary
flags: {+ master +} is good. {- odown -} (Objectively down, the quorum is in agreement about the host being down) and {- sdown -} (Subjectively down, the quorum is in disagreement about the host being down) are bad.
num-other-sentinels: this should be {+ 3 +} for our Sentinel topology. If this number is different, there may be problems with Sentinel.
quorum: this should be {+ 2 +} for our Sentinel topology.

Get Redis slaves

You can also query the list of slaves connected to a sentinel primary using sentinel slaves <primary-name>:

$  /opt/gitlab/embedded/bin/redis-cli -h 10.66.2.102 -p 26379 sentinel slaves gitlab-redis
1)  1) "name"
    2) "10.66.2.102:6379"
    3) "ip"
    4) "10.66.2.102"
    5) "port"
    6) "6379"
    7) "runid"
    8) "664393f67a6c1b5a130c3af52f05429e5d923558"
    9) "flags"
   10) "slave"
   ...

Get Sentinel machines

Redis console

Replication status

> info replication
# Replication
role:master
connected_slaves:4
slave0:ip=10.45.2.8,port=6379,state=online,offset=208856216927,lag=0
slave1:ip=10.45.2.7,port=6379,state=online,offset=208856050552,lag=1
slave2:ip=10.45.2.9,port=6379,state=online,offset=208856088958,lag=1
master_repl_offset:208856228130
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:208855179555
repl_backlog_histlen:1048576

In this case we are missing slave3 since we have 4 slaves.

Master/slave role of the redis node

> role
1) "master"
2) (integer) 7657965683
3) 1) 1) "10.224.8.102"
      2) "6379"
      3) "7657965683"
   2) 1) "10.224.8.101"
      2) "6379"
      3) "7657965519"

Resolution

Just wait, every slave should automatically restart it’s replication when it drops out
If it takes longer then expected check /var/log/gitlab/redis/current on the mailfunctioning slave for any indications why it won’t restart replication

Helpful Resources

Switch Master manually

How to manually switch primaries

NOTE: This should have no visible negative impact on the GitLab application.

NOTE: There is no authentication required for interacting with Sentinel.

Get current Redis master. On one of the nodes running the redis sentinel (varies by cluster; redis-cache has its own set of sentinel servers, and all the rest run sentinel on the main redis nodes; and this may change in future):

$ /opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL masters
1)  1) "name"
    2) "gstg-redis-cache"        # cluster_id
    3) "ip"
    4) "10.224.8.103"            # ip address of the current master
    5) "port"
    6) "6379"
    7) "runid"
    8) "06277f7abca059c268b2c5e2b2581d7d3bf330f1"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "440"
   19) "last-ping-reply"
   20) "440"
   21) "down-after-milliseconds"
   22) "10000"
   23) "info-refresh"
   24) "9021"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "956691745"
   29) "config-epoch"
   30) "51"
   31) "num-slaves"
   32) "2"
   33) "num-other-sentinels"
   34) "2"
   35) "quorum"
   36) "2"
   37) "failover-timeout"
   38) "60000"
   39) "parallel-syncs"
   40) "1"

Failover the master to one of the replicas:

/opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL failover CLUSTER_NAME

CLUSTER_NAME is one of gprd-redis (main persistent cluster), gprd-redis-cache (primary transient cache), gprd-redis-sidekiq (sidekiq specific persistent cluster), gprd-redis-tracechunks (CI build tracechunks persistent cluster), gprd-redis-ratelimiting (RackAttack/App Rate limiting cluster), or gprd-sessions (Web sessions)

Replication flapping

Possible causes

A redis failover causes the slaves to sync from the master, that might be constrained by the client-output-buffer-limit.
If Redis is frequently failing over, it may be worth checking the Redis Sentinel logs (/var/log/gitlab/sentinel/current).
Possible causes include:
- Host network connectivity
- Redis is being killed by the OOMKiller
- A very high latency command (for example keys * or debug sleep 60) is preventing Redis from processing commands
- Redis is unable to write the RDB snapshot, leading to the instance becoming read-only (check /opt/gitlab/embedded/bin/redis-cli config get dir, df -h /var/opt/gitlab/redis for space)

Possible fixes

Temporarily disable the client-output-buffer-limit on the new master.

REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2)
/opt/gitlab/embedded/bin/redis-cli config set client-output-buffer-limit "slave 0 0 0"

Once the cluster is stable again, revert the change by setting the value, to the value from the configuration file. (/var/opt/gitlab/redis/redis.conf) You’ll need to convert any non-bytes number into bytes to apply it on the console (i.e. 4gb = 410241024*1024 = 4294967296)

Thus for a line in the config like this

client-output-buffer-limit slave 4gb 4gb 0

You need to execute this:

REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2)
/opt/gitlab/embedded/bin/redis-cli config set client-output-buffer-limit "slave 4294967296 4294967296 0"

Redis is down

Start Redis

gitlab-ctl start redis

Failed to collect Redis metrics

Symptoms

You see alerts like FailedToCollectRedisMetrics.
Redis metrics are unavailable

Possible checks

Solution

If everything looks ok, it might be that the instance made a full resync from master. During that time the redis_exporter fails to collect metrics from redis. Check /var/log/gitlab/redis/current for MASTER <-> SLAVE sync events during the time of the alert.

If either of the redis or sentinel services is down, restart it with

gitlab-ctl restart redis

gitlab-ctl restart sentinel.

Else check for possible issues in /var/log/gitlab/redis/current (e.g. resync from master) and see [redis_replication.md].

Miscellaneous

BigKeys analysis

Per https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/360 there may be a script that runs periodically (hourly by default) on a redis replica, to collect ‘bigkeys’ output and store it for later analysis.

The bigkeys are stored in a GCS bucket named gitlab-gprd-redis-analysis under the gitlab-production project.

The frequency can be controlled with the chef attribute redis_analysis.bigkeys.timer_on_calendar, being a systemd time spec. You probably do not want to run it more than once an hour (it’s intended for broad-brush data collection, not fine-grained), although other than considering how long it takes to run and avoiding overlap there’s not actual constraint on that.

If it needs to be stopped for some reason (it is running badly, is causing undue load, or other unexpected effects) it can be

Stopped if currently running, with `sudo systemctl stop redis-bigkeys-extract.service’
Prevented from running again (until chef next runs) with sudo systemctl stop redis-bigkeys-extract.timer
Turned off by chef by setting the attribute ‘redis_analysis.bigkeys.timer_enabled` to false, e.g. in a role

Troubleshooting

Is redis running?

Grafana

Prometheus/Thanos

directly on a redis host

How to get redis stats?

Grafana

Prometheus/Thanos

redis-cli

What is causing Redis to slowdown?

Prometheus/Thanos/Grafana

slowlog

Monitoring number of slowlog entries

Monitoring the rate of change in the slowlog

Redis latency monitoring framework

LATENCY DOCTOR

generic debugging/troubleshooting tools

gdb

redis-cli sub-commands

—latency

—latency-history

—latency-dist

—bigkeys (find biggest keys)

—scan (get a list of keys matching a pattern)

Get the number of connections per Redis client IP

redis-memory-analyzer

Ruby script to sample Redis keys

analyze TTL of keys on Redis

analyze memory usage on redis

Keyspace pattern analysis

Analyze network traffic on a Redis host

Capture traffic and download it to your local machine

Split the packet capture using tcpflow

Analyze Redis traffic

count redis commands

keyspace analysis

key size estimation

Please remember to delete the pcap file immediately after performing the analysis

CPU profiling

Accessing Redis via the Rails console

packetbeat

Profiling the application

Failover and Recovery procedures

Accessing the Redis console

Building a new Redis server and starting replication

Discussion

Ban an IP with Rails Rack Attack (which uses redis)

Replication issues

Possible checks

client-output-buffer-limit

Checks Using Prometheus

Redis Primaries

Redis Secondaries

Redis Replication Lag

Redis Replication Events

Redis Sentinel

Get Redis master

Get Redis slaves

Get Sentinel machines

Redis console

Replication status

Master/slave role of the redis node

Resolution

Helpful Resources

Switch Master manually

How to manually switch primaries

Replication flapping

Possible causes

Possible fixes

Redis is down

Start Redis

Failed to collect Redis metrics

Symptoms

Possible checks

Solution

Miscellaneous

BigKeys analysis

References

Please remember to delete the `pcap` file immediately after performing the analysis