Troubleshooting
Is redis running?
Grafana
Grafana dashboards:
For example, there is a Grafana chart showing the number of slowlog events in redis-sidekiq (not linking it here because the panel ID changes when Grafana dashboards are deployed).
Prometheus/Thanos
https://thanos.gitlab.net/graph?g0.range_input=1w&g0.expr=redis_up%20%3C%201&g0.tab=0
directly on a redis host
- is redis up? gitlab-ctl status
- can we dial redis? telnet localhost 6379
- can we talk to redis via redis-cli?
REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli info
How to get redis stats?
Grafana
Prometheus/Thanos
Redis hosts are running the redis_exporter. It is scraped by Prometheus. See the exporter documentation for more details.
Example redis_exporter metric in Thanos.
redis-cli
Run:
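A reasonable starting point, using the same auth pattern as the rest of this runbook (info with no arguments dumps every section; individual sections such as stats, clients, or memory can be requested instead):
REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli info stats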
NOTE: DO NOT USE THE MONITOR COMMAND! It streams back every command processed by the Redis server, which can double resource usage. It can overload a production machine, making it unresponsive or causing an OOM kill.
NOTE: DO NOT USE THE KEYS COMMAND! It will overload a production machine.
What is causing Redis to slow down?
The application relies on Redis throughput being very high; latency spikes can be detrimental to the operation of the entire application.
Prometheus/Thanos/Grafana
Explore Prometheus/Thanos/Grafana. Historical metrics might suggest a sudden change in the application behavior or traffic, for example:
- operations rate, 7d: https://dashboards.gitlab.net/d/redis-cache-main/redis-cache-overview?orgId=1&from=now-7d&to=now&fullscreen&panelId=54
- keys rate of change, 7d: https://dashboards.gitlab.net/d/redis-cache-main/redis-cache-overview?orgId=1&from=now-7d&to=now&fullscreen&panelId=64
For Grafana links see above.
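If you prefer raw PromQL in Thanos over the dashboards, a query along these lines shows the per-instance command rate (redis_commands_processed_total is the standard redis_exporter counter; scope it with whatever environment/type labels we use):
sum by (instance) (rate(redis_commands_processed_total[5m]))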
slowlog
The slowlog records slow Redis queries.
See the Redis SLOWLOG command documentation.
Get the top 10 Redis slowlog entries:
> slowlog get 10
1) 1) (integer) 5100         # A unique progressive identifier for every slow log entry.
   2) (integer) 1561019091   # The unix timestamp at which the logged command was processed.
   3) (integer) 21390        # The amount of time needed for its execution, in microseconds.
   4) 1) "del"               # The array composing the arguments of the command.
      2) "cache:gitlab:242234:8213877:Ci::CompareTestReportsService"
To convert the timestamp, use date -d @1561019091.
Get the command execution time threshold at which commands are logged (in microseconds):
> config get slowlog-log-slower-than
Get the size of the slowlog (entries are discarded like in a FIFO queue):
> config get slowlog-max-len
Monitoring number of slowlog entries
The number of entries added to the slowlog is exposed as a Prometheus metric and there is a Grafana chart for it.
Monitoring the rate of change in the slowlog
A useful metric for monitoring potential slow-downs in Redis is measuring the rate of change in redis_slowlog_last_id.
This can be done by plotting changes(redis_slowlog_last_id[1h]): https://prometheus.gprd.gitlab.net/graph?g0.range_input=1d&g0.expr=changes(redis_slowlog_last_id%5B1h%5D)&g0.tab=0
Redis latency monitoring framework
Redis provides a latency diagnostic tool: https://redis.io/topics/latency-monitor
You may need to enable it with CONFIG SET latency-monitor-threshold 100.
From https://redis.io/topics/latency-monitor :
By default monitoring is disabled (threshold set to 0), even if the actual cost of latency monitoring is near zero. However while the memory requirements of latency monitoring are very small, there is no good reason to raise the baseline memory usage of a Redis instance that is working well.
LATENCY DOCTOR
> CONFIG SET latency-monitor-threshold 100
> LATENCY DOCTOR
Dave, I have observed latency spikes in this Redis instance.
You don't mind talking about it, do you Dave?
1. command: 5 latency spikes (average 300ms, mean deviation 120ms, period 73.40 sec). Worst all time event 500ms.
I have a few advices for you:
- Your current Slow Log configuration only logs events that are slower than your configured latency monitor threshold. Please use 'CONFIG SET slowlog-log-slower-than 1000'.
- Check your Slow Log to understand what are the commands you are running which are too slow to execute. Please check http://redis.io/commands/slowlog for more information.
- Deleting, expiring or evicting (because of maxmemory policy) large objects is a blocking operation. If you have very large objects that are often deleted, expired, or evicted, try to fragment those objects into multiple smaller objects.
> CONFIG SET latency-monitor-threshold 0
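Other commands from the latency framework can help narrow things down (standard Redis commands, listed here as a reminder):
> LATENCY LATEST             # latest and all-time maximum latency per event
> LATENCY HISTORY command    # time series of spikes for the 'command' event
> LATENCY RESET              # discard the collected samples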
generic debugging/troubleshooting tools
https://redis.io/topics/debugging
redis-cli sub-commands
--latency
--latency-history
redis-cli has a useful command-line argument --latency-history that issues PING commands to a Redis server to measure its responsiveness. For example:
$ /opt/gitlab/embedded/bin/redis-cli --latency-history -h 10.217.5.102
min: 0, max: 67, avg: 8.65 (799 samples) -- 15.00 seconds range
min: 0, max: 62, avg: 9.03 (783 samples) -- 15.01 seconds range
min: 0, max: 50, avg: 8.53 (802 samples) -- 15.00 seconds range
min: 0, max: 61, avg: 7.96 (830 samples) -- 15.02 seconds range
min: 0, max: 110, avg: 7.32 (860 samples) -- 15.01 seconds range
min: 0, max: 30, avg: 2.28 (1206 samples) -- 15.00 seconds range
min: 0, max: 82, avg: 5.39 (966 samples) -- 15.01 seconds range
min: 0, max: 108, avg: 19.62 (504 samples) -- 15.00 seconds range
min: 0, max: 57, avg: 13.87 (625 samples) -- 15.01 seconds range
min: 0, max: 57, avg: 7.82 (836 samples) -- 15.03 seconds range
min: 0, max: 45, avg: 5.28 (972 samples) -- 15.00 seconds range
This test will run indefinitely until you kill it, but the avg time here is important. The first line shows that on average, a single Redis command took 8 ms to respond, which is too slow! A healthy looking run returns averages well under a millisecond:
$ /opt/gitlab/embedded/bin/redis-cli --latency-history -h 10.217.5.101
min: 0, max: 1, avg: 0.10 (1472 samples) -- 15.01 seconds range
min: 0, max: 1, avg: 0.10 (1470 samples) -- 15.00 seconds range
min: 0, max: 2, avg: 0.10 (1470 samples) -- 15.00 seconds range
min: 0, max: 2, avg: 0.11 (1470 samples) -- 15.01 seconds range
min: 0, max: 2, avg: 0.11 (1471 samples) -- 15.01 seconds range
There may be a number of causes for the latency:
- Number of client connections: check the number of active TCP connections on the Redis host.
- Slow background saves
- Key evictions
See https://tech.trivago.com/2017/01/25/learn-redis-the-hard-way-in-production/ for more information.
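A quick way to check each of these on the host (standard INFO fields, using the auth pattern from elsewhere in this runbook):
export REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2)
/opt/gitlab/embedded/bin/redis-cli info clients | grep connected_clients       # client connections
/opt/gitlab/embedded/bin/redis-cli info persistence | grep rdb_last_bgsave     # background saves
/opt/gitlab/embedded/bin/redis-cli info stats | grep evicted_keys              # key evictions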
--latency-dist
--bigkeys (find biggest keys)
(uses the SCAN command)
--scan (get a list of keys matching a pattern)
redis-cli -a $REDIS_PASS --scan --pattern "resque:*"
Get the number of connections per Redis client IP
On a redis host:
sudo lsof -i tcp:6379 | grep ESTABLISHED | sed -E "s/.*6379->(.*):.* \(ESTABLISHED\)/\1/g" | sort | uniq -c | sort -nr
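An alternative that uses the CLIENT LIST command instead of lsof (the awk/cut parsing assumes the usual id=... addr=IP:PORT field layout):
REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli client list | awk '{print $2}' | cut -d= -f2 | cut -d: -f1 | sort | uniq -c | sort -nr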
redis-memory-analyzer
https://github.com/gamenet/redis-memory-analyzer
Ruby script to sample Redis keys
Here's a quick and dirty script to sample Redis keys, with a REDIS_URL defined:
require 'redis'
require 'uri'

if ENV['REDIS_URL'].nil? || ENV['REDIS_URL'].empty?
  abort "Error: REDIS_URL environment variable is not set.\nUsage: export REDIS_URL='redis://localhost:6379/0'"
end

redis = Redis.new(url: ENV['REDIS_URL'])

# Hash to store grouped keys
key_groups = Hash.new(0)

# Sample counter
sampled_count = 0
max_samples = 1_000_000

puts "Sampling up to #{max_samples} Redis keys..."
puts "Connected to: #{ENV['REDIS_URL']}"
puts

# Use SCAN to iterate through keys efficiently
cursor = "0"
loop do
  # SCAN returns [next_cursor, keys_array]
  cursor, keys = redis.scan(cursor, count: 1000)

  keys.each do |key|
    break if sampled_count >= max_samples

    # Extract first 3 colon-separated parts
    parts = key.split(':', 4) # Split into max 4 parts (we only need first 3)
    prefix = parts[0..2].join(':')

    key_groups[prefix] += 1
    sampled_count += 1
  end

  break if sampled_count >= max_samples || cursor == "0"
end

puts "Sampled #{sampled_count} keys"
puts "\nKey groups (sorted by count):\n\n"

# Sort by count (descending) and display
key_groups.sort_by { |_prefix, count| -count }.each do |prefix, count|
  percentage = (count.to_f / sampled_count * 100).round(2)
  puts "#{prefix.ljust(50)} #{count.to_s.rjust(10)} (#{percentage}%)"
end

puts "\nTotal unique prefixes: #{key_groups.size}"

redis.close
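To run it, point REDIS_URL at the instance you want to sample and invoke the script (the filename here is just an example):
export REDIS_URL='redis://:PASSWORD@127.0.0.1:6379/0'
ruby sample_redis_keys.rb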
analyze TTL of keys on Redis
If you need to analyze the TTL of keys for a specific pattern, you can use this script:
require 'redis'

if ENV['REDIS_URL'].nil? || ENV['REDIS_URL'].empty?
  abort "Error: REDIS_URL environment variable is not set.\nUsage: export REDIS_URL='redis://localhost:6379/0'"
end

# Get key pattern from command line argument or use default
if ARGV.empty?
  puts "Usage: ruby script.rb <key_pattern> [max_samples]"
  puts "Example: ruby script.rb 'session:gitlab:2:*' 10000"
  puts "\nNo pattern provided. Using default pattern."
  key_pattern = '*'
else
  key_pattern = ARGV[0]
end

# Get max samples from second argument or use default
max_samples = ARGV[1] ? ARGV[1].to_i : 10_000

# Connect to Redis
redis = Redis.new(url: ENV['REDIS_URL'])

# Configuration
puts "Sampling TTLs for keys matching: #{key_pattern}"
puts "Connected to: #{ENV['REDIS_URL']}"
puts "Max samples: #{max_samples}"
puts

# Store TTL values
ttls = []
no_expiry_count = 0
not_found_count = 0
sampled_count = 0

# Use SCAN with MATCH to find matching keys
cursor = "0"
loop do
  cursor, keys = redis.scan(cursor, match: key_pattern, count: 1000)

  keys.each do |key|
    break if sampled_count >= max_samples

    ttl = redis.ttl(key)
    case ttl
    when -1
      no_expiry_count += 1
    when -2
      not_found_count += 1
    else
      ttls << ttl
    end

    sampled_count += 1
  end

  break if sampled_count >= max_samples || cursor == "0"
end

puts "Sampled #{sampled_count} keys\n"

if ttls.empty?
  puts "No keys with TTL found."
else
  # Calculate statistics
  sorted_ttls = ttls.sort
  min_ttl = sorted_ttls.first
  max_ttl = sorted_ttls.last
  avg_ttl = ttls.sum.to_f / ttls.size
  median_ttl = sorted_ttls[ttls.size / 2]

  puts "TTL Statistics (in seconds):"
  puts "-" * 60
  puts "Keys with TTL:       #{ttls.size}"
  puts "Keys without expiry: #{no_expiry_count}"
  puts "Keys not found:      #{not_found_count}"
  puts
  puts "Min TTL:     #{min_ttl}s (#{(min_ttl / 60.0).round(1)} minutes)"
  puts "Max TTL:     #{max_ttl}s (#{(max_ttl / 3600.0).round(1)} hours)"
  puts "Average TTL: #{avg_ttl.round(2)}s (#{(avg_ttl / 60.0).round(1)} minutes)"
  puts "Median TTL:  #{median_ttl}s (#{(median_ttl / 60.0).round(1)} minutes)"
  puts

  # TTL distribution buckets
  puts "TTL Distribution:"
  puts "-" * 60

  buckets = {
    "< 1 minute"    => 0,
    "1-5 minutes"   => 0,
    "5-30 minutes"  => 0,
    "30-60 minutes" => 0,
    "1-6 hours"     => 0,
    "6-24 hours"    => 0,
    "> 24 hours"    => 0
  }

  ttls.each do |ttl|
    case ttl
    when 0...60 then buckets["< 1 minute"] += 1
    when 60...300 then buckets["1-5 minutes"] += 1
    when 300...1800 then buckets["5-30 minutes"] += 1
    when 1800...3600 then buckets["30-60 minutes"] += 1
    when 3600...21600 then buckets["1-6 hours"] += 1
    when 21600...86400 then buckets["6-24 hours"] += 1
    else buckets["> 24 hours"] += 1
    end
  end

  buckets.each do |range, count|
    percentage = (count.to_f / ttls.size * 100).round(2)
    bar = "█" * (percentage / 2).to_i
    puts "#{range.ljust(20)} #{count.to_s.rjust(8)} (#{percentage.to_s.rjust(6)}%) #{bar}"
  end
end

redis.close
analyze memory usage on redis
$ rdb -c memory dump.rdb | ruby redis-analysis-tool.rb
$ cat redis-analysis-tool.rb
count = 0
sizes = {}

ARGF.each do |line|
  # Match lines like "0,string,<key>,<size>,..." from `rdb -c memory` output;
  # capture the key prefix (up to the first colon) and its size in bytes.
  next unless line =~ /^\d+,string,(\w+):.*?,(\d+)/

  sizes[$1] = (sizes[$1] || 0) + $2.to_i
  count += 1

  # Periodically flush the accumulated per-prefix totals to keep memory bounded
  if (sizes.keys.size > 10000) || (count % 100000 == 0)
    sizes.each do |key, size|
      puts "#{key}:#{size}"
    end
    sizes = {}
  end
end

# Flush whatever is left at the end
sizes.each do |key, size|
  puts "#{key}:#{size}"
end
Keyspace pattern analysis
https://gitlab.com/gitlab-com/gl-infra/redis-keyspace-analyzer
This tool can perform an offline analysis of patterns in the keyspace. Its input is a dump of keys and key sizes, generated with a tool you can find in the repo. With human supervision it can detect key patterns, show their frequency, and show aggregate memory use per pattern.
Analyze network traffic on a Redis host
This guide describes a technique that will not have a major performance impact on a Redis host. It consists of the following:
- Capture Redis traffic using tcpdump.
- Split the packet capture into separate flows using tcpflow.
- Run a custom script to aggregate the results.
Capture traffic and download it to your local machine
On the master Redis server, capture TCP packets and compress them with the following commands:
$ df -Th /var/log # confirm there's enough disk space
$ sudo mkdir -p /var/log/pcap-$USER
$ cd /var/log/pcap-$USER
$ sudo chown $USER:$USER .
$ sudo tcpdump -G 30 -W 1 -s 65535 tcp port 6379 -w redis.pcap -i ens4
tcpdump: listening on ens4, link-type EN10MB (Ethernet), capture size 65535 bytes
676 packets captured
718 packets received by filter
0 packets dropped by kernel
It may be cheaper to capture only incoming traffic:
sudo tcpdump -G 30 -W 1 -s 65535 tcp dst port 6379 -w redis.pcap -i ens4
Compression:
gzip redis.pcap
now download the capture with:
scp redis-cache-01-db-gstg.c.gitlab-staging-1.internal:redis.pcap.gz .
remember to remove the pcap file once you’re done!
Split the packet capture using tcpflow
- install tcpflow (on MacOS: brew install tcpflow)
- split the packet capture into separate tcpflows:
tcpflow -I -s -o redis-analysis -r redis.pcap.gz
cd redis-analysis
Analyze Redis traffic
count redis commands
Get the number of commands sent to redis:
$ find . -name '*.06379'|xargs -n 1 perl -0777 -pe 's/\*\d+\r\n\$\d+\r\n(\w+)\r\n\$\d+\r\n([\w\d:]+)/command: $1 $2/gsx;'|grep -a '^command'|grep -v "command: auth "|sort|uniq -c|sort -nr > ./script_report
$ less ./script_report
70334 command: setex peek:requests:
69205 command: get cache:gitlab:geo:current_node:12.0.0-pre:5.1.7
69178 command: get cache:gitlab:geo:node_enabled:12.0.0-pre:5.1.7
65642 command: get cache:gitlab:flipper/v1/feature/enforced_sso_requires_session
(...)
keyspace analysis
The redis trace script parses out flows into a timeline of commands, one line per key. The fields are: timestamp, second offset, command, src host, key pattern, key.
It has some pre-canned key pattern extractions that can be enabled via GITLAB_REDIS_CLUSTER. Supported values are: persistent, cache.
The script can be tweaked or its output further processed with awk and friends.
find redis-analysis -name '*.06379.findx' | GITLAB_REDIS_CLUSTER=cache parallel -j0 -n100 ruby runbooks/scripts/redis_trace_cmd.rb | sed '/^$/d' > trace.txt
gsort --parallel=8 trace.txt -o trace.txt
For example, count per key pattern:
cat trace.txt | awk '{ print $5 }' | sort -n | uniq -c | sort -nr
It is also possible to output in JSON format for processing via jq:
find redis-analysis -name '*.06379.findx' | GITLAB_REDIS_CLUSTER=cache OUTPUT_FORMAT=json parallel -j0 -n100 ruby runbooks/scripts/redis_trace_cmd.rb | sed '/^$/d' > trace.json
This allows for a similar count per command and key pattern:
cat trace.json | jq -c '[.cmd, .patterns]' | sort | uniq -c | sort -rn | head
key size estimation
The following commands are meant to be run on a replica instance, for example redis-cache-01-db-gprd.
In this example, we're filtering the dump output for Class:merge_requests; replace this with your key name.
$ sudo gitlab-redis-cli bgsave
# Monitor the file on disk, once it stops increasing in size, it's ready to be used!
$ sudo ls -lta /var/opt/gitlab/redis/dump.rdb
# Once the file is ready, move it to a safer location, for example
$ sudo mv /var/opt/gitlab/redis/dump.rdb /var/log/redis-data/
$ RDB_FILE_PATH=/var/log/redis-data
# build the `dump` binary on your local machine (https://github.com/igorwwwwwwwwwwwwwwwwwwww/rdb/tree/version-9)
### $ git clone https://github.com/igorwwwwwwwwwwwwwwwwwwww/rdb
### $ cd rdb
### $ git checkout version-9
### $ GOOS=linux GOARCH=amd64 go build ./cmd/dump
### $ scp dump redis-cache-01-db-gprd:
# Now we'll use the `dump` binary to analyze `dump.rdb`
$ sudo ./dump $RDB_FILE_PATH/dump.rdb | awk -F'\t' '$1 ~ /Class:merge_requests/ { sum1 += $3; sum2 += $4 } END { print sum1, sum2 }'
# The two values you get from this represent estimates in bytes used for values and keys+values respectively.
6039116565 6549803309
# Convert to GiB
$ echo $((6039116565.0/(1024.0**3.0)))
5.6243655877187848
The values presented are an optimistic estimate, as redis will require some more memory for its data structures. Generally, the key size will be on that order of magnitude.
The current maxmemory in Redis-cache is set to 60 GiB. Depending on the numbers you get, the ratio of each compared to the maxmemory can give you an idea of how significant an impact your change might introduce.
Please remember to delete the RDB file once you’re done!
rm $RDB_FILE_PATH/dump.rdb
Please remember to delete the pcap file immediately after performing the analysis
CPU profiling
CPU profiles are useful for diagnosing CPU saturation. Especially since redis is (mostly) single-threaded, CPU can become a bottleneck.
A profile can be captured via perf:
sudo mkdir -p /var/log/perf-$USER
cd /var/log/perf-$USER
sudo chown $USER:$USER .
sudo perf record -p $(pidof redis-server) -F 497 --call-graph dwarf --no-inherit -- sleep 300
sudo perf script --header | gzip > stacks.$(hostname).$(date --iso-8601=seconds).gz
sudo rm perf.data
This will sample stacks at ~500hz.
Those stack traces can then be downloaded and analyzed with flamescope or flamegraph.
scp $host:/var/log/perf-\*/stacks.\*.gz .
cat stacks.$host.$time.gz | gunzip - | ~/code/FlameGraph/stackcollapse-perf.pl | ~/code/FlameGraph/flamegraph.pl > flamegraph.svg
Accessing Redis via the Rails console
Sometimes you may wish to query a production Redis server from a Rails console, either because you don't have sufficient access to run redis-cli, or because you are running a query that is more easily expressed in Ruby than with redis-cli.
You probably want to use a Redis secondary to do this. This is how you instantiate a Ruby Redis client for a secondary:
redis = Redis.new(Gitlab::Redis::SharedState.params.merge(role: :slave))
Substitute Cache, Queues, TraceChunks, RateLimiting, or Sessions for SharedState to get a client for the respective Redis instance.
packetbeat
TODO https://github.com/elastic/beats/tree/master/packetbeat
Profiling the application
TODO e.g. rbspy, will be partially covered by https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/6940
Failover and Recovery procedures
Accessing the Redis console
Be extremely careful with Redis! There are commands such as KEYS or MONITOR that can lock Redis entirely without any warning. The application relies heavily on cache so locking Redis will result in an immediate downtime.
The Redis admin password is stored in the omnibus cookbook secrets in GKMS, and it's deployed to the gitlab config file /etc/gitlab/gitlab.rb (this file then gets translated into multiple other config files, including redis.conf).
interactive:
REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli
or one-liners:
REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli slowlog get 10
Building a new Redis server and starting replication
NOTE: These instructions are for setting up Redis Sentinel (https://redis.io/topics/sentinel), NOT for setting up Redis Cluster (https://redis.io/topics/cluster-tutorial).
From time to time you may have to build (or rebuild) a redis cluster. While the omnibus documentation (https://docs.gitlab.com/ee/administration/high_availability/redis.html) says everything should start replicating by magic, it doesn’t in our builds because we touch /etc/gitlab/skip-autoreconfigure on redis nodes, so that restarts during upgrades can be done in a more controlled fashion across multiple nodes.
So, after building the nodes, there are some manual steps to take:
- On all nodes: sudo gitlab-ctl reconfigure
  - This will reconfigure/start up redis, but not sentinel
- On all nodes: sudo gitlab-ctl start sentinel
  - Not sure why this extra step is needed, but it's minor
- On the replicas, start replicating from the master:
  - REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2)
  - /opt/gitlab/embedded/bin/redis-cli
  - 127.0.0.1:6379> slaveof MASTER_IP 6379
  - 127.0.0.1:6379> info replication
You’re now expecting the replica to report something like:
role:slave
master_host:MASTER_IP
master_port:6379
If you run info replication on the master, you expect to see role:master and connected_slaves:2.
Discussion
Sentinel is supposed to control the replication configuration in redis.conf (the 'slaveof' configuration line); therefore, when omnibus creates redis.conf it really shouldn't add that configuration line, otherwise it and sentinel would end up fighting. So new redis nodes created with omnibus installed will all think they're master, until they're told otherwise. We do this above, and at that point, sentinel (connected to the master) becomes aware of the replicas, and starts managing their replication status.
It’s a little chicken-and-egg, and humans need to be involved. It should, however, be one-off at cluster build time.
Ban an IP with Rails Rack Attack (which uses redis)
see: https://gitlab.com/gitlab-com/runbooks/blob/master/docs/redis/ban-an-IP-with-redis.md
Replication issues
Possible checks
client-output-buffer-limit
Check the Redis docs for more information: https://raw.githubusercontent.com/antirez/redis/5.0/redis.conf
> config get client-output-buffer-limit
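The reply is a flat list of class / hard-limit / soft-limit / soft-seconds values, something like the following (the numbers here are illustrative, not our production settings):
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 4294967296 4294967296 0 pubsub 33554432 8388608 60"
A replica is disconnected when its output buffer exceeds the hard limit, or stays above the soft limit for longer than soft-seconds.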
Checks Using Prometheus
Redis Primaries
List the Redis primaries using:
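Assuming the standard redis_exporter redis_instance_info metric (the exact label scoping is an assumption and may need adjusting for our environment labels):
redis_instance_info{role="master"}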
Redis Secondaries
List the Redis secondaries using:
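Under the same assumption about the redis_exporter metric:
redis_instance_info{role="slave"}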
Redis Replication Lag
Replication lag indicates that the Redis secondaries are struggling to keep up with the changes on the primary. This may be due to the rate of changes on the primary being too high, or the secondaries being under too much load to keep up.
Replication lag is measured in bytes in the replication stream.
https://dashboards.gitlab.net/dashboard/db/andrew-redis?panelId=13&fullscreen&orgId=1
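You can also eyeball the lag directly on the primary: info replication prints master_repl_offset alongside each replica's offset, and the difference between the two is the lag in bytes (same auth pattern as elsewhere in this runbook):
REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli info replication | grep -E 'master_repl_offset|slave[0-9]'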
Redis Replication Events
- Check the Redis Replication Events dashboard to see if Redis is frequently failing over. This may indicate replication issues. https://dashboards.gitlab.net/dashboard/db/andrew-redis?panelId=14&fullscreen&orgId=1
- Master switch events are logged in the redis log, for example:
$ zcat /var/log/gitlab/redis/@400000005e58927932f8744c.s | grep -i master
2020-02-27_11:35:39.68552 26796:M 27 Feb 2020 11:35:39.685 * MASTER MODE enabled (user request from 'id=267 addr=10.224.8.122:51379 fd=17 name= age=58518 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=140 qbuf-free=32628 obl=36 oll=0 omem=0 events=r cmd=exec')
Redis Sentinel
NOTE: At the moment of writing, Redis Cluster is not used anywhere in the gitlab.com infrastructure; we only utilize Redis Sentinel.
Redis Sentinel provides compatible clients with a pointer to the current Redis primary. Clients will query Sentinel and then connect directly to the primary Redis (in other words, Sentinel does not proxy requests).
Additionally, Sentinel will reconfigure Redis instances as primary or secondaries, depending on the Sentinel cluster's quorum.
For more information see the Sentinel documentation.
Sentinel is configured via gitlab.rb:
$ sudo grep redis_sentinels /etc/gitlab/gitlab.rb
gitlab_rails['redis_sentinels'] = [{"host"=>"10.66.2.101", "port"=>26379}, {"host"=>"10.66.2.102", "port"=>26379}, {"host"=>"10.66.2.103", "port"=>26379}]
which gets translated into /var/opt/gitlab/sentinel/sentinel.conf.
Get Redis master
Once you have the IP of a sentinel, use redis-cli to access sentinel. Sentinel usually runs on port 26379 (i.e., the Redis port 6379 + 20000). The sentinel masters command will return a list of Redis primaries managed by this sentinel cluster:
$ /opt/gitlab/embedded/bin/redis-cli -h 10.66.2.101 -p 26379 sentinel masters
1) 1) "name"
    2) "gitlab-redis"
    3) "ip"
    4) "10.66.2.103"
    5) "port"
    6) "6379"
    7) "runid"
    8) "6f24caa796eb53afcf3b6a883ca02037892c812e"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "125"
   19) "last-ping-reply"
   20) "125"
   21) "down-after-milliseconds"
   22) "10000"
   23) "info-refresh"
   24) "2505"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "1540240114"
   29) "config-epoch"
   30) "208"
   31) "num-slaves"
   32) "2"
   33) "num-other-sentinels"
   34) "2"
   35) "quorum"
   36) "2"
   37) "failover-timeout"
   38) "60000"
   39) "parallel-syncs"
   40) "1"
A few important details to keep an eye on:
- name: the name of the Redis primary/secondaries set. Remember a single Sentinel cluster can manage multiple Redis sets.
- ip: the IP of the primary
- port: the port of the primary
- flags: master is good. odown (Objectively down, the quorum agrees the host is down) and sdown (Subjectively down, an individual Sentinel believes the host is down but the quorum has not confirmed it) are bad.
- num-other-sentinels: this should be 3 for our Sentinel topology. If this number is different, there may be problems with Sentinel.
- quorum: this should be 2 for our Sentinel topology.
Get Redis slaves
You can also query the list of slaves connected to a sentinel primary using sentinel slaves <primary-name>:
$ /opt/gitlab/embedded/bin/redis-cli -h 10.66.2.102 -p 26379 sentinel slaves gitlab-redis
1) 1) "name"
    2) "10.66.2.102:6379"
    3) "ip"
    4) "10.66.2.102"
    5) "port"
    6) "6379"
    7) "runid"
    8) "664393f67a6c1b5a130c3af52f05429e5d923558"
    9) "flags"
   10) "slave"
   ...
Get Sentinel machines
Redis console
Replication status
> info replication
# Replication
role:master
connected_slaves:4
slave0:ip=10.45.2.8,port=6379,state=online,offset=208856216927,lag=0
slave1:ip=10.45.2.7,port=6379,state=online,offset=208856050552,lag=1
slave2:ip=10.45.2.9,port=6379,state=online,offset=208856088958,lag=1
master_repl_offset:208856228130
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:208855179555
repl_backlog_histlen:1048576
In this case we are missing slave3: connected_slaves reports 4, but only slave0 through slave2 are listed.
Master/slave role of the redis node
> role
1) "master"
2) (integer) 7657965683
3) 1) 1) "10.224.8.102"
      2) "6379"
      3) "7657965683"
   2) 1) "10.224.8.101"
      2) "6379"
      3) "7657965519"
Resolution
- Just wait; every slave should automatically restart its replication when it drops out
- If it takes longer than expected, check /var/log/gitlab/redis/current on the malfunctioning slave for any indications why it won't restart replication
Helpful Resources
Section titled “Helpful Resources”- https://redis.io/topics/replication
- https://redis.io/topics/sentinel
- https://redislabs.com/blog/top-redis-headaches-for-devops-replication-buffer/
- https://redislabs.com/blog/top-redis-headaches-for-devops-replication-timeouts/
- https://redislabs.com/blog/top-redis-headaches-for-devops-client-buffers/
Switch Master manually
How to manually switch primaries
NOTE: This should have no visible negative impact on the GitLab application.
NOTE: There is no authentication required for interacting with Sentinel.
- Get current Redis master. On one of the nodes running the redis sentinel (varies by cluster; redis-cache has its own set of sentinel servers, and all the rest run sentinel on the main redis nodes; and this may change in future):
$ /opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL masters
1) 1) "name"
    2) "gstg-redis-cache"   # cluster_id
    3) "ip"
    4) "10.224.8.103"       # ip address of the current master
    5) "port"
    6) "6379"
    7) "runid"
    8) "06277f7abca059c268b2c5e2b2581d7d3bf330f1"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "440"
   19) "last-ping-reply"
   20) "440"
   21) "down-after-milliseconds"
   22) "10000"
   23) "info-refresh"
   24) "9021"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "956691745"
   29) "config-epoch"
   30) "51"
   31) "num-slaves"
   32) "2"
   33) "num-other-sentinels"
   34) "2"
   35) "quorum"
   36) "2"
   37) "failover-timeout"
   38) "60000"
   39) "parallel-syncs"
   40) "1"
- Failover the master to one of the replicas:
/opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL failover CLUSTER_NAME
CLUSTER_NAME is one of gprd-redis (main persistent cluster), gprd-redis-cache (primary transient cache), gprd-redis-sidekiq (sidekiq specific persistent cluster), gprd-redis-tracechunks (CI build tracechunks persistent cluster), gprd-redis-ratelimiting (RackAttack/App rate limiting cluster), or gprd-sessions (Web sessions).
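To confirm the failover completed, ask Sentinel for the current master address again and check that it has moved to one of the former replicas (SENTINEL get-master-addr-by-name is a standard Sentinel command):
/opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL get-master-addr-by-name CLUSTER_NAME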
Replication flapping
Possible causes
- A redis failover causes the slaves to sync from the master, which might be constrained by the client-output-buffer-limit.
- If Redis is frequently failing over, it may be worth checking the Redis Sentinel logs (/var/log/gitlab/sentinel/current).
- Possible causes include:
  - Host network connectivity
  - Redis is being killed by the OOMKiller
  - A very high latency command (for example keys * or debug sleep 60) is preventing Redis from processing commands
  - Redis is unable to write the RDB snapshot, leading to the instance becoming read-only (check /opt/gitlab/embedded/bin/redis-cli config get dir and df -h /var/opt/gitlab/redis for space)
Possible fixes
Temporarily disable the client-output-buffer-limit on the new master.
REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli config set client-output-buffer-limit "slave 0 0 0"
Once the cluster is stable again, revert the change by setting the value back to the value from the configuration file (/var/opt/gitlab/redis/redis.conf).
You'll need to convert any non-bytes number into bytes to apply it on the console (i.e. 4gb = 4*1024*1024*1024 = 4294967296).
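A quick way to do the conversion on the host:
echo $((4*1024*1024*1024))   # 4294967296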
Thus, for a line in the config like this:
client-output-buffer-limit slave 4gb 4gb 0
You need to execute this:
REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli config set client-output-buffer-limit "slave 4294967296 4294967296 0"
Redis is down
Start Redis
gitlab-ctl start redis
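Then confirm the service is back and replication looks sane (standard checks using the same tooling as the rest of this page):
gitlab-ctl status redis
REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli info replication | head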
Failed to collect Redis metrics
Symptoms
- You see alerts like FailedToCollectRedisMetrics.
- Redis metrics are unavailable.
Possible checks
Solution
If everything looks ok, it might be that the instance made a full resync from master. During that time the redis_exporter fails to collect metrics from redis. Check /var/log/gitlab/redis/current for MASTER <-> SLAVE sync events during the time of the alert.
If either of the redis or sentinel services is down, restart it with gitlab-ctl restart redis or gitlab-ctl restart sentinel.
Otherwise, check for possible issues in /var/log/gitlab/redis/current (e.g. resync from master) and see redis_replication.md.
Miscellaneous
BigKeys analysis
Per https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/360 there may be a script that runs periodically (hourly by default) on a redis replica, to collect 'bigkeys' output and store it for later analysis.
The bigkeys output is stored in a GCS bucket named gitlab-gprd-redis-analysis under the gitlab-production project.
The frequency can be controlled with the chef attribute redis_analysis.bigkeys.timer_on_calendar, which is a systemd time spec. You probably do not want to run it more than once an hour (it's intended for broad-brush data collection, not fine-grained), although other than considering how long it takes to run and avoiding overlap there's no actual constraint on that.
If it needs to be stopped for some reason (it is running badly, is causing undue load, or other unexpected effects) it can be:
- Stopped if currently running, with sudo systemctl stop redis-bigkeys-extract.service
- Prevented from running again (until chef next runs) with sudo systemctl stop redis-bigkeys-extract.timer
- Turned off by chef by setting the attribute redis_analysis.bigkeys.timer_enabled to false, e.g. in a role
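To see whether the timer is enabled and when it last/next fired (plain systemd tooling, assuming the unit names above):
sudo systemctl list-timers 'redis-bigkeys-extract*'
sudo systemctl status redis-bigkeys-extract.timer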
References
- https://blog.octo.com/en/what-redis-deployment-do-you-need/
- https://lzone.de/cheat-sheet/Redis
- https://tech.trivago.com/2017/01/25/learn-redis-the-hard-way-in-production/
- https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/80
- https://gitlab.com/gitlab-com/gl-infra/scalability/issues/49
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/7199
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/9414