Redis cluster becomes unavailable when a replica is promoted to master

vcudknz3 posted on 2023-03-28 in Redis

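I am facing the following situation in a Redis cluster (v6.0.6, three masters and three replicas): when a node switches from replica to master, the cluster becomes completely unavailable.
One server hosts one node and the other server hosts five nodes. Each node holds about 50 GB of data. Both servers have enough RAM to handle all of this data.
Logs from node #1: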

4049748:M 05 Feb 2022 10:51:17.875 # Configuration change detected. Reconfiguring myself as a replica of cceb31e25d09e517b13d02a7afda01ca7c600dbe
4049748:S 05 Feb 2022 10:51:17.875 # Connection with replica client id #49754369 lost.
4049748:S 05 Feb 2022 10:51:17.875 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
4049748:S 05 Feb 2022 10:51:18.258 * Connecting to MASTER 172.16.0.1:6380
4049748:S 05 Feb 2022 10:51:18.258 * MASTER <-> REPLICA sync started
4049748:S 05 Feb 2022 10:51:18.258 * Non blocking connect for SYNC fired the event.
4049748:S 05 Feb 2022 10:51:18.259 * Master replied to PING, replication can continue...
4049748:S 05 Feb 2022 10:51:18.259 * Trying a partial resynchronization (request c2b950c078cdd48f345f855a154a99ff8d83f56d:2276202457008).
4049748:S 05 Feb 2022 10:51:20.366 * Full resync from master: b699d18495ae14e792a6dc09c9b413831b5d1951:2276203677470
4049748:S 05 Feb 2022 10:51:20.366 * Discarding previously cached master state.
4049748:S 05 Feb 2022 10:51:22.303 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to parser
4049748:S 05 Feb 2022 10:51:22.303 * MASTER <-> REPLICA sync: Flushing old data
4049748:S 05 Feb 2022 10:51:30.219 * MASTER <-> REPLICA sync: Loading DB in memory
4049748:S 05 Feb 2022 10:51:30.219 * Loading RDB produced by version 6.0.6
4049748:S 05 Feb 2022 10:51:30.219 * RDB age 8 seconds
4049748:S 05 Feb 2022 10:51:30.219 * RDB memory usage when created 56745.49 Mb
4049748:S 05 Feb 2022 11:01:45.652 * MASTER <-> REPLICA sync: Finished with success
4049748:S 05 Feb 2022 11:01:47.880 * Background append only file rewriting started by pid 3054262
4049748:S 05 Feb 2022 11:07:56.301 * AOF rewrite child asks to stop sending diffs.
3054262:C 05 Feb 2022 11:07:56.301 * Parent agreed to stop sending diffs. Finalizing AOF...
3054262:C 05 Feb 2022 11:07:56.301 * Concatenating 248.03 MB of AOF diff received from parent.
3054262:C 05 Feb 2022 11:07:57.307 * SYNC append only file rewrite performed
3054262:C 05 Feb 2022 11:07:59.858 * AOF rewrite: 2231 MB of memory used by copy-on-write
4049748:S 05 Feb 2022 11:08:02.165 * Background AOF rewrite terminated with success
4049748:S 05 Feb 2022 11:08:02.167 * Residual parent diff successfully flushed to the rewritten AOF (1.39 MB)
4049748:S 05 Feb 2022 11:08:02.167 * Background AOF rewrite finished successfully

Logs from node #2:

3829405:S 05 Feb 2022 10:51:16.248 * FAIL message received from 83dcf2551f9652e6f6f79a2d38348c6c16949f25 about fc8e0258e3700c5fde1e3c0487ac1841718db33a
3829405:S 05 Feb 2022 10:51:16.266 # Start of election delayed for 765 milliseconds (rank #0, offset 2276202456934).
3829405:S 05 Feb 2022 10:51:17.067 # Starting a failover election for epoch 114.
3829405:S 05 Feb 2022 10:51:17.070 # Failover election won: I'm the new master.
3829405:S 05 Feb 2022 10:51:17.070 # configEpoch set to 114 after successful failover
3829405:M 05 Feb 2022 10:51:17.070 # Connection with master lost.
3829405:M 05 Feb 2022 10:51:17.070 * Caching the disconnected master state.
3829405:M 05 Feb 2022 10:51:17.070 * Discarding previously cached master state.
3829405:M 05 Feb 2022 10:51:17.070 # Setting secondary replication ID to c2b950c078cdd48f345f855a154a99ff8d83f56d, valid up to offset: 2276202456935. New replication ID is b699d18495ae14e792a6dc09c9b413831b5d1951
3829405:M 05 Feb 2022 10:51:18.116 * Clear FAIL state for node fc8e0258e3700c5fde1e3c0487ac1841718db33a: master without slots is reachable again.
3829405:M 05 Feb 2022 10:51:18.261 * Replica 172.16.0.3:6384 asks for synchronization
3829405:M 05 Feb 2022 10:51:18.261 * Partial resynchronization not accepted: Requested offset for second ID was 2276202457008, but I can reply up to 2276202456935
3829405:M 05 Feb 2022 10:51:18.261 * Delay next BGSAVE for diskless SYNC
3829405:M 05 Feb 2022 10:51:20.367 * Starting BGSAVE for SYNC with target: replicas sockets
3829405:M 05 Feb 2022 10:51:22.289 * Background RDB transfer started by pid 2706728
3829405:M 05 Feb 2022 10:51:27.991 * Marking node fc8e0258e3700c5fde1e3c0487ac1841718db33a as failing (quorum reached).
3829405:M 05 Feb 2022 10:51:30.408 * Clear FAIL state for node fc8e0258e3700c5fde1e3c0487ac1841718db33a: replica is reachable again.
2706728:C 05 Feb 2022 10:58:46.389 * RDB: 1064 MB of memory used by copy-on-write
3829405:M 05 Feb 2022 10:58:46.389 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
3829405:M 05 Feb 2022 10:58:48.963 * Background RDB transfer terminated with success
3829405:M 05 Feb 2022 10:58:48.963 * Streamed RDB transfer with replica 172.16.0.3:6384 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
3829405:M 05 Feb 2022 10:58:50.021 * Marking node fc8e0258e3700c5fde1e3c0487ac1841718db33a as failing (quorum reached).
3829405:M 05 Feb 2022 11:01:50.058 * Clear FAIL state for node fc8e0258e3700c5fde1e3c0487ac1841718db33a: replica is reachable again.
3829405:M 05 Feb 2022 11:01:50.751 * Synchronization with replica 172.16.0.3:6384 succeeded

Redis configuration (similar on all nodes):

bind 172.16.0.1
protected-mode yes
port 6380
tcp-backlog 511
timeout 300
tcp-keepalive 300
daemonize yes
supervised no
pidfile "/var/run/redis/redis-server-6380.pid"
loglevel notice
logfile "/var/log/redis/redis-server-6380.log"
databases 16
always-show-logo no
#save 900 1
#save 300 10
#save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename "dump.rdb"
dir "/var/lib/redis-6380"
replica-serve-stale-data yes
replica-read-only no
repl-diskless-sync yes
repl-diskless-sync-delay 1
repl-disable-tcp-nodelay no
replica-priority 100
maxclients 10000
maxmemory 0
maxmemory-policy noeviction
lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
replica-lazy-flush no
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble yes
cluster-enabled yes
cluster-config-file "nodes-6380.conf"
cluster-node-timeout 5000
slowlog-log-slower-than 500000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
stream-node-max-bytes 4kb
stream-node-max-entries 100
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 20gb 10gb 0
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
dynamic-hz yes
aof-rewrite-incremental-fsync yes
rdb-save-incremental-fsync yes
user default on nopass ~* +@all
repl-timeout 300
cluster-require-full-coverage no
repl-diskless-load swapdb

[Screenshot from Zabbix]
What can be checked and fixed?

k10s72fa #1

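One possible scenario is that your master Redis server is busy and the RDB transfer/load on the replica is slow.
You can address this by increasing the replication buffer limit.
The replication buffer is a memory buffer on the master that holds data for a replica while that replica is synchronizing with the master.
To fix this, first check the output of the following command:

config get client-output-buffer-limit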

If the output looks like this:

1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
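
The value is a flat string of four-token groups: client class, hard limit in bytes, soft limit in bytes, and soft-limit seconds. A minimal Python sketch for reading it in human units follows; `BufferLimit`, `parse_buffer_limits` and `to_mib` are hypothetical helper names used only for illustration, and the class name `slave` is taken from the sample output above (other Redis versions may print it differently).

```python
from typing import Dict, NamedTuple


class BufferLimit(NamedTuple):
    hard_bytes: int    # client is disconnected as soon as its buffer exceeds this (0 = no limit)
    soft_bytes: int    # client is disconnected if its buffer stays above this for soft_seconds
    soft_seconds: int


def parse_buffer_limits(value: str) -> Dict[str, BufferLimit]:
    """Split 'class hard soft seconds ...' into one BufferLimit per client class."""
    tokens = value.split()
    if len(tokens) % 4 != 0:
        raise ValueError("expected groups of four tokens: class hard soft seconds")
    limits = {}
    for i in range(0, len(tokens), 4):
        name, hard, soft, seconds = tokens[i:i + 4]
        limits[name] = BufferLimit(int(hard), int(soft), int(seconds))
    return limits


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


# Sample value copied from the CONFIG GET output above.
sample = "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
limits = parse_buffer_limits(sample)
replica = limits["slave"]
print(f"replica hard limit: {to_mib(replica.hard_bytes):.0f} MiB")
print(f"replica soft limit: {to_mib(replica.soft_bytes):.0f} MiB for {replica.soft_seconds}s")
```

For the sample above this reports a 256 MiB hard limit and a 64 MiB soft limit held for 60 seconds, which is small next to a dataset of roughly 50 GB per node.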

You can increase the buffer limit like this:

config set client-output-buffer-limit "slave 10737418240 10737418240 60"

Note: the command should be run on the master node.
For more information, see this blog post about the replication buffer.
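
To check whether the slow transfer/load scenario applies, one option is to measure how long the resync took according to the replica's own log. The sketch below is only an illustration, assuming log lines in the format shown in the question; `log_time` and `seconds_between` are hypothetical helper names, not part of Redis or any client library.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Matches the 'pid:role DD Mon YYYY HH:MM:SS.mmm' prefix used in the logs in the question.
LOG_PREFIX = re.compile(r"^\d+:[A-Z] (\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def log_time(line: str) -> datetime:
    """Extract the timestamp from one Redis log line."""
    match = LOG_PREFIX.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S.%f")


def seconds_between(lines: Iterable[str], start_marker: str, end_marker: str) -> Optional[float]:
    """Seconds from the first line containing start_marker to the next line containing end_marker."""
    start = None
    for line in lines:
        if start is None:
            if start_marker in line:
                start = log_time(line)
        elif end_marker in line:
            return (log_time(line) - start).total_seconds()
    return None  # one of the markers was not found


# Three lines copied from the node #1 log in the question.
replica_log = [
    "4049748:S 05 Feb 2022 10:51:20.366 * Full resync from master: b699d18495ae14e792a6dc09c9b413831b5d1951:2276203677470",
    "4049748:S 05 Feb 2022 10:51:30.219 * MASTER <-> REPLICA sync: Loading DB in memory",
    "4049748:S 05 Feb 2022 11:01:45.652 * MASTER <-> REPLICA sync: Finished with success",
]

resync_seconds = seconds_between(replica_log, "Full resync from master", "Finished with success")
print(f"full resync took {resync_seconds:.1f} s")
```

On the lines above this gives about 625 seconds, so the replica spent more than ten minutes in the full resync.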
