Reproducing the fail state in a Redis cluster

zujrkrfu, posted on 2021-06-10 in Redis

Not long ago we had an outage in production on a Redis cluster (6 nodes: 3 masters and 3 replicas). The relevant logs are below.

78367:M 26 Jul 2020 09:38:35.143 # Cluster state changed: fail
78367:M 26 Jul 2020 09:39:16.847 # Configuration change detected. Reconfiguring myself as a replica of 6afa0d0ffadcff546d49251ab25bc1bc3560142b
78367:S 26 Jul 2020 09:39:16.848 # Connection with replica <host1>:<port1> lost.
78367:S 26 Jul 2020 09:39:16.848 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
78367:S 26 Jul 2020 09:39:16.890 * Connecting to MASTER <host1>:<port1>
78367:S 26 Jul 2020 09:39:16.890 * MASTER <-> REPLICA sync started
78367:S 26 Jul 2020 09:39:16.891 * Non blocking connect for SYNC fired the event.
78367:S 26 Jul 2020 09:39:16.891 * Master replied to PING, replication can continue...
78367:S 26 Jul 2020 09:39:16.892 * Trying a partial resynchronization (request c366cd12c8af87284822a0e26f0d00cbfa88a198:410270005304).
78367:S 26 Jul 2020 09:39:16.921 * Full resync from master: e75ac81e4d48c04320bc2244a54b9278f218221b:410270005205
78367:S 26 Jul 2020 09:39:16.921 * Discarding previously cached master state.
78367:S 26 Jul 2020 09:39:17.994 # Cluster state changed: ok
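To put a number on the outage, the time between the two "Cluster state changed" lines can be computed directly from the log. The following is a minimal sketch, assuming the timestamp layout shown above; the helper names `state_changes` and `fail_seconds` are hypothetical and only serve this illustration.

```python
from datetime import datetime

# Two lines copied from the log above.
LOG = """\
78367:M 26 Jul 2020 09:38:35.143 # Cluster state changed: fail
78367:S 26 Jul 2020 09:39:17.994 # Cluster state changed: ok
"""

MARKER = "# Cluster state changed: "


def state_changes(log_text):
    """Yield (timestamp, state) for every 'Cluster state changed' line."""
    for line in log_text.splitlines():
        if MARKER not in line:
            continue
        prefix, state = line.split(MARKER, 1)
        # prefix looks like '78367:M 26 Jul 2020 09:38:35.143 '
        stamp = prefix.split(" ", 1)[1].strip()
        yield datetime.strptime(stamp, "%d %b %Y %H:%M:%S.%f"), state.strip()


def fail_seconds(log_text):
    """Total seconds spent between each 'fail' and the next 'ok'."""
    total, failed_at = 0.0, None
    for when, state in state_changes(log_text):
        if state == "fail" and failed_at is None:
            failed_at = when
        elif state == "ok" and failed_at is not None:
            total += (when - failed_at).total_seconds()
            failed_at = None
    return total


if __name__ == "__main__":
    print(fail_seconds(LOG))
```

For the two lines above, this node reported the fail state for roughly 43 seconds.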

Relevant redis.conf:

pidfile /var/run/redis_9000.pid

protected-mode yes

maxclients 10000
lua-time-limit 5000

slowlog-log-slower-than 10000
slowlog-max-len 128

# ************Persistence*******

appendonly no

# aof-use-rdb-preamble yes

save ""

dbfilename redis.rdb
appendfilename redis.aof

stop-writes-on-bgsave-error no

repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no

# *************CLUSTER CONFIG***************

cluster-enabled yes
cluster-require-full-coverage no
cluster-node-timeout 15000
cluster-config-file node.conf
cluster-migration-barrier 1

# ********MEMORY********

maxmemory-policy noeviction

# ********REPLICATION***************

# repl-timeout 600

# repl-backlog-size 100mb

# slave-priority 0

no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

aof-load-truncated yes
latency-monitor-threshold 0
aof-rewrite-incremental-fsync yes

repl-timeout 120
repl-backlog-size 20mb
slave-priority 100
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 400mb 350mb 300

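According to the documentation at https://redis.io/commands/cluster-info, the cluster state changes to fail under the following conditions:
cluster_state: State is ok if the node is able to receive queries. fail if there is at least one hash slot which is unbound (no node associated), in error state (node serving it is flagged with FAIL flag), or if the majority of masters can't be reached by this node.
However, I cannot work out what caused the cluster to enter the fail state, how to reproduce the same problem, or how to mitigate it.
Note: because cluster-require-full-coverage is set to no, the CLUSTER INFO command still reports the cluster state as ok when a master-replica pair goes down.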
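The rule quoted from the documentation, combined with the note about cluster-require-full-coverage, can be written as a small decision function. This is only a simplified model of how I read the documentation, not Redis's actual implementation. The function name and parameters are hypothetical, and it assumes that `reachable_masters` counts the node itself when it is a master.

```python
def cluster_state(total_masters, reachable_masters, uncovered_slots,
                  require_full_coverage):
    """Return 'ok' or 'fail' as seen by ONE node (simplified model)."""
    # Assumption: unbound or failed slots only matter when
    # cluster-require-full-coverage is yes.
    if require_full_coverage and uncovered_slots > 0:
        return "fail"
    # The node must be able to reach a majority of the masters.
    majority = total_masters // 2 + 1
    if reachable_masters < majority:
        return "fail"
    return "ok"


if __name__ == "__main__":
    # 3 masters, one master-replica pair down, full coverage not required.
    print(cluster_state(3, 2, 5461, require_full_coverage=False))
    # The same node when it can reach only itself.
    print(cluster_state(3, 1, 0, require_full_coverage=False))
```

Under this model, with cluster-require-full-coverage no and 3 masters, losing one master-replica pair leaves the state ok, which matches the note above. The only remaining route to fail is a node losing contact with a majority of masters, so a network partition or a long stall on this node would be one thing to check. The logs above do not confirm that this is what happened.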
