Zookeeper问题-如果首领被杀,需要15分钟恢复

2uluyalo  于 2022-12-09  发布在  Apache
关注(0)|答案(3)|浏览(159)

i m trying to implement Kafka with zookeeper in my network but i am facing a weird issue with zookeeper. i have looked around google and realized that many other users reported such issue but no one posted any proper solution for this .
My current setup has 3 different zookeeper nodes (32 GB ram dedicated boxes) The issue is if i kill zookeeper leader, both the remaining follower nodes also goes down and they do not recover for at-least next 15-20 minutes.
All i am getting in the zookeeper logs is "notification timeouts" without any explanation
Here is my zookeeper config file

tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=100
maxSessionTimeout=50000
dataDir=/var/lib/zookeeper
clientPort=2181
autopurge.snapRetainCount=100
autopurge.purgeInterval=1
preAllocSize=131072
snapCount=3000000

server.1=zo1:2888:3888
server.2=zo2:2888:3888
server.3=zo3:2888:3888

in my /etc/hosts file i have mapped zo1 , zo2 , zo3 to their ip address.
Note:i have also tested by setting current node ip to 0.0.0.0 it doesn't makes any difference.
just few minutes ago i tested it and again it failed to recover. As i have three node cluster zo1 , zo2 and zo3 . zo3 was the leader and zo1 and zo2 were followers. after i killed zo3 node . it took approx 13 minutes to recover automatically . i got the following logs in zo1 and zo2 .
Log for zo1.
tail /var/lib/zookeeper/zookeeper.out -n 10000 | grep 'QuorumPeer'

2019-01-02 10:25:50,848 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down
2019-01-02 10:25:50,848 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting down
2019-01-02 10:25:50,848 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - Shutting down
2019-01-02 10:25:50,848 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2019-01-02 10:25:50,848 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] - shutdown of request processor complete
2019-01-02 10:25:50,849 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] - Shutting down
2019-01-02 10:25:50,849 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
2019-01-02 10:25:50,850 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id =  1, proposed zxid=0x2d00035c8e
2019-01-02 10:25:51,057 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
2019-01-02 10:25:51,458 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
2019-01-02 10:25:52,259 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
2019-01-02 10:25:53,859 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200
2019-01-02 10:25:57,060 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400
2019-01-02 10:26:03,461 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800
2019-01-02 10:26:16,262 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600
2019-01-02 10:26:41,862 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200
2019-01-02 10:27:33,063 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:28:33,065 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:29:33,066 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:30:33,066 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:31:33,067 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:32:33,068 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:33:33,069 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:34:33,069 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:35:33,070 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:36:33,071 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:37:33,071 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:38:33,072 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:39:33,073 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:40:33,074 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:41:33,075 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:42:33,076 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:43:33,076 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:43:33,082 [myid:1] - INFO  [WorkerSender[myid=1]:QuorumPeer$QuorumServer@167] - Resolved hostname: zo3 to address: zo3/144.76.xxx.xxx
2019-01-02 10:43:33,091 [myid:1] - INFO  [WorkerSender[myid=1]:QuorumPeer$QuorumServer@167] - Resolved hostname: zo3 to address: zo3/144.76.xxx.xxx
2019-01-02 10:43:33,290 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING
2019-01-02 10:43:33,290 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 50000 datadir /var/lib/zookeeper/version-2 snapdir /var/lib/zookeeper/version-2
2019-01-02 10:43:33,291 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING - LEADER ELECTION TOOK - 1062441
2019-01-02 10:43:33,291 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@167] - Resolved hostname: zo2 to address: zo2/88.198.35.34
2019-01-02 10:43:33,294 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@237] - Unexpected exception, tries=0, connecting to zo2/88.198.35.34:2888
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
2019-01-02 10:43:34,468 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@332] - Getting a diff from the leader 0x2d00035c8e
2019-01-02 10:43:35,120 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@687] - Established session 0x2680a49e3dc0013 with negotiated timeout 6000 for client /5.9.xxx.xxx:36664
2019-01-02 10:43:35,244 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@687] - Established session 0x1680a49b6b90011 with negotiated timeout 30000 for client /5.9.xxx.xxx:36668
2019-01-02 10:43:35,625 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@118] - Got zxid 0x2e00000001 expected 0x1

Logs from node zo2 , which became leader later

2019-01-02 10:25:50,852 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed socket connection for client /5.9.xxx.xxx:21218 which had sessionid 0x2680a49e3dc0012
2019-01-02 10:25:50,852 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down
2019-01-02 10:25:50,853 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting down
2019-01-02 10:25:50,853 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - Shutting down
2019-01-02 10:25:50,854 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2019-01-02 10:25:50,854 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] - shutdown of request processor complete
2019-01-02 10:25:50,856 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] - Shutting down
2019-01-02 10:25:50,857 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
2019-01-02 10:25:50,858 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id =  2, proposed zxid=0x2d00035c8e
2019-01-02 10:25:51,061 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
2019-01-02 10:25:51,462 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
2019-01-02 10:25:52,283 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
2019-01-02 10:25:53,884 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200
2019-01-02 10:25:57,084 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400
2019-01-02 10:26:03,485 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800
2019-01-02 10:26:16,286 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600
2019-01-02 10:26:41,887 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200
2019-01-02 10:27:33,087 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:28:33,088 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:29:33,089 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:30:33,090 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:31:33,091 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:32:33,092 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:33:33,092 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:34:33,093 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:35:33,094 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:36:33,095 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:37:33,095 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:38:33,096 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:39:33,097 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:40:33,098 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:41:33,099 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:42:33,100 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:43:33,293 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@947] - LEADING
2019-01-02 10:43:33,299 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@62] - TCP NoDelay set to: true
2019-01-02 10:43:33,301 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 50000 datadir /var/lib/zookeeper/version-2 snapdir /var/lib/zookeeper/version-2
2019-01-02 10:43:33,301 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@371] - LEADING - LEADER ELECTION TOOK - 1062443
2019-01-02 10:43:34,307 [myid:2] - INFO  [LearnerHandler-/144.76.120.143:64542:LearnerHandler@346] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@33d2c290
2019-01-02 10:43:34,509 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@961] - Have quorum of supporters, sids: [ 1,2 ]; starting up and setting last processed zxid: 0x2e00000000

As you can see all i am getting is continuous timeouts in the log without any explanation. Been testing it since more then a week still cant find any solution for this.
I would be very grateful if somebody could point me in right direction.
Thanks

vsdwdz23

vsdwdz231#

我们遇到了同样的问题,发现Zookeeper领导人选举发生在15分钟后。我们通过在Zookeeper属性中设置tcpKeepAlive=true来绕过这15分钟的空闲时间。

nx7onnlm

nx7onnlm2#

No election can happen when the cluster size is 2. If you intend to run a HA Zookeeper cluster, increase your zookeeper count to 5. Also, Zookeeper doesn't need 32GB to run effectively.
Check out:
https://docs.confluent.io/current/zookeeper/deployment.html#multi-node-setup for cluster information
and
https://docs.confluent.io/current/zookeeper/deployment.html#jvm for JVM sizing.

5kgi1eie

5kgi1eie3#

对于面临相同问题的任何人-原因是zookeeper处于空闲状态。如果您的zookeeper节点由于节点之间没有活动(数据事务)而处于空闲状态,则zookeeper节点内部将停止相互通信。如果任何zookeeper节点出现故障,其他节点将不会知道,直到它们在10-15分钟后进行周期性ping。

相关问题