为什么我的Zookeeper服务器无法重新加入仲裁?

5q4ezhmt  于 2022-12-09  发布在  Apache
关注(0)|答案(2)|浏览(364)

I have three servers in my quorum. They are running ZooKeeper 3.4.5. Two of them appear to be running fine based on the output from mntr . One of them was restarted a couple days ago due to a deploy, and since then has not been able to join the quorum. Some lines in the logs that stick out are:

2014-03-03 18:44:40,995 [myid:1] - INFO  [main:QuorumPeer@429] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation

and:

2014-03-03 18:44:41,233 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (2, 1)
2014-03-03 18:44:41,234 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (3, 1)
2014-03-03 18:44:41,235 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@774] - Notification time out: 400

Googling for the first ('currentEpoch not found!') led me to JIRA ZOOKEEPER-1653 - zookeeper fails to start because of inconsistent epoch . It describes a bug fix but doesn't describe a way to resolve the issue without upgrading zookeeper.
Googling for the second ('Have smaller server identifier, so dropping the connection') led me to JIRA ZOOKEEPER-1506 - Re-try DNS hostname -> IP resolution if node connection fails . This makes sense because I am using AWS Elastic IPs for the servers. The fix for this issue seems to be to do a rolling restart, which would cause us to temporarily lose quorum.
It looks like the second issue is definitely in play because I see timeouts in the other ZooKeeper server's logs (the ones still in the quorum) when trying to connect to the first server. What I'm not sure of is if the first issue will disappear when I do a rolling restart. I would like to avoid upgrading and/or doing a rolling restart, but if I have to do a rolling restart I'd like to avoid doing it multiple times. Is there a way to fix the first issue without upgrading? Or even better: Is there a way to resolve both issues without doing a rolling restart?
Thanks for reading and for your help!

mkshixfv

mkshixfv1#

这是一只臭虫的饲养员:Server is unable to join quorum after connection broken to other peers重新启动引线可解决此问题。

06odsfpq

06odsfpq2#

当您的机组或主机使用相同的ID以不同的IP重新加入集群时,每个人都会遇到此问题。对于您的主机,您的IP可能会更改,因为在您的配置中可能指定了0.0.0.0或域名。因此,请按照以下说明操作:
1.停止所有服务器,并在配置中使用

server.1=10.x.x.x:1234:5678
server.2=10.x.x.y:1234:5678
server.3=10.x.x.z:1234:5678

不是DNS名称

1.使用您的IP LAN作为标识符
1.启动你的服务器它应该工作

相关问题