So I have been running Kafka 3.1.0 in production. One of the VMs had to be live-migrated; the live migration failed due to some issues, and the node was migrated forcefully, which involved a full restart of the VM.
After that VM came back up, Kafka stopped working "completely": clients could not connect or produce/consume anything. JMX metrics were still being reported, but that node showed many partitions as "offline partitions".
Looking at the logs, that particular node keeps logging a lot of INCONSISTENT_TOPIC_ID errors. Example:
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-2. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-2. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
However, if you look at the other Kafka brokers, they show a different error (I do not have a log sample for it): UNKNOWN_TOPIC_ID
…
Another interesting thing: I have described the Kafka topic, and here is what I got (the command I used is sketched just below, followed by the output).
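The describe was presumably done with the standard tooling, roughly like this (the bootstrap address is a placeholder, not taken from the original post):

```
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic
```

And the output: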
Topic: my-topic TopicId: XXXXXXXXXXXXXXXXXXXXXX PartitionCount: 4 ReplicationFactor: 4 Configs: segment.bytes=214748364,unclean.leader.election.enable=true,retention.bytes=214748364
Topic: my-topic Partition: 0 Leader: 2 Replicas: 5,2,3,0 Isr: 2
Topic: my-topic Partition: 1 Leader: 0 Replicas: 0,1,2,3 Isr: 0
Topic: my-topic Partition: 2 Leader: 2 Replicas: 1,2,3,4 Isr: 2
Topic: my-topic Partition: 3 Leader: 2 Replicas: 2,3,4,5 Isr: 2
Why does it show only one ISR when there should be four per partition? And why did this happen in the first place?
I have since added an additional partition, as shown below (a sketch of the alter command is included after the output):
Topic: my-topic TopicId: XXXXXXXXXXXXXXXXXXXXXX PartitionCount: 5 ReplicationFactor: 4 Configs: segment.bytes=214748364,unclean.leader.election.enable=true,retention.bytes=214748364
Topic: my-topic Partition: 0 Leader: 2 Replicas: 5,2,3,0 Isr: 2
Topic: my-topic Partition: 1 Leader: 0 Replicas: 0,1,2,3 Isr: 0
Topic: my-topic Partition: 2 Leader: 2 Replicas: 1,2,3,4 Isr: 2
Topic: my-topic Partition: 3 Leader: 2 Replicas: 2,3,4,5 Isr: 2
Topic: my-topic Partition: 4 Leader: 3 Replicas: 3,4,5,0 Isr: 3,4,5,0
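The extra partition would have been added with something along these lines (again, the broker address is a placeholder):

```
kafka-topics.sh --bootstrap-server localhost:9092 --alter --topic my-topic --partitions 5
```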
I know there is the kafka-reassign-partitions.sh script, and it fixed a similar problem in our pre-production environment, but I am more interested in why this happened.
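For reference, that tool is driven by a JSON reassignment plan, roughly like this sketch (the plan file name and the replica assignment are illustrative only, not from the original post):

```
# Illustrative reassignment plan for a single partition.
cat > reassignment.json <<'EOF'
{"version":1,"partitions":[{"topic":"my-topic","partition":0,"replicas":[5,2,3,0]}]}
EOF

kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute
```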
Could this be related? I have set replica.lag.time.max.ms=5000 (well above the default), and even after restarting all the nodes it did not help.
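For context, that is a per-broker setting in server.properties, e.g.:

```
# server.properties on each broker
replica.lag.time.max.ms=5000
```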
1 Answer
This usually happens when the topic ID in the session does not match the topic ID in the log. To resolve it, you have to make sure the topic ID remains consistent across your cluster.
If you are using ZooKeeper, query the topic metadata from zkCli.sh on one of the nodes that is still in sync, and note down the topic_id.
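One way to do that, assuming the default ZooKeeper chroot (i.e. topic metadata lives under /brokers/topics), is:

```
# Inside a zkCli.sh session on an in-sync node (default chroot assumed):
get /brokers/topics/my-topic
```

The JSON that comes back should contain a topic_id field; that is the value to note down.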
Next, on every node, check the partition.metadata file of every partition of the my-topic topic. The file sits inside each partition's directory under the log directory configured by log.dirs (see server.properties). For example, if log.dirs is set to /media/kafka-data, you will find it at /media/kafka-data/my-topic-1/partition.metadata for partition 1, at /media/kafka-data/my-topic-2/partition.metadata for partition 2, and so on. The contents of the file look something like the example below, and you can see that it matches the topic_id that ZooKeeper holds.
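On Kafka 3.x, partition.metadata is a small two-line text file roughly like the following (the UUID here is made up for illustration):

```
version: 0
topic_id: 8wZzbGbnT9iv3jmERzDnhw
```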
You need to make sure that the value of topic_id is identical in every partition.metadata file for my-topic across the cluster. If you come across a different topic ID in any partition, you can fix it with any text editor (or write a script to do it for you; a minimal sketch follows below). Once done, you may need to restart the brokers one at a time for the change to take effect.
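As a sketch of the "write a script" option, something like this could flag mismatching files on a broker (the log directory, topic name, and expected ID are placeholders to adjust; it only reports mismatches, it does not edit anything):

```
#!/usr/bin/env bash
# Report partition.metadata files of a topic whose topic_id differs from
# the value recorded in ZooKeeper. Run on each broker; adjust the variables.
LOG_DIR=/media/kafka-data            # value of log.dirs on this broker
TOPIC=my-topic
EXPECTED_ID=8wZzbGbnT9iv3jmERzDnhw   # topic_id noted from ZooKeeper

for f in "$LOG_DIR/$TOPIC"-*/partition.metadata; do
  actual=$(awk '/^topic_id/ {print $2}' "$f")
  if [ "$actual" != "$EXPECTED_ID" ]; then
    echo "MISMATCH: $f has topic_id '$actual' (expected $EXPECTED_ID)"
  fi
done
```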