Kafka 3.1.0 cluster stopped working with INCONSISTENT_TOPIC_ID and UNKNOWN_TOPIC_ID errors

hgc7kmma · posted 2022-10-07 in Kafka

So I have been running Kafka 3.1.0 in production. One of the VMs had to be live-migrated, but the live migration failed due to some issue and the node was migrated forcibly, which involved a full VM restart.

After that VM came back up, Kafka stopped working "completely": clients could not connect or produce/consume anything. JMX metrics were still being reported, but that node showed many partitions as "offline partitions".

Looking at the logs, that particular node kept logging many INCONSISTENT_TOPIC_ID errors. Example:

WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-2. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-2. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)

However, if you look at the other Kafka brokers, they show a different error (I don't have a log sample for it): UNKNOWN_TOPIC_ID.

Another interesting thing: I described the Kafka topic, and this is what I got:

Topic: my-topic        TopicId: XXXXXXXXXXXXXXXXXXXXXX PartitionCount: 4       ReplicationFactor: 4    Configs: segment.bytes=214748364,unclean.leader.election.enable=true,retention.bytes=214748364
        Topic: my-topic        Partition: 0    Leader: 2       Replicas: 5,2,3,0       Isr: 2
        Topic: my-topic        Partition: 1    Leader: 0       Replicas: 0,1,2,3       Isr: 0
        Topic: my-topic        Partition: 2    Leader: 2       Replicas: 1,2,3,4       Isr: 2
        Topic: my-topic        Partition: 3    Leader: 2       Replicas: 2,3,4,5       Isr: 2

Why does it show only 1 ISR when there should be 4 per partition? And why did this happen in the first place?
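(For reference, describe output like the above comes from kafka-topics.sh; a minimal sketch, where the bootstrap server address is an assumption:)

bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic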

I have since added an additional partition, as shown below:

Topic: my-topic        TopicId: XXXXXXXXXXXXXXXXXXXXXX PartitionCount: 5       ReplicationFactor: 4    Configs: segment.bytes=214748364,unclean.leader.election.enable=true,retention.bytes=214748364
        Topic: my-topic        Partition: 0    Leader: 2       Replicas: 5,2,3,0       Isr: 2
        Topic: my-topic        Partition: 1    Leader: 0       Replicas: 0,1,2,3       Isr: 0
        Topic: my-topic        Partition: 2    Leader: 2       Replicas: 1,2,3,4       Isr: 2
        Topic: my-topic        Partition: 3    Leader: 2       Replicas: 2,3,4,5       Isr: 2
        Topic: my-topic        Partition: 4    Leader: 3       Replicas: 3,4,5,0       Isr: 3,4,5,0
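(The extra partition above was presumably added with something along these lines; again, the bootstrap server address is an assumption:)

bin/kafka-topics.sh --bootstrap-server localhost:9092 --alter --topic my-topic --partitions 5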

I know there is the kafka-reassign-partitions.sh script, and it fixed a similar issue in a pre-production environment, but I'm more interested in why this happened in the first place.
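(For completeness, a minimal sketch of how that script is typically run; the file name, replica assignment, and bootstrap server address below are assumptions:)

cat > reassign.json <<'EOF'
{"version":1,"partitions":[{"topic":"my-topic","partition":0,"replicas":[5,2,3,0]}]}
EOF
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file reassign.json --execute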

Could this be related? I have set the parameter replica.lag.time.max.ms=5000 (well above the default of 500), and it did not help even after restarting all the nodes.

wgeznvg7 1#

This usually happens when the topic ID in the session does not match the topic ID in the log. To resolve it, you have to make sure the topic ID stays consistent across your cluster.

If you are using ZooKeeper, run this command in zkCli.sh on one of the nodes that is still in sync, and note down the topic_id:

[zk: localhost:2181(CONNECTED) 10] get /brokers/topics/my-topic
{"partitions":{"0":[5,1,2],"1":[5,1,2],"2":[5,1,2],"3":[5,1,2],"4":
[5,1,2],"5":[5,1,2],"6":[5,1,2],"7":[5,1,2],"8":[5,1,2],"9":
[5,1,2]},"topic_id":"s3zoLdMp-T3CIotKlkBpMgL","adding_replicas":
{},"removing_replicas":{},"version":3}

Next, on each node, check the partition.metadata file of every partition of the my-topic topic. The file can be found under log.dirs (see server.properties).

For example, if log.dirs is set to /media/kafka-data, you will find them at:

/media/kafka-data/my-topic-1/partition.metadata for partition 1

/media/kafka-data/my-topic-2/partition.metadata for partition 2, and so on.

The contents of the file might look like this (you can see it matches the topic_id that ZooKeeper has):

version: 0
topic_id: s3zoLdMp-T3CIotKlkBpMgL
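(A quick way to compare these on each broker is to grep the topic_id out of every partition.metadata file for the topic; a minimal sketch, assuming the /media/kafka-data path and topic name used above:)

# run on each broker; the log dir path and topic name are assumptions
for f in /media/kafka-data/my-topic-*/partition.metadata; do
  echo "$f: $(grep topic_id "$f")"
done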

You need to make sure that the value of topic_id in all the partition.metadata files for my-topic is the same across the cluster. If you come across a different topic ID in any partition, you can edit the file with any text editor (or write a script to do it for you).
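(As a sketch of such a script, the following rewrites the topic_id in every partition.metadata file of the topic to the value noted from ZooKeeper; stop the broker and back the files up first. The path, topic name, and ID are assumptions:)

# run on the affected broker while it is stopped; back up the files first
GOOD_ID="s3zoLdMp-T3CIotKlkBpMgL"   # topic_id noted from ZooKeeper
for f in /media/kafka-data/my-topic-*/partition.metadata; do
  sed -i "s/^topic_id: .*/topic_id: ${GOOD_ID}/" "$f"
done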

Once done, you may need to restart the brokers one at a time for this change to take effect.
