Kafka 3.1.0 cluster stopped working with INCONSISTENT_TOPIC_ID and UNKNOWN_TOPIC_ID errors

hgc7kmma · posted 2022-10-07 in Kafka

So I have been running Kafka 3.1.0 in production. One of the VMs had to be live-migrated, but the live migration failed due to some issue and the node was migrated forcibly, which involved a full VM restart.

After that VM came back up, Kafka stopped working "completely": clients could not connect or produce/consume anything. JMX metrics were still being reported, but that node showed many partitions as "offline partitions".

Looking at the logs, that particular node kept logging many INCONSISTENT_TOPIC_ID errors. Example:

WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-2. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-2. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)

However, if you look at the other Kafka brokers, they show a different error (I don't have a log sample for it): UNKNOWN_TOPIC_ID.

Another interesting thing: I described the Kafka topic, and this is what I got:

Topic: my-topic        TopicId: XXXXXXXXXXXXXXXXXXXXXX PartitionCount: 4       ReplicationFactor: 4    Configs: segment.bytes=214748364,unclean.leader.election.enable=true,retention.bytes=214748364
        Topic: my-topic        Partition: 0    Leader: 2       Replicas: 5,2,3,0       Isr: 2
        Topic: my-topic        Partition: 1    Leader: 0       Replicas: 0,1,2,3       Isr: 0
        Topic: my-topic        Partition: 2    Leader: 2       Replicas: 1,2,3,4       Isr: 2
        Topic: my-topic        Partition: 3    Leader: 2       Replicas: 2,3,4,5       Isr: 2

Why does it show only 1 ISR when there should be 4 per partition? And why did this happen in the first place?
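(For reference, describe output like the above comes from kafka-topics.sh; a minimal sketch, where the bootstrap server address is an assumption:)

bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic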

I have since added an additional partition, as shown below:

Topic: my-topic        TopicId: XXXXXXXXXXXXXXXXXXXXXX PartitionCount: 5       ReplicationFactor: 4    Configs: segment.bytes=214748364,unclean.leader.election.enable=true,retention.bytes=214748364
        Topic: my-topic        Partition: 0    Leader: 2       Replicas: 5,2,3,0       Isr: 2
        Topic: my-topic        Partition: 1    Leader: 0       Replicas: 0,1,2,3       Isr: 0
        Topic: my-topic        Partition: 2    Leader: 2       Replicas: 1,2,3,4       Isr: 2
        Topic: my-topic        Partition: 3    Leader: 2       Replicas: 2,3,4,5       Isr: 2
        Topic: my-topic        Partition: 4    Leader: 3       Replicas: 3,4,5,0       Isr: 3,4,5,0
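(The extra partition above was presumably added with something along these lines; again, the bootstrap server address is an assumption:)

bin/kafka-topics.sh --bootstrap-server localhost:9092 --alter --topic my-topic --partitions 5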

I know there is the kafka-reassign-partitions.sh script, and it fixed a similar issue in a pre-production environment, but I'm more interested in why this happened in the first place.
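(For completeness, a minimal sketch of how that script is typically run; the file name, replica assignment, and bootstrap server address below are assumptions:)

cat > reassign.json <<'EOF'
{"version":1,"partitions":[{"topic":"my-topic","partition":0,"replicas":[5,2,3,0]}]}
EOF
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file reassign.json --execute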

Could this be related? I have set the parameter replica.lag.time.max.ms=5000 (well above the default of 500), and it did not help even after restarting all the nodes.

wgeznvg7 1#

This usually happens when the topic ID in the session does not match the topic ID in the log. To resolve it, you have to make sure the topic ID stays consistent across your cluster.

If you are using ZooKeeper, run this command in zkCli.sh on one of the nodes that is still in sync, and note down the topic_id:

[zk: localhost:2181(CONNECTED) 10] get /brokers/topics/my-topic
{"partitions":{"0":[5,1,2],"1":[5,1,2],"2":[5,1,2],"3":[5,1,2],"4":
[5,1,2],"5":[5,1,2],"6":[5,1,2],"7":[5,1,2],"8":[5,1,2],"9":
[5,1,2]},"topic_id":"s3zoLdMp-T3CIotKlkBpMgL","adding_replicas":
{},"removing_replicas":{},"version":3}

Next, on each node, check the partition.metadata file of every partition of the my-topic topic. The file can be found under log.dirs (see server.properties).

For example, if log.dirs is set to /media/kafka-data, you will find them at:

/media/kafka-data/my-topic-1/partition.metadata for partition 1

/media/kafka-data/my-topic-2/partition.metadata for partition 2, and so on.

The contents of the file might look like this (you can see it matches the topic_id that ZooKeeper has):

version: 0
topic_id: s3zoLdMp-T3CIotKlkBpMgL
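(A quick way to compare these on each broker is to grep the topic_id out of every partition.metadata file for the topic; a minimal sketch, assuming the /media/kafka-data path and topic name used above:)

# run on each broker; the log dir path and topic name are assumptions
for f in /media/kafka-data/my-topic-*/partition.metadata; do
  echo "$f: $(grep topic_id "$f")"
done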

You need to make sure that the value of topic_id in all the partition.metadata files for my-topic is the same across the cluster. If you come across a different topic ID in any partition, you can edit the file with any text editor (or write a script to do it for you).
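(As a sketch of such a script, the following rewrites the topic_id in every partition.metadata file of the topic to the value noted from ZooKeeper; stop the broker and back the files up first. The path, topic name, and ID are assumptions:)

# run on the affected broker while it is stopped; back up the files first
GOOD_ID="s3zoLdMp-T3CIotKlkBpMgL"   # topic_id noted from ZooKeeper
for f in /media/kafka-data/my-topic-*/partition.metadata; do
  sed -i "s/^topic_id: .*/topic_id: ${GOOD_ID}/" "$f"
done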

Once done, you may need to restart the brokers one at a time for this change to take effect.
