Kafka stops after a certain period of time, lost brokers

Question

I have 4 Kafka brokers running with Debezium. After some days of running well, three of the Kafka machines were off the network for a period of time, and in the connectDistributed.out log file I see a lot of messages with the following error:

[2020-05-04 13:27:02,526] WARN [Consumer clientId=connector-consumer-sink-warehouse-6, 
groupId=connect-sink-warehouse] 133 partitions have leader brokers without a matching listener,
 including [SCT010-2, SC2010-2, SC1010-0, SC1010-1, SF4010-0, SUB010-0, SUB010-1, SWP010-0, 
SWP010-1, ACO010-2] (org.apache.kafka.clients.NetworkClient:1044)
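
This warning usually means that the metadata the client receives points at partition leaders whose advertised listener the client cannot match (or the leader is not available at all). To see what listeners each broker actually registered, the broker znode can be read with the zookeeper-shell.sh tool shipped with Kafka (a minimal check, assuming the ZooKeeper address from my config below):

# show what broker 0 registered in ZooKeeper, including its "endpoints" (the advertised listeners)
bin/zookeeper-shell.sh 192.168.240.70:2181 get /brokers/ids/0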

I have 4 Kafka machines, with brokers 0 to 3:

192.168.240.70 - Broker 0
192.168.240.71 - Broker 1
192.168.240.72 - Broker 2
192.168.240.73 - Broker 3

Zookeeper:

192.168.240.70

Below is my server.properties. The files are the same on all machines, except for listeners and advertised.listeners, which point to the IP of the machine where that Kafka broker is installed, and broker.id, which must be unique (from 0 to 3):

broker.id=0
listeners=CONTROLLER://192.168.240.70:9091,INTERNAL://192.168.240.70:9092
advertised.listeners=CONTROLLER://192.168.240.70:9091,INTERNAL://192.168.240.70:9092
listener.security.protocol.map=CONTROLLER:PLAINTEXT,INTERNAL:PLAINTEXT
control.plane.listener.name=CONTROLLER
inter.broker.listener.name=INTERNAL
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/home/william/kafka/data/kafka/
num.partitions=3
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=1
log.retention.hours=150
log.retention.bytes=200000000000
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

zookeeper.connect=192.168.240.70:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=3
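
For example, broker 1 on 192.168.240.71 differs only in these lines (the other brokers follow the same pattern):

broker.id=1
listeners=CONTROLLER://192.168.240.71:9091,INTERNAL://192.168.240.71:9092
advertised.listeners=CONTROLLER://192.168.240.71:9091,INTERNAL://192.168.240.71:9092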

The Kafka Connect internal topics (configs, offsets and status) show replication problems. Is this related to the listeners config?

[screenshot: replication state of the Connect internal topics]
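
The affected partitions can be listed with the kafka-topics tool (a sketch; on older Kafka versions --bootstrap-server would be replaced with --zookeeper 192.168.240.70:2181):

# partitions whose in-sync replica set is smaller than the assigned replica set
bin/kafka-topics.sh --bootstrap-server 192.168.240.70:9092 --describe --under-replicated-partitions
# partitions that currently have no available leader at all
bin/kafka-topics.sh --bootstrap-server 192.168.240.70:9092 --describe --unavailable-partitions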

About the connectors' health:

[screenshot: connector health in Kafka Connect]

And on Kafka Connect, only one broker is shown:

[screenshot: brokers visible from Kafka Connect, only one shown]
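
The broker ids that are currently registered (and therefore visible to clients) can also be listed straight from ZooKeeper, with the same tool and address assumed above:

# only brokers with a live ZooKeeper session appear here
bin/zookeeper-shell.sh 192.168.240.70:2181 ls /brokers/ids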

How can I fix this error? It seems to be related to leader election, or to finding the leader again after a long period without broker access.

Answer 1

Score: 0

After some research I found the problem. So, to help here, this is the concept behind the problem:

When we create a distributed Kafka system, the topic partitions are spread across the brokers and a leader is elected for each of them. In my case I have 4 brokers, and ZooKeeper chooses on the fly which broker will be the leader for a given partition.

As three of the four Kafka servers were unreachable for more than an hour, ZooKeeper tried to reach the leaders and could not. And since my config requires each topic to be replicated to three brokers, the topics could not be kept healthy.
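
This can be seen on the internal offsets topic, which my config creates with offsets.topic.replication.factor=3: describing it shows, per partition, a Replicas list of three brokers but an Isr list shrunk to the brokers that are still reachable (an illustrative check, not output from my cluster):

bin/kafka-topics.sh --bootstrap-server 192.168.240.70:9092 --describe --topic __consumer_offsets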

We have the rebalance config group.initial.rebalance.delay.ms=3 (the value is in milliseconds, so rebalancing starts almost immediately), and there is a limited number of retries before a Kafka broker is considered down. Those retries were exhausted, and ZooKeeper could not reach the lost brokers.

In other words, the brokers were not down; they simply could not reach ZooKeeper because of network problems. After a while the retries from ZooKeeper stopped, and by the time the Kafka brokers became reachable again, the attempts to rebalance had already stopped.

Simply restarting my brokers, so that they re-registered with ZooKeeper, solved my problem: on restart, each Kafka broker tells ZooKeeper "I'm here, waiting for your instructions", and ZooKeeper, recognizing the lost leaders, reconnects everything in the right place.
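
For reference, a restart like this can be done with the scripts shipped in the Kafka distribution (a sketch, assuming the default installation layout; if the broker runs as a systemd service, restarting that service is the equivalent):

# run on each affected broker, one at a time
bin/kafka-server-stop.sh
bin/kafka-server-start.sh -daemon config/server.properties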

I hope this helps.
