Kafka stops after a certain period of time, lost brokers


Question


I have 4 Kafka brokers running with Debezium. After several days of running well, three of the Kafka machines lost network connectivity for a period of time, and the connectDistributed.out log file is now full of messages with the following warning:

[2020-05-04 13:27:02,526] WARN [Consumer clientId=connector-consumer-sink-warehouse-6, 
groupId=connect-sink-warehouse] 133 partitions have leader brokers without a matching listener,
 including [SCT010-2, SC2010-2, SC1010-0, SC1010-1, SF4010-0, SUB010-0, SUB010-1, SWP010-0, 
SWP010-1, ACO010-2] (org.apache.kafka.clients.NetworkClient:1044)
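
That warning usually means the Connect worker's cluster metadata still lists partition leaders that it can no longer resolve or reach. A first check is to see which brokers are currently registered in ZooKeeper; a minimal sketch, assuming a standard Kafka install layout and the cluster's ZooKeeper at 192.168.240.70:2181 (the path and broker ID are illustrative):

# List the broker IDs currently registered in ZooKeeper
bin/zookeeper-shell.sh 192.168.240.70:2181 ls /brokers/ids

# Inspect one broker's registration; the "endpoints" field shows the
# advertised listeners that clients are told to connect to
bin/zookeeper-shell.sh 192.168.240.70:2181 get /brokers/ids/0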

I have 4 Kafka machines, with broker IDs 0 to 3:

192.168.240.70 - Broker 0
192.168.240.71 - Broker 1
192.168.240.72 - Broker 2
192.168.240.73 - Broker 3

Zookeeper:

192.168.240.70

Here is my server.properties. The file is the same on every broker, except for listeners and advertised.listeners, which point to the IP of the machine Kafka is installed on, and broker.id, which must be unique (from 0 to 3):

broker.id=0
listeners=CONTROLLER://192.168.240.70:9091,INTERNAL://192.168.240.70:9092
advertised.listeners=CONTROLLER://192.168.240.70:9091,INTERNAL://192.168.240.70:9092
listener.security.protocol.map=CONTROLLER:PLAINTEXT,INTERNAL:PLAINTEXT
control.plane.listener.name=CONTROLLER
inter.broker.listener.name=INTERNAL
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/home/william/kafka/data/kafka/
num.partitions=3
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=1
log.retention.hours=150
log.retention.bytes=200000000000
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

zookeeper.connect=192.168.240.70:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=3
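
With this layout, a quick sanity check is to ask the cluster for its metadata from the Connect worker host and confirm that all four brokers show up on their advertised INTERNAL listeners; a sketch, assuming a standard install (the path is illustrative):

# Lists every broker known to the cluster metadata (id and host:port),
# together with the API versions it supports; brokers missing here line
# up with the "no matching listener" warning above
bin/kafka-broker-api-versions.sh --bootstrap-server 192.168.240.70:9092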

The Kafka Connect internal topics (configs, offsets and status) show replication problems. Could this be related to the listeners configuration?

(screenshot)
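
To put numbers on that, the internal topics can be described directly; a minimal sketch, assuming the default Connect topic names (connect-configs, connect-offsets, connect-status); substitute whatever names the worker configuration actually uses:

# Show partition leaders, replicas and ISR for one of the Connect internal topics
bin/kafka-topics.sh --bootstrap-server 192.168.240.70:9092 --describe --topic connect-offsets

# List every partition whose in-sync replica set is smaller than its replica set
bin/kafka-topics.sh --bootstrap-server 192.168.240.70:9092 --describe --under-replicated-partitions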

About the connectors' health:

(screenshot)
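
The same health information is available from the Kafka Connect REST API; a sketch, assuming the worker listens on the default port 8083 and that the connector is called sink-warehouse (the name is only inferred from the consumer group in the warning above, so adjust it):

# List the connectors this worker knows about
curl -s http://localhost:8083/connectors

# Show the state of the connector and of each of its tasks
curl -s http://localhost:8083/connectors/sink-warehouse/status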

Also, on Kafka Connect, only one broker is shown:

(screenshot)

How can I fix this error? It seems to be related to leader election, or to finding the leader again after the brokers have been unreachable for a long time.

Answer 1

Score: 0


After some research I found the problem. To help others here, this is the concept behind it:

When we build a distributed Kafka system, we spread the topics across the brokers and leaders are elected for their partitions. In my case I have 4 brokers, and ZooKeeper chooses on the fly which broker will be the leader for a given topic.

As three of the four Kafka servers were off the network for more than an hour, ZooKeeper tried to reach the leaders and could not. Since my config requires every topic to be replicated to three brokers, ZooKeeper could not keep the topics healthy.

We also have the rebalance setting group.initial.rebalance.delay.ms=3 (the value is in milliseconds, and it only delays the initial consumer group rebalance), and there is a limited number of reconnection attempts before a Kafka broker is treated as gone. Those attempts were used up, and ZooKeeper could not reach the lost brokers.

Put another way, the brokers were not actually down; they simply could not reach ZooKeeper because of the network problem. After a while ZooKeeper stopped retrying, and by the time the Kafka brokers became reachable again the rebalance attempts had already stopped.
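
A quick way to confirm that scenario from each broker host is to test basic connectivity to ZooKeeper; a sketch, assuming nc is installed and that the ruok four-letter command is whitelisted on the ZooKeeper side (it may be disabled by default in newer ZooKeeper releases):

# Check that the ZooKeeper client port is reachable from this broker host
nc -vz 192.168.240.70 2181

# Ask ZooKeeper whether it is running; a healthy server answers "imok"
echo ruok | nc 192.168.240.70 2181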

Simply restarting my brokers so that they re-registered with ZooKeeper solved my problem: on restart, each Kafka broker tells ZooKeeper "I'm here, waiting for your instructions", and ZooKeeper, recognizing the lost leaders, puts everything back in its right place.
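
For completeness, a minimal sketch of that recovery, assuming the brokers run under systemd as a unit named kafka (the unit name and paths are illustrative) and a Kafka version of 2.4 or newer for the last step (older releases ship kafka-preferred-replica-election.sh instead):

# On each affected broker host: restart the broker so it re-registers with ZooKeeper
sudo systemctl restart kafka

# From any host: confirm that all four broker IDs are registered again
bin/zookeeper-shell.sh 192.168.240.70:2181 ls /brokers/ids

# Optionally move partition leadership back to the preferred replicas
bin/kafka-leader-election.sh --bootstrap-server 192.168.240.70:9092 --election-type PREFERRED --all-topic-partitions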

I hope this is helpful.
