Kafka commit during rebalancing
Question
The scenario:
- Kafka version 2.4.1.
- Kafka partitions are actively processing messages.
- CPU usage is low, memory usage is moderate, and no throttling is observed.
- Golang applications deployed on k8s, using Confluent's Go client version 1.7.0.
- k8s deletes some of the pods, and the Kafka consumer group goes into rebalancing.
- The message that was being processed during this rebalance gets stuck midway and takes around 17 minutes to finish; the usual processing time is 3-4 seconds at most.
- No DB throttling; the load is not even 10% of our peak.
- The k8s pods have 1 core and 1 GB of memory.
- Messages are consumed and processed in the same thread.
- Earlier we found that one of the brokers in the 6-node cluster was unhealthy and replaced it, after which we started facing this issue.
Question: Why did the message get stuck? Is it because rebalancing made the processing thread hang, or is it something else?
Thanks in advance for your answers!
Answer 1
Score: 1
Messages are stuck because of the rebalancing that is happening in your consumer group (CG). Rebalancing is a normal procedure in Kafka and is always triggered when a member joins or leaves the CG. During a rebalance, consumers stop processing messages for some time, so events from the topic are processed with some delay. But if the CG gets stuck in the PreparingRebalance state, you will not process any data.
You can identify the CG state by running a Kafka command, for example:
kafka-consumer-groups.sh --bootstrap-server $BROKERS:$PORT --group $CG --describe --state
It should show you the status of the CG, for example:
GROUP COORDINATOR (ID) ASSIGNMENT-STRATEGY STATE #MEMBERS
name-of-consumer-group brokerX.com:9092 (1) Empty 0
In the example above, the STATE is Empty.
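If you want to check the state from a script rather than by eye, the output above can be parsed with a few lines of Go. This is only a minimal sketch: the `GroupSummary` type and `ParseGroupSummary` function are hypothetical names, and the parser assumes the layout shown in the example, where the last column is #MEMBERS and the one before it is STATE (the ASSIGNMENT-STRATEGY column may be blank, as it is here).

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// GroupSummary holds the columns of interest from
// `kafka-consumer-groups.sh --describe --state` (hypothetical helper type).
type GroupSummary struct {
	Group   string
	State   string
	Members int
}

// ParseGroupSummary returns the first data row of the command output.
// Assumption: the last field is #MEMBERS and the second-to-last is STATE,
// so a blank ASSIGNMENT-STRATEGY column does not shift the result.
func ParseGroupSummary(output string) (GroupSummary, error) {
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		// Skip blank lines, short lines, and the header row.
		if len(fields) < 3 || fields[0] == "GROUP" {
			continue
		}
		members, err := strconv.Atoi(fields[len(fields)-1])
		if err != nil {
			continue // not a data row
		}
		return GroupSummary{
			Group:   fields[0],
			State:   fields[len(fields)-2],
			Members: members,
		}, nil
	}
	return GroupSummary{}, errors.New("no consumer group row found")
}
```

Feeding it the two example lines should give a summary with State "Empty" and Members 0.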
A consumer group can be in one of 5 states:
Stable - the CG is stable and all members are connected successfully.
Empty - there are no members in the group (this usually means the module is down or has crashed).
PreparingRebalance - members are joining the CG (this may indicate a client issue if members keep crashing, but it is also the state a CG passes through before it becomes Stable).
CompletingRebalance - the rebalance started in PreparingRebalance is being completed.
Dead - the consumer group has no members and its metadata has been removed.
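For monitoring code, the five states can be written down as constants. This is just an illustrative sketch: the type name `GroupState` and the helper methods are my own, only the state strings come from Kafka.

```go
package main

// GroupState is a consumer group state as printed in the STATE column.
type GroupState string

const (
	Stable              GroupState = "Stable"
	Empty               GroupState = "Empty"
	PreparingRebalance  GroupState = "PreparingRebalance"
	CompletingRebalance GroupState = "CompletingRebalance"
	Dead                GroupState = "Dead"
)

// IsRebalancing reports whether the group is in the middle of a rebalance.
func (s GroupState) IsRebalancing() bool {
	return s == PreparingRebalance || s == CompletingRebalance
}

// HasNoMembers reports whether the state implies a group without members.
func (s GroupState) HasNoMembers() bool {
	return s == Empty || s == Dead
}
```

A group that reports `IsRebalancing()` for many consecutive checks is the situation worth investigating.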
To determine whether a PreparingRebalance problem lies with the cluster or with the client, stop the client and run the command again to check the CG state. If the CG still shows members, you have to restart the broker that the command output lists as the coordinator of that CG (brokerX.com:9092 in the example). If the CG becomes Empty once you stop all clients connected to it, something is off with the client code or data that causes members to leave and rejoin the CG, which is why you see the CG permanently in PreparingRebalance; you will need to investigate why that happens.
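The check described above can be written as a small decision function. This is a rough sketch of the rule of thumb, not an official diagnostic: `Verdict` and `DiagnoseAfterStop` are hypothetical names, and the inputs are the STATE and #MEMBERS values read from the command output after all clients have been stopped.

```go
package main

// Verdict is the outcome of the check (hypothetical type for this sketch).
type Verdict string

const (
	SuspectCoordinator Verdict = "suspect the coordinator broker"
	SuspectClient      Verdict = "suspect the client code or data"
	Inconclusive       Verdict = "inconclusive"
)

// DiagnoseAfterStop applies the rule of thumb to the group state observed
// AFTER every client of the group has been stopped.
func DiagnoseAfterStop(state string, members int) Verdict {
	switch {
	case members > 0:
		// No client is running, yet the coordinator still lists members.
		return SuspectCoordinator
	case state == "Empty":
		// The group drained cleanly, so the churn came from the clients.
		return SuspectClient
	default:
		// Any other combination is not covered by the rule of thumb.
		return Inconclusive
	}
}
```

For example, `DiagnoseAfterStop("PreparingRebalance", 3)` points at the coordinator broker, while `DiagnoseAfterStop("Empty", 0)` points at the client.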
I mention this because, from what I recall, there was a perpetual-rebalance bug in Kafka 2.4.1 that was fixed in 2.4.1.1 (the Amazon MSK release described in the second link). You can read about it here:
- https://issues.apache.org/jira/browse/KAFKA-9752
- https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-msk-now-offers-version-2-4-1-1-fixing-a-perpetual-rebalance-bug-in-apache-kafka-2-4-1/
My troubleshooting steps should show you how to verify whether you are hitting this bug or whether it is just bad code.