Can Kafka be used as a distributed work queue?
Question
I'm considering using Kafka as a distributed work queue from which multiple workers can retrieve tasks. My original design looks like this:
Work Producer ---> Kafka topic ------worker 1
|
|__worker 2
...
|__worker n
This design has the following problems:
- If a worker takes a task from the topic and immediately commits the offset, the task may not be reprocessed if the worker fails.
- If a worker takes a task from the topic and commits the offset only when it finishes, other workers may also take this task and process it. If the task is long-running, almost all workers will take the same task and process it to completion, which defeats the purpose of distributing the work.
I'm looking for a way to "mark" a task in the queue as "in progress" so that no one else consumes it, but without committing the offset (because the task may fail and need reprocessing). Is this possible to implement?
Answer 1
Score: 3
> If a worker takes a task from the topic and immediately commits the offset, the task may not be reprocessed if the worker fails.
In that case I recommend using manual commits and disabling auto-commit by setting enable.auto.commit to false in your consumer configuration. The worker then commits an offset only after the task has been processed successfully.
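To make the "process first, commit afterwards" idea concrete, here is a minimal sketch. It does not talk to a real broker: `FakePartition`, `run_worker`, and `flaky_handler` are hypothetical stand-ins invented for illustration, and the broker address and group name are assumptions. Only the keys in `CONSUMER_CONFIG` are real consumer setting names.

```python
from dataclasses import dataclass

# Real consumer setting names; the values are illustrative assumptions.
CONSUMER_CONFIG = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "workers",                  # hypothetical group name
    "enable.auto.commit": False,            # commit manually after processing
}

@dataclass
class FakePartition:
    """In-memory stand-in for one partition and its committed offset."""
    tasks: list
    committed: int = 0  # offset of the next task to hand out after a restart

def run_worker(partition, handler):
    """Process tasks from the committed offset, committing after each success."""
    offset = partition.committed
    while offset < len(partition.tasks):
        handler(partition.tasks[offset])  # may raise: offset stays uncommitted
        offset += 1
        partition.committed = offset      # plays the role of the manual commit

partition = FakePartition(tasks=["a", "b", "c"])
processed = []
state = {"crashed": False}

def flaky_handler(task):
    """Fails once on task "b" to simulate a worker crash mid-task."""
    if task == "b" and not state["crashed"]:
        state["crashed"] = True
        raise RuntimeError("worker crashed")
    processed.append(task)

try:
    run_worker(partition, flaky_handler)   # crashes while handling "b"
except RuntimeError:
    pass
offset_after_crash = partition.committed   # only "a" was committed

run_worker(partition, flaky_handler)       # the restart resumes at "b"
```

Because the commit happens only after the handler returns, the crashed task is picked up again on restart. The trade-off is at-least-once delivery: a task that finished but crashed before its commit would run a second time, so handlers should tolerate repeats.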
> If a worker takes a task from the topic and commits the offset only when it finishes, other workers may also take this task and process it. If the task is long-running, almost all workers will take the same task and process it to completion, which defeats the purpose of distributing the work.
You can handle this scenario by giving your topic multiple partitions and putting your consumers in a consumer group. In Kafka, each partition is assigned to at most one consumer within a consumer group at any given time.
That means that, as long as all your consumers (or "workers") belong to the same consumer group, two workers will not read and process the same message concurrently during normal operation. If a worker fails or the group rebalances, its partitions are reassigned and any uncommitted messages are delivered again to the new owner, which is exactly the reprocessing behaviour you want.
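The partition ownership rule can be illustrated with a small simulation. This is a simplified round-robin sketch, not Kafka's actual assignor implementation; `assign_partitions` and the worker names are hypothetical.

```python
def assign_partitions(partitions, workers):
    """Give each partition to exactly one worker, round-robin (simplified)."""
    assignment = {worker: [] for worker in workers}
    for index, partition in enumerate(partitions):
        owner = workers[index % len(workers)]
        assignment[owner].append(partition)
    return assignment

partitions = list(range(6))  # a topic with 6 partitions
workers = ["worker-1", "worker-2", "worker-3"]

# Every partition has exactly one owner, so no task is handed to two workers.
assignment = assign_partitions(partitions, workers)

# If worker-2 leaves the group, its partitions are redistributed to the rest.
after_failure = assign_partitions(partitions, ["worker-1", "worker-3"])
```

One consequence of this design is that the partition count caps parallelism: with more workers than partitions, the extra workers in the group receive no partitions and sit idle.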