Prevent duplicate message publishing to a Kafka topic if the message is already present in the topic
Question
I am working on a middleware system that transfers data from one system to another.
We are using Kafka with Spring Boot. I recently started working on this system and have limited knowledge of Kafka.
I am looking for a way to prevent publishing a duplicate message to a Kafka topic if the same message is already present in the topic and has not yet been consumed.
For example -
Consider a message "ABC" that we have published to the Kafka topic and that has not yet been consumed. If we receive the same message "ABC" again, we simply skip publishing it.
The message "ABC" will be published again only if it is not currently present in the Kafka topic.
Note - We have no control over the other system that sends messages to us, so there are situations where we receive duplicate messages.
Thanks in advance for any ideas, solutions, or suggestions on achieving this.
Answer 1
Score: 0
Kafka has no control over this. It sees only byte arrays and does not deserialize data to compare anything; every record produced sits at a unique offset, and the broker exposes no API to check whether "ABC" exists without consuming the entire topic (which could easily be terabytes of data), and that would be a linear scan for every new event.
Therefore, you'll need some other system with fast, constant-time key/property lookups, such as Redis or an indexed MongoDB collection, that can tell you whether a value has been seen and/or processed before.
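As a rough sketch of that approach, the producer can do a constant-time "seen" check before sending. The class and method names below are illustrative, and the in-memory set is a stand-in for the external store; with Redis you would instead use an atomic set-if-absent with a TTL (e.g. Spring Data Redis's `opsForValue().setIfAbsent(key, value, ttl)`) so the key expires or is cleared once the message is consumed:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical producer-side deduplication guard. The in-memory set is a
// stand-in for an external constant-time store such as Redis; a real
// deployment would use an atomic set-if-absent with a TTL instead.
class DedupGuard {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true and records the key if the message has not been seen
    // before (i.e. it should be published); returns false for duplicates.
    boolean tryMarkSeen(String messageKey) {
        return seen.add(messageKey); // atomic: only one caller wins per key
    }

    // Called once the message has been consumed, so the same payload may
    // legitimately be published again later.
    void clearSeen(String messageKey) {
        seen.remove(messageKey);
    }
}
```

The producer would call `tryMarkSeen("ABC")` before `kafkaTemplate.send(...)` and skip the send when it returns false; clearing the key after consumption matches the requirement that "ABC" may be published again once it is no longer in the topic.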
Or, you write your consumer's processing logic in an idempotent way, so that duplicates don't affect anything. For example, a database upsert always overwrites the row with the latest seen information for the same ID, even if the entire payload matches.
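A minimal sketch of the idempotent alternative, assuming each message carries a stable ID: the handler upserts by that ID, so processing a duplicate simply rewrites the same entry and has no further effect. The map here is a stand-in for a database table with the ID as primary key (an `INSERT ... ON CONFLICT UPDATE` in SQL terms):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Idempotent consumer sketch: state is keyed by message ID, so handling
// the same message twice leaves exactly one entry behind. The map stands
// in for a database upsert keyed on the ID.
class IdempotentHandler {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    void handle(String messageId, String payload) {
        store.put(messageId, payload); // upsert: a duplicate just overwrites
    }

    int recordCount() {
        return store.size();
    }

    String get(String messageId) {
        return store.get(messageId);
    }
}
```

With this shape, duplicate deliveries from the upstream system need no producer-side filtering at all, which is often the simpler and more robust option.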
Comments