Apache Beam - what are the limits of the Deduplication function

Question

I have a Google Dataflow pipeline built using Apache Beam. The application receives about 50M records every day. To ignore duplicate records, we are planning to use the Deduplication function provided by the Beam framework.

The documentation states neither the maximum input count that the Deduplication function can handle, nor the maximum duration for which it can persist the data.

Would it be a good design to simply throw about 50M records, of which around half would be duplicates, at the deduplication function, and keep the persistence duration at 7 days?
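
Roughly, the plan would look something like the sketch below, assuming Beam's Java Deduplicate transform (org.apache.beam.sdk.transforms.Deduplicate) is what the documentation refers to; the class name, the Create.of stand-in source, and the string record IDs are placeholders for the real 50M-record input:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Deduplicate;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class DeduplicateSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Stand-in for the real source of ~50M daily records.
    PCollection<String> records =
        pipeline.apply("ReadRecords", Create.of("id-1", "id-2", "id-1", "id-3"));

    // Keep the first occurrence of each value and drop any repeat that
    // arrives within the configured duration (7 days here).
    PCollection<String> deduped =
        records.apply(
            "Deduplicate7Days",
            Deduplicate.<String>values().withDuration(Duration.standardDays(7)));

    pipeline.run().waitUntilFinish();
  }
}
```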

Answer 1

Score: 1

The deduplication function, as described in the link that you provided, performs deduplication per window.

If you have a window of 1 hour and your duplicates arrive every 3 hours, the function won't deduplicate them, because they are in different windows.

So you can define a window of 1 day or more; there is no hard limit. The data is stored on the workers (for durability) and also kept in memory (for efficiency). The more data you have, the bigger and stronger the worker configuration must be to handle that volume.
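
As an illustration of "define a window of 1 day or more", here is a minimal sketch: Window.into with FixedWindows is the standard Beam Java windowing API, but the class name, the Create.of stand-in source, and the exact combination with the Deduplicate transform are assumptions on my part, not something the documentation prescribes:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Deduplicate;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class DailyWindowDedupSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Stand-in source; the real pipeline's elements would carry event timestamps.
    PCollection<String> records =
        pipeline.apply("ReadRecords", Create.of("id-1", "id-2", "id-1"));

    PCollection<String> deduped =
        records
            // Widen the window to one day so duplicates arriving a few hours apart
            // still fall into the same window (the point made above).
            .apply("DailyWindows",
                Window.<String>into(FixedWindows.of(Duration.standardDays(1))))
            // Drop repeats seen within the configured duration.
            .apply("Dedup",
                Deduplicate.<String>values().withDuration(Duration.standardDays(1)));

    pipeline.run().waitUntilFinish();
  }
}
```

Duplicates that land in different daily windows (one record just before midnight and its duplicate just after) would still survive, which is the per-window limitation described above.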
