Apache Beam - what are the limits of the Deduplication function

Question

I have a Google Dataflow pipeline built using Apache Beam. The application receives about 50M records every day. To ignore duplicate records, we are planning to use the Deduplication function provided by the Beam framework.

The documentation states neither the maximum input count that the Deduplication function can handle, nor the maximum duration for which it can persist the data.

Would it be a good design to simply throw about 50M records, of which around half would be duplicates, at the deduplication function, and keep the persistence duration at 7 days?
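
Roughly, the plan would look something like the sketch below, assuming Beam's Java Deduplicate transform (org.apache.beam.sdk.transforms.Deduplicate) is what the documentation refers to; the class name, the Create.of stand-in source, and the string record IDs are placeholders for the real 50M-record input:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Deduplicate;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class DeduplicateSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Stand-in for the real source of ~50M daily records.
    PCollection<String> records =
        pipeline.apply("ReadRecords", Create.of("id-1", "id-2", "id-1", "id-3"));

    // Keep the first occurrence of each value and drop any repeat that
    // arrives within the configured duration (7 days here).
    PCollection<String> deduped =
        records.apply(
            "Deduplicate7Days",
            Deduplicate.<String>values().withDuration(Duration.standardDays(7)));

    pipeline.run().waitUntilFinish();
  }
}
```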

Answer 1

Score: 1

The deduplication function, as described in the link that you provided, performs deduplication per window.

If you have a window of 1 hour and your duplicates arrive every 3 hours, the function won't deduplicate them, because they are in different windows.

So you can define a window of 1 day or more; there is no hard limit. The data is stored on the workers (for durability) and also kept in memory (for efficiency). The more data you have, the bigger and stronger the worker configuration must be to handle that volume.
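
As an illustration of "define a window of 1 day or more", here is a minimal sketch: Window.into with FixedWindows is the standard Beam Java windowing API, but the class name, the Create.of stand-in source, and the exact combination with the Deduplicate transform are assumptions on my part, not something the documentation prescribes:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Deduplicate;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class DailyWindowDedupSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Stand-in source; the real pipeline's elements would carry event timestamps.
    PCollection<String> records =
        pipeline.apply("ReadRecords", Create.of("id-1", "id-2", "id-1"));

    PCollection<String> deduped =
        records
            // Widen the window to one day so duplicates arriving a few hours apart
            // still fall into the same window (the point made above).
            .apply("DailyWindows",
                Window.<String>into(FixedWindows.of(Duration.standardDays(1))))
            // Drop repeats seen within the configured duration.
            .apply("Dedup",
                Deduplicate.<String>values().withDuration(Duration.standardDays(1)));

    pipeline.run().waitUntilFinish();
  }
}
```

Duplicates that land in different daily windows (one record just before midnight and its duplicate just after) would still survive, which is the per-window limitation described above.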
