Clickhouse: deduplication before rollup (after reading from Kafka)
Question
A common use case in data processing is deduplication and aggregation/rollups. ClickHouse supports both (ReplacingMergeTree: deduplication, SummingMergeTree: aggregation). We are struggling to put the two together: Materialized Views cannot be used to move the data from the deduplicated table to the rollup table, because they trigger on insert, which happens before the deduplication (see the note here).
Is there a way to achieve deduplication first and then do a rollup in ClickHouse?
Approaches we have been thinking of:
- Doing the deduplication on insert (e.g. a Materialized View which reads from Kafka). The already deduplicated data would be written to a SummingMergeTree table, which then does the rollup. The deduplication could be done using standard SQL techniques such as group by, distinct, or a window function with row_number and filtering by rownum = 1 afterwards. The downside of this approach is that deduplication is only applied within the blocks read from Kafka, but not outside of them. The deduplication window is not adjustable. (See the first sketch below.)
- Use a ReplacingMergeTree table, letting ClickHouse do the deduplication, but additionally run an external, periodic scheduler to move the data into a SummingMergeTree table. "Moving" would be an INSERT INTO .. SELECT statement using FINAL (I know, it shouldn't be used) or some other SQL deduplication as outlined above. (See the second sketch below.)
In all the documentation, blog posts and YouTube videos I have read and watched so far, I haven't found a recommended (if possible, ClickHouse-only) way to first deduplicate a Kafka stream by id and then perform an aggregation on the data.
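To make the two options concrete, here is a minimal sketch of option one, assuming a hypothetical topic with columns id, ts and value; all table names, column names and Kafka settings are illustrative, not taken from the question. The materialized view deduplicates with GROUP BY / argMax, so only duplicates arriving in the same consumed block are collapsed:

```sql
-- Option 1 sketch (hypothetical schema and Kafka settings)
CREATE TABLE events_kafka
(
    id    String,
    ts    DateTime,
    value UInt64
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse-events',
         kafka_format      = 'JSONEachRow';

CREATE TABLE events_rollup
(
    id    String,
    day   Date,
    value UInt64
)
ENGINE = SummingMergeTree
ORDER BY (id, day);

-- Deduplicate per consumed block (keep the latest row per id),
-- then let the SummingMergeTree sum the rows across inserts.
CREATE MATERIALIZED VIEW events_rollup_mv TO events_rollup AS
SELECT
    id,
    toDate(max(ts))   AS day,
    argMax(value, ts) AS value
FROM events_kafka
GROUP BY id;
```

And a sketch of option two under the same assumed schema: ClickHouse deduplicates inside a ReplacingMergeTree, and an external scheduler (cron, Airflow, ...) periodically moves the data into the rollup table. The time window in the WHERE clause is only a placeholder for whatever watermark logic the scheduler keeps:

```sql
-- Option 2 sketch (hypothetical schema)
CREATE TABLE events_dedup
(
    id    String,
    ts    DateTime,
    value UInt64
)
ENGINE = ReplacingMergeTree(ts)   -- keeps the row with the highest ts per id
ORDER BY id;

-- Run periodically by an external scheduler, not by ClickHouse itself.
-- FINAL deduplicates at read time (the question notes it should be used
-- with care); a GROUP BY / argMax query would work here as well.
INSERT INTO events_rollup
SELECT
    id,
    toDate(ts) AS day,
    value
FROM events_dedup FINAL
WHERE ts >= now() - INTERVAL 1 HOUR;   -- illustrative window, not real watermark handling
```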
Answer 1
Score: 1
Additionally, for your option two, you'll be able in the future to use Refreshable Materialized Views (not merged yet) for your scheduling.
https://github.com/ClickHouse/ClickHouse/issues/33919
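For reference, the syntax discussed in that issue looks roughly like the sketch below. This is speculative: the feature was not merged when this answer was written, so the final syntax and semantics may differ; the table names reuse the hypothetical schema from the sketches in the question.

```sql
-- Hypothetical refreshable materialized view: instead of triggering per
-- insert, it would re-run the whole SELECT on a schedule.
CREATE MATERIALIZED VIEW events_rollup_refresh
REFRESH EVERY 10 MINUTE
TO events_rollup
AS
SELECT
    id,
    toDate(max(ts))   AS day,
    argMax(value, ts) AS value
FROM events_dedup
GROUP BY id;
```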
Answer 2
Score: 0
If the duplicates occur within a batch, you could try making the target table for the inserts a ReplacingMergeTree. A materialized view would then trigger on this table and use FINAL to insert into a SummingMergeTree. I believe this is your option 1 - your deduplication window is as large as the insert block size.
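Since the insert block is the deduplication window in this approach, the size of the blocks consumed from Kafka matters. A hedged sketch of how that can be influenced with the Kafka engine's kafka_max_block_size setting (the value and the table definition are illustrative, reusing the hypothetical schema from the question's sketches):

```sql
-- Illustrative only: a larger consumed block lets more duplicates land in
-- the same block, which is the deduplication window in this approach.
CREATE TABLE events_kafka_large_blocks
(
    id    String,
    ts    DateTime,
    value UInt64
)
ENGINE = Kafka
SETTINGS kafka_broker_list    = 'kafka:9092',
         kafka_topic_list     = 'events',
         kafka_group_name     = 'clickhouse-events',
         kafka_format         = 'JSONEachRow',
         kafka_max_block_size = 1048576;  -- messages per consumed block, illustrative value
```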