Clickhouse: deduplication before rollup (after reading from Kafka)
Question
A common use case in data processing is deduplication and aggregation/rollups. ClickHouse supports both (ReplacingMergeTree: deduplication, SummingMergeTree: aggregation). We are struggling to put the two together: Materialized Views cannot be used to move the data from the deduplicated table to the rollup table, because they trigger on insert, which happens before the deduplication (see the note here).
Is there a way to achieve deduplication first and then do a rollup in ClickHouse?
Approaches we have been thinking of:
- Doing the deduplication on insert (e.g. a Materialized View which reads from Kafka). The already deduplicated data would be written to a SummingMergeTree table, which then does the rollup. The deduplication could be done using standard SQL techniques such as group by, distinct, or a window function with row_number and filtering by rownum = 1 afterwards. The downside of this approach is that deduplication is only applied within the blocks read from Kafka, but not outside of them. The deduplication window is not adjustable. (See the first sketch below.)
- Use a ReplacingMergeTree table, letting ClickHouse do the deduplication, but additionally run an external, periodic scheduler to move the data into a SummingMergeTree table. "Moving" would be an INSERT INTO .. SELECT statement using FINAL (I know, it shouldn't be used) or some other SQL deduplication as outlined above. (See the second sketch below.)
In all the documentation, blog posts and YouTube videos I have read and watched so far, I haven't found a recommended (if possible, ClickHouse-only) way to first deduplicate a Kafka stream by id and then perform an aggregation on the data.
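To make the two options concrete, here is a minimal sketch of option one, assuming a hypothetical topic with columns id, ts and value; all table names, column names and Kafka settings are illustrative, not taken from the question. The materialized view deduplicates with GROUP BY / argMax, so only duplicates arriving in the same consumed block are collapsed:

```sql
-- Option 1 sketch (hypothetical schema and Kafka settings)
CREATE TABLE events_kafka
(
    id    String,
    ts    DateTime,
    value UInt64
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse-events',
         kafka_format      = 'JSONEachRow';

CREATE TABLE events_rollup
(
    id    String,
    day   Date,
    value UInt64
)
ENGINE = SummingMergeTree
ORDER BY (id, day);

-- Deduplicate per consumed block (keep the latest row per id),
-- then let the SummingMergeTree sum the rows across inserts.
CREATE MATERIALIZED VIEW events_rollup_mv TO events_rollup AS
SELECT
    id,
    toDate(max(ts))   AS day,
    argMax(value, ts) AS value
FROM events_kafka
GROUP BY id;
```

And a sketch of option two under the same assumed schema: ClickHouse deduplicates inside a ReplacingMergeTree, and an external scheduler (cron, Airflow, ...) periodically moves the data into the rollup table. The time window in the WHERE clause is only a placeholder for whatever watermark logic the scheduler keeps:

```sql
-- Option 2 sketch (hypothetical schema)
CREATE TABLE events_dedup
(
    id    String,
    ts    DateTime,
    value UInt64
)
ENGINE = ReplacingMergeTree(ts)   -- keeps the row with the highest ts per id
ORDER BY id;

-- Run periodically by an external scheduler, not by ClickHouse itself.
-- FINAL deduplicates at read time (the question notes it should be used
-- with care); a GROUP BY / argMax query would work here as well.
INSERT INTO events_rollup
SELECT
    id,
    toDate(ts) AS day,
    value
FROM events_dedup FINAL
WHERE ts >= now() - INTERVAL 1 HOUR;   -- illustrative window, not real watermark handling
```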
Answer 1
Score: 1
Additionally, for your option two, you'll be able in the future to use Refreshable Materialized Views (not merged yet) for your scheduling.
https://github.com/ClickHouse/ClickHouse/issues/33919
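For reference, the syntax discussed in that issue looks roughly like the sketch below. This is speculative: the feature was not merged when this answer was written, so the final syntax and semantics may differ; the table names reuse the hypothetical schema from the sketches in the question.

```sql
-- Hypothetical refreshable materialized view: instead of triggering per
-- insert, it would re-run the whole SELECT on a schedule.
CREATE MATERIALIZED VIEW events_rollup_refresh
REFRESH EVERY 10 MINUTE
TO events_rollup
AS
SELECT
    id,
    toDate(max(ts))   AS day,
    argMax(value, ts) AS value
FROM events_dedup
GROUP BY id;
```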
Answer 2
Score: 0
If the duplicates occur within a batch, you could try making the target table for the inserts a ReplacingMergeTree. A materialized view would then trigger on this table and use FINAL to insert into a SummingMergeTree. I believe this is your option 1 - your deduplication window is as large as the insert block size.
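Since the insert block is the deduplication window in this approach, the size of the blocks consumed from Kafka matters. A hedged sketch of how that can be influenced with the Kafka engine's kafka_max_block_size setting (the value and the table definition are illustrative, reusing the hypothetical schema from the question's sketches):

```sql
-- Illustrative only: a larger consumed block lets more duplicates land in
-- the same block, which is the deduplication window in this approach.
CREATE TABLE events_kafka_large_blocks
(
    id    String,
    ts    DateTime,
    value UInt64
)
ENGINE = Kafka
SETTINGS kafka_broker_list    = 'kafka:9092',
         kafka_topic_list     = 'events',
         kafka_group_name     = 'clickhouse-events',
         kafka_format         = 'JSONEachRow',
         kafka_max_block_size = 1048576;  -- messages per consumed block, illustrative value
```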