如何在Flink中执行带有更新的多窗口聚合?

huangapple go评论65阅读模式
英文:

How to perform multiple window aggregation with update in Flink?

问题

我有一个使用情况,在这种情况下,我收到包含不同信息集的事件流,希望对它们执行聚合。对于这些聚合中的每一个,都需要多个滚动窗口,例如:每日、每周、每月、每年等等。
最初的聚合只是对计数的基本相加,但以后可能会涉及这些事件之间的一些分析/连接处理。因此,如果事件A每天发生一次,另一个事件B每周发生一次,结果可能是这样的:

每日
     A:1
     B:1(仅限于接收当天)
每周
     A:7
     B:1
每月
     A:30(30天的月份)
     B:4(某些情况下为5)
每年
     A:365
     B:52(某些情况下为53)

这个用例只涉及滚动窗口,而不是滑动窗口,我正在寻找如何实现这个用例。主要问题在于,我不想等到窗口结束,希望每隔大约10分钟就接收更新。
我查看了Flink,有一些方法可以做到这一点,比如使用ProcessWindow函数、增量聚合、流切片、广播状态等等。但由于我对Flink还比较新,我不太确定应该使用什么,是否有任何我忽视的问题。

如果有人能帮忙解决这个问题,那将非常棒。

英文:

I have a use case wherein I'm receiving a stream of events containing different sets of information and want to perform aggregations on them. For each of these aggregations, there are multiple tumbling windows which are needed eg: Daily, Weekly, Monthly, Yearly etc.
The aggregations initially are basic addition of the counts seen but could later be some analytics/joins handling across these events. So if an event A comes once everyday and another event B comes once every week, the result would be something like this:

Daily
     A: 1
     B: 1 (Only for the day it was received)
Weekly
     A: 7
     B: 1
Monthly
     A: 30 (30 day month)
     B: 4 (5 in some cases)
Yearly
     A: 365
     B: 52 (53 in some cases)

The usecase is only around tumbling windows and not sliding windows and I'm looking at how to implement this usecase. The main problem is that I don't want to wait until the end of the window and want to keep receiving updates every 10 minutes or so.
I took a look at flink and there are some ways we could do it such as using a ProcessWindow function, incremental aggregation, stream slicing, broadcasting state etc. but since I'm pretty new to flink I'm not completely sure on what to use and if there are any pitfalls I'm missing.

Would be great if anyone could help me out.

答案1

得分: 1

实现 Flink 中的窗口选择有以下几种方式:

  1. Flink SQL
  2. DataStream 窗口 API
  3. ProcessFunction

我认为你每 10 分钟生成更新的需求不适合使用 SQL。

关于窗口 API,内置的 TimeWindow 窗口分配器不支持月份和年份,而且每 10 分钟生成更新的要求需要自定义触发器。虽然付出足够的努力可以克服这些限制,但我认为不值得。

我建议使用 ProcessFunction 来实现。Flink 文档中嵌入的培训中有如何使用 ProcessFunction 实现滚动时间窗口的示例,你可以将其作为起点,扩展该示例以满足你的需求应该不会很困难。

英文:

The choices for implementing windows on Flink are

  1. Flink SQL
  2. the DataStream Window API
  3. a ProcessFunction

I don't think your requirement to produce updates every 10 minutes is a good fit for SQL.

As for the Window API, the built-in TimeWindow window assigner doesn't support months and years, and the requirement to product updates every 10 minutes requires a custom Trigger. With enough effort you could overcome these limitations, but I don't think it's worth it.

I would instead implement this using a ProcessFunction. The training that is embedded the Flink docs has an example of how to use a process function to implement tumbling time windows that you can use as a starting point. Extending that example to meet your requirements shouldn't be very difficult.

huangapple
  • 本文由 发表于 2020年10月1日 17:27:41
  • 转载请务必保留本文链接:https://go.coder-hub.com/64152528.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定