英文:
How does Flink countWindow work in detail
问题
我正在使用简单的示例学习Flink。
我已经根据这里调整了WindowWordCount
示例,并在这个简单的数据文件上运行它:
cat data.txt
a a b c c
我忽略了slideSize
,只使用了countWindow(windowSize)
(因此根据这篇文章,它被称为滚动窗口)用于多个窗口大小。我对结果感到困惑。
当windowSize = 1
时,输出为:
(a,1)
(a,1)
(b,1)
(c,1)
(c,1)
这完全合理。但当windowSize = 2
时,输出为:
(a,2)
(c,2)
为什么没有b
的计数?
当windowSize = 3
时,输出为空,这也让我不太理解。
请问是否有人可以帮我理解在windowSize = 2
和windowSize = 3
的情况下输出是如何产生的?
我的代码在https://gist.github.com/zyxue/bc566180d2b01f1e2e77c1bbe3a7c5e5
我使用类似以下命令运行它:
./bin/flink run /path/to/target/myflinkapp-1.0-SNAPSHOT.jar --input data.txt --output /tmp/out --window 1
英文:
I'm learning Flink with simple toy examples.
I have adapted the WindowWordCount
example from here and run it on this simple data file
cat data.txt
a a b c c
I ignored the slideSize
and only do countWindow(windowSize)
(hence it's called tumbling window according to this post for multiple window sizes. I'm confused by the results.
When windowSize = 1
, the output is
(a,1)
(a,1)
(b,1)
(c,1)
(c,1)
which makes total sense. But when windowSize = 2
, the output is
(a,2)
(c,2)
where is count for b
?
when windowSize = 3
, the output is empty, which I don't understand either.
Can anyone help me understand how the outputs are produced in the cases of windowSize = 2
and windowSize = 3
, please?
My code is on https://gist.github.com/zyxue/bc566180d2b01f1e2e77c1bbe3a7c5e5
And I run it with command like
./bin/flink run /path/to/target/myflinkapp-1.0-SNAPSHOT.jar --input data.txt --output /tmp/out --window 1
答案1
得分: 2
Trigger
for CountWindow
仅对完整的窗口触发窗口函数 — 换句话说,经过处理给定键的windowSize
个事件后,窗口才会触发。
例如,当windowSize = 2
,只有a
和c
有足够的事件。由于只有一个b
,作业会以b
的部分填充窗口结束。
如果您想为部分计数窗口生成报告,可以使用自定义触发器,该触发器还会对超时做出反应。
英文:
The Trigger
for CountWindow
only triggers the window function for complete windows -- in other words, after processing windowSize
events for a given key, then the window will fire.
For example, with windowSize = 2
, only for a
and c
are there enough events. Since there is only one b
, the job ends with a partially filled window for b
.
You can use a custom trigger that also reacts to a timeout if you want to generate reports for partial count windows.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论