KStream的join立即触发连接函数,如何将其延迟到窗口结束时?

huangapple go评论76阅读模式
英文:

KStream join fires join function instantly, how to delay it at the end of window?

问题

按照综合文章《跨越流的连接》1 的解释,Outer KStream-KStream Join 会在元素到达时立即发射元素,甚至在等待另一个 K-Stream 中的匹配之前就发射,这样做的缺点是它会将未连接的事件与每个连接的事件一起复制。

您能否提供任何实现事件连接的其他方法,既不会复制(如外连接),也不会丢失(如内连接)?


根据相同的点击-查看事件示例

KStream<String, JsonNode> joinedEventsStream = 
    clickEventsStream.outerJoin(viewEventsStream,
        (clickEvent, viewEvent) -> processJoin(clickEvent, viewEvent), /* 如果找到匹配项,立即触发,*/
                                                                      /* 否则在2秒后触发 */
        JoinWindows.of(Duration.ofSeconds(2L)), StreamJoined.with(Serdes.String(), jsonSerde, jsonSerde)
    );

预期结果如下所示:

  • 点击事件在查看之后1秒到达 - 连接的事件 (A,A)
  • 点击事件在查看之后11秒到达 - 每个事件都不同。每个事件在其到达后2秒(窗口大小)之后触发。(B,null) (null,B)
  • 查看事件在点击之后1秒到达 - 连接的事件 (C,C)
  • 有一个查看事件但没有点击事件 - 在其到达后2秒内未连接的事件 (D,null)
  • 有一个点击事件但没有查看事件 - 在其到达后2秒内未连接的事件 (null,E)

KStream的join立即触发连接函数,如何将其延迟到窗口结束时?

英文:

As explained in the comprehensive article Crossing the Streams. The Outer KStream-KStream Join emits element as soon as it arrives, even before waiting for its match in another K-Stream. Downside of this is that it duplicates not-joined event along with every joined event.

Can you suggest any alternate way to implement a join of events without duplicating(as in outer join) or missing(as in inner join)?


As per the same click-view events example:

KStream&lt;String, JsonNode&gt; joinedEventsStream = 
     clickEventsStream.outerJoin(viewEventsStream,
			(clickEvent, viewEvent) -&gt; processJoin(clickEvent, viewEvent),/* Fire quickly if match found,*/
                                                                          /* else fire after 2 seconds */
			JoinWindows.of(Duration.ofSeconds(2L)),	StreamJoined.with(Serdes.String(), jsonSerde, jsonSerde)
	);

Expected results are explained below:

  • a click event arrives 1 sec after the view - Joined events (A,A)
  • a click event arrives 11 sec after the view - Different events for each. Each one after 2 seconds(Window size) of its arrival.(B,null) (null,B)
  • a view Event arrives 1 sec after the click - Joined events (C,C)
  • there is a view event but no click - Not-joined event after 2 seconds of its arrival (D,null)
  • there is a click event but no view - Not-joined event after 2 seconds of its arrival (null,E)

KStream的join立即触发连接函数,如何将其延迟到窗口结束时?

答案1

得分: 1

目前(Kafka 2.7.0版本)的行为与博文中描述的一致。这个问题已经多次出现,我们最近创建了一个工单来更改这个行为:https://issues.apache.org/jira/browse/KAFKA-10847

目前,您可以在连接操作之后使用下游的有状态操作,将记录缓冲直到窗口结束(或者更好的是,窗口关闭,即窗口结束加上优雅期)为止。这样可以让您过滤掉不必要的左/外连接结果。

英文:

Atm (Kafka 2.7.0) the behavior is as describe in the blog post. This question came up multiple times already, and we create a ticket recently to change the behavior: https://issues.apache.org/jira/browse/KAFKA-10847

Atm, you could use a downstream stateful operation after the join, to buffer records until the window end (or maybe better, window close, ie, window end plus grace period) is reached. This allows you to filter out spurious left/outer join result.

huangapple
  • 本文由 发表于 2020年9月16日 15:59:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/63915605.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定