Can I avoid network shuffle when creating a KeyedStream from a pre-partitioned Kinesis Data Stream in Apache Flink?
Is it possible to create a `KeyedStream` from a pre-sharded/pre-partitioned Kinesis Data Stream without the need for a network shuffle (i.e. using `reinterpretAsKeyedStream` or something similar)?

- If that is not possible (i.e. the only reliable way is to consume from Kinesis and then use `keyBy`), then is network shuffling at least minimized by doing a `keyBy` on the field that the source is sharded by (e.g. `env.addSource(source).keyBy(pojo -> pojo.getTransactionId())`, where the source is a Kinesis data stream that is sharded by `transactionId`)?
- If the above is possible, what are the limitations?
What I've Learned so Far
- The functionality I am describing is already implemented by `reinterpretAsKeyedStream`, but this feature is experimental and seems to have significant drawbacks (as per the discussions in the Stack Overflow posts below)
  - `reinterpretAsKeyedStream` docs: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/experimental/
  - https://stackoverflow.com/questions/72629086/using-keyby-vs-reinterpretaskeyedstream-when-reading-from-kafka
  - https://stackoverflow.com/questions/73278930/apache-flink-how-to-align-flink-and-kafka-sharding
- In addition to the above, all the discussions related to `reinterpretAsKeyedStream` that I've found are in the context of Kafka, so I'm not sure how the outcomes differ for a Kinesis Data Stream.
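For reference, the two approaches discussed above look roughly like this in a Flink job. This is a sketch only, assuming the `flink-connector-kinesis` dependency; the stream name `transactions`, the region, and the `extractTransactionId` helper are hypothetical placeholders, and running it requires a Flink runtime:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class KinesisKeyedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");

        DataStream<String> source = env.addSource(
                new FlinkKinesisConsumer<>("transactions", new SimpleStringSchema(), props));

        // Option 1: an explicit keyBy -- always correct, but forces a network
        // shuffle even when the stream is already sharded by the same field.
        KeyedStream<String, String> keyed =
                source.keyBy(record -> extractTransactionId(record));

        // Option 2 (experimental): reinterpretAsKeyedStream asserts that the
        // data is already partitioned exactly as Flink's own keyBy would
        // partition it. If that assumption does not hold -- and for Kinesis
        // shards it generally does not -- results are silently incorrect.
        KeyedStream<String, String> reinterpreted =
                DataStreamUtils.reinterpretAsKeyedStream(
                        source, record -> extractTransactionId(record));

        env.execute("kinesis-keyed-job");
    }

    // Hypothetical helper: pull the transactionId out of the raw record.
    private static String extractTransactionId(String record) {
        return record.split(",")[0];
    }
}
```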
Context of my Application
- Re. configurations: both the Kinesis Data Stream and Flink will be hosted serverlessly, and will automatically scale up/down depending on load (which, as I understand it, means that `reinterpretAsKeyedStream` cannot be used).
Any help/insight is much appreciated, thanks!
Answer 1
Score: 2
I don't believe there's any way to easily do what you want, at least not in a way that's resilient to changes in the parallelism of your source and your cluster. I have used helicopter stunts to achieve something similar to this, but it involved fragile code (depends on exactly how Flink handles partitioning).
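The fragility the answer alludes to can be made concrete: Kinesis routes a record to a shard by taking the MD5 of its partition key over a 128-bit hash-key space, while Flink routes a key to a key group (and then a subtask) via its own hash. The two mappings use unrelated hash functions, so a shard's records scatter across subtasks. The sketch below uses the real Kinesis MD5 routing for evenly split shards; the key-group hash is a simplified stand-in (Flink actually applies a murmur hash to `key.hashCode()` in `KeyGroupRangeAssignment`), while the key-group-to-subtask formula matches Flink's `keyGroup * parallelism / maxParallelism`:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShardVsKeyGroup {

    // Kinesis routing: MD5 of the partition key, read as an unsigned 128-bit
    // integer, falls into one shard's hash-key range (shards assumed evenly split).
    static int kinesisShard(String partitionKey, int numShards) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            BigInteger hash = new BigInteger(1, md5); // non-negative 128-bit value
            return hash.multiply(BigInteger.valueOf(numShards))
                       .divide(BigInteger.ONE.shiftLeft(128))
                       .intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simplified stand-in for Flink's key-group assignment
    // (Flink murmur-hashes key.hashCode() before the modulo).
    static int flinkKeyGroup(String key, int maxParallelism) {
        return (key.hashCode() & 0x7fffffff) % maxParallelism;
    }

    // Flink's mapping from key group to operator subtask index.
    static int flinkSubtask(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int numShards = 4, parallelism = 4, maxParallelism = 128;
        for (String txId : new String[]{"tx-1001", "tx-1002", "tx-1003", "tx-1004"}) {
            int shard = kinesisShard(txId, numShards);
            int subtask = flinkSubtask(
                    flinkKeyGroup(txId, maxParallelism), maxParallelism, parallelism);
            // The two indices come from unrelated hashes, so any agreement
            // between shard and subtask is coincidental.
            System.out.printf("%s  shard=%d  subtask=%d%n", txId, shard, subtask);
        }
    }
}
```

Because the shard index and the subtask index are derived from unrelated hash functions, `reinterpretAsKeyedStream`'s precondition (data already partitioned exactly as Flink's `keyBy` would partition it) does not hold for Kinesis shards, independent of any scaling concerns.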
Comments