Can I avoid network shuffle when creating a KeyedStream from a pre-partitioned Kinesis Data Stream in Apache Flink?
Is it possible to create a KeyedStream from a pre-sharded/pre-partitioned Kinesis Data Stream without the need for a network shuffle (i.e. using reinterpretAsKeyedStream or something similar)?
- If that is not possible (i.e. the only reliable way is to consume from Kinesis and then use keyBy), then is network shuffling at least minimized by doing a keyBy on the field that the source is sharded by (e.g. env.addSource(source).keyBy(pojo -> pojo.getTransactionId()), where the source is a Kinesis data stream that is sharded by transactionId)?
- If the above is possible, what are the limitations?
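For intuition on why a keyBy on the shard field still implies a shuffle: Kinesis routes a record to a shard by taking the MD5 of its partition key over the 128-bit hash-key space, while Flink routes a keyed record by hashing the key into a key group and assigning key groups to subtasks. The sketch below contrasts the two routings; the class and method names are hypothetical, and the Flink-side hash is a simplified stand-in (real Flink applies a murmur mix to key.hashCode() via MathUtils.murmurHash, omitted here), so treat this as an illustration rather than Flink's actual implementation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical illustration: contrasts Kinesis shard routing with a
// simplified model of Flink's keyBy routing for the same key.
public class ShardVsKeyGroup {

    // Kinesis routing: MD5(partitionKey) interpreted as an unsigned 128-bit
    // number; with N shards splitting the hash-key space evenly, the shard
    // index is which N-th of the space the digest falls into.
    public static int kinesisShardFor(String partitionKey, int shardCount) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            BigInteger hash = new BigInteger(1, digest); // unsigned 128-bit
            BigInteger space = BigInteger.ONE.shiftLeft(128);
            return hash.multiply(BigInteger.valueOf(shardCount))
                       .divide(space).intValue();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Simplified model of Flink's key -> key group step (Flink really
    // murmur-mixes key.hashCode() first; omitted for brevity).
    public static int flinkKeyGroupFor(String key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Key group -> subtask split, as in Flink's KeyGroupRangeAssignment.
    public static int flinkSubtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int shards = 4, maxParallelism = 128, parallelism = 4;
        for (String txnId : new String[] {"txn-1", "txn-2", "txn-3", "txn-4"}) {
            int shard = kinesisShardFor(txnId, shards);
            int subtask = flinkSubtaskFor(
                    flinkKeyGroupFor(txnId, maxParallelism), maxParallelism, parallelism);
            // The two routings use unrelated hash functions, so even with
            // equal fan-out they generally disagree -> records must move.
            System.out.println(txnId + ": shard=" + shard + " subtask=" + subtask);
        }
    }
}
```

Because the two hash functions are unrelated, a record arriving on shard i generally belongs to a different Flink subtask, which is why keyBy cannot simply be elided even when the key equals the Kinesis partition key.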
 
What I've Learned so Far
- The functionality I am describing is already implemented by reinterpretAsKeyedStream, but this feature is experimental and seems to have significant drawbacks (as per discussions in the Stack Overflow posts below)
 - reinterpretAsKeyedStream docs: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/experimental/
 - https://stackoverflow.com/questions/72629086/using-keyby-vs-reinterpretaskeyedstream-when-reading-from-kafka
 - https://stackoverflow.com/questions/73278930/apache-flink-how-to-align-flink-and-kafka-sharding
 
- In addition to the above, all the discussions related to reinterpretAsKeyedStream that I've found are in the context of Kafka, so I'm not sure how the outcomes differ for a Kinesis Data Stream.
Context of my Application
- Re. configurations: both the Kinesis Data Stream and Flink will be hosted serverlessly, and automatically scale up/down depending on load (which, as I understand it, means that reinterpretAsKeyedStream cannot be used).
Any help/insight is much appreciated, thanks!
Answer 1
Score: 2
I don't believe there's any way to easily do what you want, at least not in a way that's resilient to changes in the parallelism of your source and your cluster. I have used helicopter stunts to achieve something similar to this, but it involved fragile code (depends on exactly how Flink handles partitioning).
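To make the fragility concrete: Flink maps each key into one of maxParallelism key groups, and splits key groups among subtasks with the integer division keyGroup * parallelism / maxParallelism (the formula in Flink's KeyGroupRangeAssignment). The sketch below (hypothetical names; the key hash is simplified, since real Flink murmur-mixes key.hashCode() first) shows that rescaling the operator moves key groups to different subtasks, which is exactly what breaks any handcrafted shard-to-subtask alignment when a serverless deployment changes parallelism.

```java
// Hypothetical illustration of why shard/subtask alignment breaks on rescale.
public class RescaleDemo {

    // Simplified model of Flink's key -> key group mapping; Flink applies
    // a murmur mix to key.hashCode() first, omitted here for brevity.
    public static int keyGroupFor(String key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // The key group -> subtask split Flink uses
    // (KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup).
    public static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        String key = "txn-42";
        int kg = keyGroupFor(key, maxParallelism);
        // The same key lands on a different subtask as parallelism changes,
        // so any alignment with Kinesis shards dissolves on the next rescale.
        for (int p : new int[] {2, 4, 8}) {
            System.out.println("parallelism=" + p
                    + " -> subtask " + subtaskFor(kg, maxParallelism, p));
        }
    }
}
```

For example, key group 100 out of 128 is owned by subtask 1 at parallelism 2, subtask 3 at parallelism 4, and subtask 6 at parallelism 8, so code that assumes a fixed shard-to-subtask mapping only holds until the first scaling event.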

