Can I avoid network shuffle when creating a KeyedStream from a pre-partitioned Kinesis Data Stream in Apache Flink?
Is it possible to create a `KeyedStream` from a pre-sharded/pre-partitioned Kinesis Data Stream without the need for a network shuffle (i.e. using `reinterpretAsKeyedStream` or something similar)?

- If that is not possible (i.e. the only reliable way is to consume from Kinesis and then use `keyBy`), then is network shuffling at least minimized by doing a `keyBy` on the field that the source is sharded by (e.g. `env.addSource(source).keyBy(pojo -> pojo.getTransactionId())`, where the source is a Kinesis data stream that is sharded by `transactionId`)?
- If the above is possible, what are the limitations?
What I've Learned so Far
- The functionality I am describing is already implemented by `reinterpretAsKeyedStream`, but this feature is experimental and seems to have significant drawbacks (as per the discussions in the Stack Overflow posts below)
  - `reinterpretAsKeyedStream` docs: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/experimental/
  - https://stackoverflow.com/questions/72629086/using-keyby-vs-reinterpretaskeyedstream-when-reading-from-kafka
  - https://stackoverflow.com/questions/73278930/apache-flink-how-to-align-flink-and-kafka-sharding
- In addition to the above, all the discussions related to `reinterpretAsKeyedStream` that I've found are in the context of Kafka, so I'm not sure how the outcomes differ for a Kinesis Data Stream.
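For reference, the two approaches discussed above look roughly like this in a Flink job. This is a sketch only, assuming the `flink-connector-kinesis` dependency; the stream name `transactions`, the region, and the `extractTransactionId` helper are hypothetical placeholders, and running it requires a Flink runtime:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class KinesisKeyedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");

        DataStream<String> source = env.addSource(
                new FlinkKinesisConsumer<>("transactions", new SimpleStringSchema(), props));

        // Option 1: an explicit keyBy -- always correct, but forces a network
        // shuffle even when the stream is already sharded by the same field.
        KeyedStream<String, String> keyed =
                source.keyBy(record -> extractTransactionId(record));

        // Option 2 (experimental): reinterpretAsKeyedStream asserts that the
        // data is already partitioned exactly as Flink's own keyBy would
        // partition it. If that assumption does not hold -- and for Kinesis
        // shards it generally does not -- results are silently incorrect.
        KeyedStream<String, String> reinterpreted =
                DataStreamUtils.reinterpretAsKeyedStream(
                        source, record -> extractTransactionId(record));

        env.execute("kinesis-keyed-job");
    }

    // Hypothetical helper: pull the transactionId out of the raw record.
    private static String extractTransactionId(String record) {
        return record.split(",")[0];
    }
}
```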
Context of my Application
- Re. configurations: both the Kinesis Data Stream and Flink will be hosted serverlessly, and will automatically scale up/down depending on load (which, as I understand it, means that `reinterpretAsKeyedStream` cannot be used).
Any help/insight is much appreciated, thanks!
Answer 1
Score: 2
I don't believe there's any way to easily do what you want, at least not in a way that's resilient to changes in the parallelism of your source and your cluster. I have used helicopter stunts to achieve something similar to this, but it involved fragile code (depends on exactly how Flink handles partitioning).
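The fragility the answer alludes to can be made concrete: Kinesis routes a record to a shard by taking the MD5 of its partition key over a 128-bit hash-key space, while Flink routes a key to a key group (and then a subtask) via its own hash. The two mappings use unrelated hash functions, so a shard's records scatter across subtasks. The sketch below uses the real Kinesis MD5 routing for evenly split shards; the key-group hash is a simplified stand-in (Flink actually applies a murmur hash to `key.hashCode()` in `KeyGroupRangeAssignment`), while the key-group-to-subtask formula matches Flink's `keyGroup * parallelism / maxParallelism`:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShardVsKeyGroup {

    // Kinesis routing: MD5 of the partition key, read as an unsigned 128-bit
    // integer, falls into one shard's hash-key range (shards assumed evenly split).
    static int kinesisShard(String partitionKey, int numShards) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            BigInteger hash = new BigInteger(1, md5); // non-negative 128-bit value
            return hash.multiply(BigInteger.valueOf(numShards))
                       .divide(BigInteger.ONE.shiftLeft(128))
                       .intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simplified stand-in for Flink's key-group assignment
    // (Flink murmur-hashes key.hashCode() before the modulo).
    static int flinkKeyGroup(String key, int maxParallelism) {
        return (key.hashCode() & 0x7fffffff) % maxParallelism;
    }

    // Flink's mapping from key group to operator subtask index.
    static int flinkSubtask(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int numShards = 4, parallelism = 4, maxParallelism = 128;
        for (String txId : new String[]{"tx-1001", "tx-1002", "tx-1003", "tx-1004"}) {
            int shard = kinesisShard(txId, numShards);
            int subtask = flinkSubtask(
                    flinkKeyGroup(txId, maxParallelism), maxParallelism, parallelism);
            // The two indices come from unrelated hashes, so any agreement
            // between shard and subtask is coincidental.
            System.out.printf("%s  shard=%d  subtask=%d%n", txId, shard, subtask);
        }
    }
}
```

Because the shard index and the subtask index are derived from unrelated hash functions, `reinterpretAsKeyedStream`'s precondition (data already partitioned exactly as Flink's `keyBy` would partition it) does not hold for Kinesis shards, independent of any scaling concerns.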
Comments