Does Flink have the minPartitions setting for Kafka like Spark?
Question
In Spark, there's the minPartitions setting. Here's what it's for, per the documentation:
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1
mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this
option to a value greater than your topicPartitions, Spark will divvy up large Kafka
partitions to smaller pieces. Please note that this configuration is like a hint: the
number of Spark tasks will be approximately minPartitions. It can be less or more
depending on rounding errors or Kafka partitions that didn't receive any new data.
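For concreteness, here is a minimal sketch of where that option is passed in a Structured Streaming read; the broker address and topic name are placeholders, not taken from the question. The same option works identically from the Scala and Python APIs.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkMinPartitionsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("min-partitions-example")
                .getOrCreate();

        // Suppose the topic has 8 partitions; minPartitions = 32 is a hint
        // asking Spark to split each topic partition into roughly 4 offset ranges.
        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
                .option("subscribe", "events")                       // placeholder topic
                .option("minPartitions", 32)
                .load();

        stream.printSchema();
    }
}
```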
Does Flink feature anything similar? I read that there's the KafkaPartitionSplit, but how do you use that?
The documentation also says this:
Source Split #
A source split in Kafka source represents a partition of Kafka topic. A Kafka source split consists of:
- TopicPartition the split representing
- Starting offset of the partition
- **Stopping offset of the partition, only available when the source is running in bounded mode**
Does splitting the source like this only make sense when processing historical data?
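For context, the KafkaPartitionSplit objects described above are created internally by the connector's split enumerator (one per topic partition) rather than by user code. The closest user-facing knobs are the starting and stopping offsets on the KafkaSource builder. Below is a minimal sketch of a bounded read, where each split gets a stopping offset; the broker, topic, and group id are placeholders.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedKafkaReadExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each Kafka partition becomes one KafkaPartitionSplit internally;
        // setBounded(...) gives every split a stopping offset, so the job
        // finishes once the historical data up to that point has been read.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")       // placeholder broker
                .setTopics("events")                         // placeholder topic
                .setGroupId("flink-reader")                  // placeholder group id
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setBounded(OffsetsInitializer.latest())     // bounded mode
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
           .print();

        env.execute("bounded-kafka-read");
    }
}
```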
Answer 1
Score: 1
Your Kafka source in the Flink workflow has a parallelism, which defines how many sub-tasks (each running in a separate slot) will be used to read from your Kafka topic. Flink hashes the topic name to pick the first sub-task that's assigned partition 0, and then continues (in a round-robin manner) to assign partitions to sub-tasks.
So you typically want to set your source parallelism such that the number of partitions is a multiple of the parallelism. E.g. if you have 40 partitions, set your parallelism to 40, 20, or 10, so that each sub-task processes the same number of partitions (thus mitigating data skew issues).
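To make that concrete, here is a sketch under the same assumptions (a placeholder topic named events with 40 partitions and a placeholder broker); the explicit setParallelism call on the source is the setting being described.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // placeholder broker
                .setTopics("events")                     // placeholder topic with 40 partitions
                .setGroupId("flink-reader")              // placeholder group id
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // With 40 partitions, a source parallelism of 10 assigns exactly
        // 4 partitions to each of the 10 reader sub-tasks.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
           .setParallelism(10)
           .print();

        env.execute("kafka-parallelism-example");
    }
}
```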
Comments