Does Flink have the minPartitions setting for Kafka like Spark?

Question

In Spark, there's the minPartitions setting. Here's what it's for as per documentation:

Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 
mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this 
option to a value greater than your topicPartitions, Spark will divvy up large Kafka 
partitions to smaller pieces. Please note that this configuration is like a hint: the 
number of Spark tasks will be approximately minPartitions. It can be less or more 
depending on rounding errors or Kafka partitions that didn't receive any new data.
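
For reference, this is roughly how that option is set on a Kafka source in Spark Structured Streaming; the broker address and topic name below are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkMinPartitionsExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("minPartitions-example")
                .getOrCreate();

        // The topic itself may only have e.g. 10 partitions, but minPartitions
        // asks Spark to split the offset ranges into roughly 40 read tasks.
        Dataset<Row> kafkaStream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
                .option("subscribe", "my-topic")                     // placeholder
                .option("minPartitions", "40")
                .load();

        kafkaStream.writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}
```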

Does Flink feature anything similar? I read that there's the KafkaPartitionSplit, but how do you use that?

The documentation also says this:

Source Split #
A source split in Kafka source represents a partition of Kafka topic. A Kafka source split consists of:

- TopicPartition the split representing
- Starting offset of the partition
- **Stopping offset of the partition, only available when the source is running in bounded mode**

Does splitting the source like this only make sense when processing historical data?
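
For context, here is a minimal sketch of how bounded mode is enabled on the KafkaSource builder, which is when the stop offsets mentioned above come into play; the broker, topic, and group id are placeholders:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedKafkaReadExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each KafkaPartitionSplit assigned to a reader covers one topic partition,
        // from its starting offset up to the stop offset fixed by setBounded().
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")              // placeholder
                .setTopics("my-topic")                              // placeholder
                .setGroupId("my-group")                             // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setBounded(OffsetsInitializer.latest())            // bounded mode: splits carry stop offsets
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .print();

        env.execute("bounded-kafka-read");
    }
}
```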

Answer 1

Score: 1

Your Kafka source in the Flink workflow has a parallelism, which defines how many sub-tasks (each running in a separate slot) will be used to read from your Kafka topic. Flink hashes the topic name to pick the first sub-task that's assigned partition 0, and then continues (in a round-robin manner) to assign partitions to sub-tasks.

So you typically want to set your source parallelism such that the number of partitions is a multiple of the parallelism. E.g. if you have 40 partitions, set your parallelism to 40, 20, or 10, so that each sub-task processes the same number of partitions (which mitigates data skew issues).
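
As a rough sketch of that suggestion (connection details are placeholders), a topic with 40 partitions read with a source parallelism of 10 leaves each sub-task with 4 partitions:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")              // placeholder
                .setTopics("topic-with-40-partitions")              // placeholder
                .setGroupId("my-group")                             // placeholder
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // 40 partitions read with parallelism 10: each sub-task handles 4 partitions.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .setParallelism(10)
                .print();

        env.execute("kafka-source-parallelism");
    }
}
```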

huangapple
  • Posted on July 3, 2023 at 10:20:27
  • Please retain this link when reposting: https://go.coder-hub.com/76601515.html