Does Flink have the minPartitions setting for Kafka like Spark?
Question
In Spark, there's the minPartitions setting. Here's what it's for, per the documentation:
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1
mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this
option to a value greater than your topicPartitions, Spark will divvy up large Kafka
partitions to smaller pieces. Please note that this configuration is like a hint: the
number of Spark tasks will be approximately minPartitions. It can be less or more
depending on rounding errors or Kafka partitions that didn't receive any new data.
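For concreteness, here is a minimal sketch of where that option is passed in a Structured Streaming read; the broker address and topic name are placeholders, not taken from the question. The same option works identically from the Scala and Python APIs.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkMinPartitionsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("min-partitions-example")
                .getOrCreate();

        // Suppose the topic has 8 partitions; minPartitions = 32 is a hint
        // asking Spark to split each topic partition into roughly 4 offset ranges.
        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
                .option("subscribe", "events")                       // placeholder topic
                .option("minPartitions", 32)
                .load();

        stream.printSchema();
    }
}
```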
Does Flink feature anything similar? I read that there's the KafkaPartitionSplit, but how do you use that?
The documentation also says this:
Source Split #
A source split in Kafka source represents a partition of Kafka topic. A Kafka source split consists of:
- TopicPartition the split representing
- Starting offset of the partition
- **Stopping offset of the partition, only available when the source is running in bounded mode**
Does splitting the source like this only make sense when processing historical data?
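For context, the KafkaPartitionSplit objects described above are created internally by the connector's split enumerator (one per topic partition) rather than by user code. The closest user-facing knobs are the starting and stopping offsets on the KafkaSource builder. Below is a minimal sketch of a bounded read, where each split gets a stopping offset; the broker, topic, and group id are placeholders.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedKafkaReadExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each Kafka partition becomes one KafkaPartitionSplit internally;
        // setBounded(...) gives every split a stopping offset, so the job
        // finishes once the historical data up to that point has been read.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")       // placeholder broker
                .setTopics("events")                         // placeholder topic
                .setGroupId("flink-reader")                  // placeholder group id
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setBounded(OffsetsInitializer.latest())     // bounded mode
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
           .print();

        env.execute("bounded-kafka-read");
    }
}
```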
Answer 1
Score: 1
Your Kafka source in the Flink workflow has a parallelism, which defines how many sub-tasks (each running in a separate slot) will be used to read from your Kafka topic. Flink hashes the topic name to pick the first sub-task that's assigned partition 0, and then continues (in a round-robin manner) to assign partitions to sub-tasks.
So you typically want to set your source parallelism such that the number of partitions is a multiple of the parallelism. E.g. if you have 40 partitions, set your parallelism to 40, 20, or 10, so that each sub-task processes the same number of partitions (thus mitigating data skew issues).
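To make that concrete, here is a sketch under the same assumptions (a placeholder topic named events with 40 partitions and a placeholder broker); the explicit setParallelism call on the source is the setting being described.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // placeholder broker
                .setTopics("events")                     // placeholder topic with 40 partitions
                .setGroupId("flink-reader")              // placeholder group id
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // With 40 partitions, a source parallelism of 10 assigns exactly
        // 4 partitions to each of the 10 reader sub-tasks.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
           .setParallelism(10)
           .print();

        env.execute("kafka-parallelism-example");
    }
}
```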
Comments