Number of partitions when creating a Spark dataframe

Question

This question has already been asked in other threads, but it seems my problem doesn't fit any of them.

I'm using Spark 2.4.4 in local mode, with the master set to local[16] to use 16 cores. The web UI also shows that 16 cores have been allocated.

I create a dataframe by importing a CSV file of about 8 MB, like this:

val df = spark.read.option("inferSchema", "true").option("header", "true").csv("Datasets/globalpowerplantdatabasev120/*.csv")

Finally, I print the number of partitions the dataframe is made of:

df.rdd.partitions.size

res5: Int = 2

The answer is 2.

Why? From what I have read, the number of partitions depends on the number of executors, which by default is set equal to the number of cores (16).

I tried to set the number of executors using spark.default.Parallelism = 4 and/or spark.executor.instances = 4 and started a new Spark session, but nothing changed in the number of partitions.
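A minimal sketch of how such settings are typically applied when building the session (the builder code itself is an assumption; only the master string and the two config keys/values come from the description above):

import org.apache.spark.sql.SparkSession

// Illustrative only: a local[16] session with the two settings above
// passed through the builder (Spark's actual key is the lower-case
// spark.default.parallelism).
val spark = SparkSession.builder()
  .master("local[16]")
  .config("spark.default.parallelism", "4")
  .config("spark.executor.instances", "4")
  .getOrCreate()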

Any suggestions?

Answer 1

Score: 1

When you read a file with Spark, the number of partitions is calculated as the maximum of defaultMinPartitions and the number of splits computed from the Hadoop input split size and the block size. Since your file is small, the number of partitions you get is 2, which is the maximum of the two.

The default defaultMinPartitions is calculated as:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

Please check https://github.com/apache/spark/blob/e9f983df275c138626af35fd263a7abedf69297f/core/src/main/scala/org/apache/spark/SparkContext.scala#L2329
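As a quick illustration in the same local[16] shell (a sketch; the values in the comments assume the setup from the question), the quantities involved can be inspected directly, and the dataframe can be repartitioned afterwards if more parallelism is wanted:

// Values behind the calculation described above (local[16] shell).
spark.sparkContext.defaultParallelism    // 16
spark.sparkContext.defaultMinPartitions  // min(16, 2) = 2
df.rdd.getNumPartitions                  // 2 for the ~8 MB input

// If more parallelism is needed, repartition explicitly after reading.
val df16 = df.repartition(16)
df16.rdd.getNumPartitions                // 16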
