Number of partitions when creating a Spark dataframe

Question

The question has been asked in other threads, but it seems my problem doesn't fit any of them.

I'm using Spark 2.4.4 in local mode and set the master to local[16] to use 16 cores. I also see in the web UI that 16 cores have been allocated.

I create a dataframe by importing a CSV file of about 8 MB like this:

val df = spark.read.option("inferSchema", "true").option("header", "true").csv("Datasets/globalpowerplantdatabasev120/*.csv")

Finally, I print the number of partitions the dataframe is made of:

df.rdd.partitions.size

res5: Int = 2

The answer is 2.

Why? As far as I have read, the number of partitions depends on the number of executors, which by default is set equal to the number of cores (16).

I tried to set the number of executors using spark.default.parallelism = 4 and/or spark.executor.instances = 4 and started a new Spark session, but nothing changed in the number of partitions.
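For reference, here is roughly how I set those options when creating the new session (a sketch using the SparkSession builder; the application name is just a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[16]")
  .appName("partition-test")                 // placeholder name
  .config("spark.default.parallelism", "4")
  .config("spark.executor.instances", "4")
  .getOrCreate()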

Any suggestions?

Answer 1

Score: 1

When you read a file using Spark, the number of partitions is calculated as the maximum of defaultMinPartitions and the number of splits computed from the Hadoop input split size divided by the block size. Since your file is small, the number of partitions you are getting is 2, which is the maximum of the two.
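Working through the numbers under that logic (assuming the default 128 MB split size): an ~8 MB file fits in a single input split, defaultMinPartitions resolves to min(16, 2) = 2 (see the definition below), and max(2, 1) = 2, which matches what you observe.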

defaultMinPartitions is calculated as

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

Please check https://github.com/apache/spark/blob/e9f983df275c138626af35fd263a7abedf69297f/core/src/main/scala/org/apache/spark/SparkContext.scala#L2329
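If you simply want the data spread across all 16 cores, one option is to repartition the dataframe explicitly after reading it (a sketch; the value 16 just matches the core count):

val dfRepartitioned = df.repartition(16)   // shuffles the data into exactly 16 partitions
dfRepartitioned.rdd.partitions.size        // now returns 16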
