Number of partitions when creating a Spark dataframe
Question
This question has been asked in other threads, but it seems my problem doesn't fit any of them.
I'm using Spark 2.4.4 in local mode, with the master set to local[16] to use 16 cores. The web UI also shows that 16 cores have been allocated.
I create a dataframe by importing a CSV file of about 8 MB, like this:
val df = spark.read.option("inferSchema", "true").option("header", "true").csv("Datasets/globalpowerplantdatabasev120/*.csv")
Finally, I print the number of partitions the dataframe is made of:
df.rdd.partitions.size
res5: Int = 2
The answer is 2.
Why? From what I have read, the number of partitions depends on the number of executors, which is by default set equal to the number of cores (16).
I tried to set the number of executors using spark.default.Parallelism = 4 and/or spark.executor.instances = 4, and started a new Spark session, but the number of partitions did not change.
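For reference, this is roughly how I rebuilt the session with those settings (a minimal sketch of my attempt; note that the actual property key is spark.default.parallelism, all lowercase):
import org.apache.spark.sql.SparkSession

// Rebuild the session with the properties mentioned above.
// spark.executor.instances only matters on a cluster manager, not in local mode.
val spark = SparkSession.builder()
  .master("local[16]")
  .config("spark.default.parallelism", "4")
  .config("spark.executor.instances", "4")
  .getOrCreate()

val df = spark.read.option("inferSchema", "true").option("header", "true").csv("Datasets/globalpowerplantdatabasev120/*.csv")
df.rdd.partitions.size  // still 2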
Any suggestion?
Answer 1
Score: 1
When you read a file using Spark, the number of partitions is calculated as the maximum of defaultMinPartitions and the number of splits computed from the Hadoop input split size divided by the block size. Since your file is small, the number of partitions you get is 2, which is the maximum of the two.
The default defaultMinPartitions is calculated as:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
See https://github.com/apache/spark/blob/e9f983df275c138626af35fd263a7abedf69297f/core/src/main/scala/org/apache/spark/SparkContext.scala#L2329
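A quick way to inspect these values from the Spark shell (a small sketch, assuming the same local[16] session as in the question):
spark.sparkContext.defaultParallelism    // 16 with master local[16]
spark.sparkContext.defaultMinPartitions  // math.min(16, 2) = 2

// If you need more partitions for such a small file, repartition explicitly:
val df16 = df.repartition(16)
df16.rdd.partitions.size                 // 16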