Number of partitions created in Spark RDD

Question

I have a file called hello.txt of 32 bytes in the Hadoop file system. It created 1 data block in HDFS. As far as I know, Spark should ideally create 1 partition per data block of the file. But in the output I see 2 partitions; see the example below.

$ pyspark --master yarn --executor-cores 1 --num-executors 1 --name test1

>>> rdd1 = sc.textFile("hdfs://localhost:9000/hello.txt")
>>> sc.defaultParallelism
2
>>> rdd1.getNumPartitions()
2
>>> rdd1.glom().collect()
[['hi hello', 'hello everyone'], ['bye everyone', '']]

Can someone explain to me how 2 partitions are created in this case instead of 1?

Answer 1

Score: 1

By default, Spark/PySpark creates a number of partitions equal to the number of CPU cores in the machine. See Understanding Spark Partitioning.

However, you can explicitly specify the number of partitions to be created.

>>> rdd1 = sc.textFile("hdfs://localhost:9000/hello.txt", 1)
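As a quick sanity check in the same shell session (assuming the same 32-byte hello.txt as in the question): the second argument to textFile is the minimum number of partitions, so getNumPartitions() should now report 1.

>>> rdd1 = sc.textFile("hdfs://localhost:9000/hello.txt", 1)
>>> rdd1.getNumPartitions()
1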

Answer 2

Score: 0

As I recall, the default block size in a Hadoop system is 128 MB, so if a file is smaller than 128 MB, it still takes up a whole block, but the remaining space in that block stays unused.

Your file hello.txt is 32 bytes, which is smaller than the default block size, but Hadoop still allocates one 128 MB block for the file, leaving the rest of the block empty.

On the other hand, Spark's default parallelism is set to 2, so it splits the block into two smaller partitions for parallel processing. The data in the block is divided across those partitions, e.g. one partition containing 'hi hello' and 'hello everyone', and the other containing the remaining lines.

And Spark creates partitions based on the Hadoop block size, not on the actual size of the file data. Therefore, even though hello.txt occupies only one block, Spark still creates two partitions.
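To make the arithmetic behind this concrete: sc.textFile passes a minimum number of partitions down to Hadoop's FileInputFormat (by default min(defaultParallelism, 2), which is 2 here), and FileInputFormat divides the file length by that number to get a target split size (32 / 2 = 16 bytes). Below is a minimal Python sketch of that split computation; compute_splits is a hypothetical helper that mirrors the logic, not an actual Spark or Hadoop API.

# Minimal sketch of Hadoop FileInputFormat's split computation,
# which Spark's textFile relies on. Not real Spark/Hadoop code.
SPLIT_SLOP = 1.1  # the last chunk may be up to 10% larger than split_size

def compute_splits(total_size, num_splits, block_size=128 * 1024 * 1024, min_size=1):
    # Target size per split: file size divided by the requested split count.
    goal_size = total_size // max(num_splits, 1)
    # Actual split size is capped by the block size and floored by min_size.
    split_size = max(min_size, min(goal_size, block_size))
    splits, remaining = [], total_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append((total_size - remaining, split_size))  # (offset, length)
        remaining -= split_size
    if remaining:
        splits.append((total_size - remaining, remaining))
    return splits

print(compute_splits(32, 2))  # [(0, 16), (16, 16)] -> 2 partitions
print(compute_splits(32, 1))  # [(0, 32)]           -> 1 partition

So the two partitions come from the requested minimum of 2, not from the number of HDFS blocks.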
