Number of partitions created in Spark RDD
Question
I have one file called hello.txt, 32 bytes in size, in the Hadoop file system. It occupies 1 data block in HDFS. As far as I know, it should ideally create 1 partition, equal to the number of data blocks of the file. But in the output I see 2 partitions; see the example below.
-> pyspark --master yarn --executor-cores 1 --num-executors 1 --name test1
-> rdd1 = sc.textFile("hdfs://localhost:9000/hello.txt")
-> sc.defaultParallelism
o/p: 2
-> rdd1.getNumPartitions()
o/p: 2
-> rdd1.glom().collect()
o/p: [['hi hello', 'hello everyone'], ['bye everyone', '']]
Can someone explain to me how 2 partitions are created in this case instead of 1?
Answer 1
Score: 1
By default, Spark/PySpark creates a number of partitions equal to the number of CPU cores in the machine; see Understanding Spark Partitioning.
However, you can explicitly specify the number of partitions to be created:
-> rdd1 = sc.textFile("hdfs://localhost:9000/hello.txt", 1)
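As a quick check, here is a minimal PySpark sketch (assuming the same 32-byte hello.txt and the same session as in the question) showing where the 2 comes from: textFile's second argument defaults to sc.defaultMinPartitions, which PySpark defines as min(sc.defaultParallelism, 2).

# Minimal sketch, run in the same pyspark session as the question.
print(sc.defaultParallelism)            # 2 in this session
print(sc.defaultMinPartitions)          # min(2, 2) == 2

rdd_default = sc.textFile("hdfs://localhost:9000/hello.txt")
print(rdd_default.getNumPartitions())   # 2, from defaultMinPartitions

rdd_single = sc.textFile("hdfs://localhost:9000/hello.txt", 1)
print(rdd_single.getNumPartitions())    # 1, as explicitly requested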
Answer 2
Score: 0
As I recall, the default block size in a Hadoop system is 128 MB, so a file smaller than 128 MB still occupies a whole block; HDFS tracks it as a single block, although only the file's actual bytes are stored on disk.
Your file hello.txt is 32 bytes, which is smaller than the default block size, so Hadoop still allocates just one block for the file.
On the other hand, Spark's default parallelism is set to 2, so it splits the block into two smaller partitions for parallel processing. The data in the block is broken up across those partitions: one partition contains ['hi hello', 'hello everyone'] and the other contains ['bye everyone', ''].
In other words, Spark does not create partitions one-to-one with HDFS blocks; the number of partitions is driven by its parallelism setting rather than by the actual size of the file data. Therefore, even though hello.txt occupies only one block, Spark still creates two partitions.
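To make the split arithmetic concrete, here is a rough Python sketch of how Hadoop's FileInputFormat sizes the input splits for this file. It is a simplification: compute_splits is a hypothetical helper, not the real Hadoop API, and it ignores the minimum split size and the SPLIT_SLOP slack factor the real implementation uses; the record reader also re-aligns records to line boundaries afterwards, which is why whole lines land in each partition.

# Simplified, illustrative sketch of FileInputFormat split sizing for a
# 32-byte file read with minPartitions == 2 (not the real Hadoop API).
def compute_splits(file_size, num_splits, block_size=128 * 1024 * 1024):
    goal_size = file_size // num_splits        # 32 // 2 == 16 bytes
    split_size = min(goal_size, block_size)    # 16 bytes per split
    splits, offset = [], 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

print(compute_splits(32, 2))  # [(0, 16), (16, 16)] -> two partitions

Each (offset, length) pair becomes one partition of the RDD, which is why the single 32-byte block still yields two partitions.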