Number of partitions created in Spark RDD
Question
I have one file called hello.txt, 32 bytes in size, in the Hadoop file system. It occupies 1 data block in HDFS. As far as I know, it should ideally create 1 partition, equal to the number of data blocks of the file. But in the output I see 2 partitions; see the example below.
-> pyspark --master yarn --executor-cores 1 --num-executors 1 --name test1
-> rdd1 = sc.textFile("hdfs://localhost:9000/hello.txt")
-> sc.defaultParallelism
o/p: 2
-> rdd1.getNumPartitions()
o/p: 2
-> rdd1.glom().collect()
o/p: [['hi hello', 'hello everyone'], ['bye everyone', '']]
Can someone explain to me how 2 partitions are created in this case instead of 1?
Answer 1
Score: 1
By default, Spark/PySpark creates a number of partitions equal to the number of CPU cores in the machine; see Understanding Spark Partitioning.
However, you can explicitly specify the number of partitions to be created:
-> rdd1 = sc.textFile("hdfs://localhost:9000/hello.txt", 1)
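As a quick check, here is a minimal PySpark sketch (assuming the same 32-byte hello.txt and the same session as in the question) showing where the 2 comes from: textFile's second argument defaults to sc.defaultMinPartitions, which PySpark defines as min(sc.defaultParallelism, 2).

# Minimal sketch, run in the same pyspark session as the question.
print(sc.defaultParallelism)            # 2 in this session
print(sc.defaultMinPartitions)          # min(2, 2) == 2

rdd_default = sc.textFile("hdfs://localhost:9000/hello.txt")
print(rdd_default.getNumPartitions())   # 2, from defaultMinPartitions

rdd_single = sc.textFile("hdfs://localhost:9000/hello.txt", 1)
print(rdd_single.getNumPartitions())    # 1, as explicitly requested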
Answer 2
Score: 0
As I recall, the default block size in a Hadoop system is 128 MB, so a file smaller than 128 MB still occupies a whole block; HDFS tracks it as a single block, although only the file's actual bytes are stored on disk.
Your file hello.txt is 32 bytes, which is smaller than the default block size, so Hadoop still allocates just one block for the file.
On the other hand, Spark's default parallelism is set to 2, so it splits the block into two smaller partitions for parallel processing. The data in the block is broken up across those partitions: one partition contains ['hi hello', 'hello everyone'] and the other contains ['bye everyone', ''].
In other words, Spark does not create partitions one-to-one with HDFS blocks; the number of partitions is driven by its parallelism setting rather than by the actual size of the file data. Therefore, even though hello.txt occupies only one block, Spark still creates two partitions.
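To make the split arithmetic concrete, here is a rough Python sketch of how Hadoop's FileInputFormat sizes the input splits for this file. It is a simplification: compute_splits is a hypothetical helper, not the real Hadoop API, and it ignores the minimum split size and the SPLIT_SLOP slack factor the real implementation uses; the record reader also re-aligns records to line boundaries afterwards, which is why whole lines land in each partition.

# Simplified, illustrative sketch of FileInputFormat split sizing for a
# 32-byte file read with minPartitions == 2 (not the real Hadoop API).
def compute_splits(file_size, num_splits, block_size=128 * 1024 * 1024):
    goal_size = file_size // num_splits        # 32 // 2 == 16 bytes
    split_size = min(goal_size, block_size)    # 16 bytes per split
    splits, offset = [], 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

print(compute_splits(32, 2))  # [(0, 16), (16, 16)] -> two partitions

Each (offset, length) pair becomes one partition of the RDD, which is why the single 32-byte block still yields two partitions.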