Different number of partitions after spark.read & filter depending on Databricks runtime
Question
I have parquet files saved in Delta Lake with the following layout:
data/
├─ _delta_log/
├─ Year=2018/
│ ├─ Month=1/
│ │ ├─ Day=1/
│ │ │ ├─ part-01797-bd37cd91-cea6-421c-b9bb-0578796bc909.c000.snappy.parquet
│ │ ├─ ...
│ │ ├─ Day=31/
│ │ │ ├─ part-01269-3b9691cf-311a-4d26-8348-c0ced0e06bf0.c000.snappy.parquet
│ ├─ Month=2/
├─ Year=2019/
When I read all the data, filter for a given month, and show the number of partitions:
from pyspark.sql import functions as F

df = spark.read.format('delta').load("/data")
df = df.filter((F.col("Year") == 2018) & (F.col("Month") == 1))
df.rdd.getNumPartitions()
I get different results depending on which Databricks Runtime I use:
- DBR 10.4: 15 partitions
- DBR 12.2: 10 partitions
The total number of parquet files for January 2018 in file storage is 31 (one per day, as the ASCII tree above indicates).
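One way to double-check that count, independently of getNumPartitions(), is to count the distinct source files behind the filtered DataFrame. A minimal sketch using the df from the snippet above (the src_file column name is just illustrative):

from pyspark.sql import functions as F

# Count the distinct parquet files that back the filtered DataFrame.
# For January 2018 this should come out to 31 (one file per day), even though
# df.rdd.getNumPartitions() reports fewer scan partitions.
n_files = (
    df.withColumn("src_file", F.input_file_name())
      .select("src_file")
      .distinct()
      .count()
)
print(n_files)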
My questions are:
- Why, in both cases, is the number of partitions reduced from 31 (to 15 or 10) after filtering?
- Why the difference between DBR 10.4 and DBR 12.2? Since there is a TimeStamp column in the DataFrame, could it have something to do with Ingestion Time Clustering, which was released in DBR 11.2?
Answer 1
Score: 1
- Partitions are not tied to the number of records in your DataFrame.
Link: https://sparkbyexamples.com/spark/spark-partitioning-understanding/
> Spark by default partitions data based on a number of factors, and the factors differ depending on where you run your job and in what mode.
- Spark tries to balance utilizing the entire cluster while reducing the amount of disk shuffling (which takes a lot of network compute).
Link: https://medium.com/@dipayandev/everything-you-need-to-understand-data-partitioning-in-spark-487d4be63b9c
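To make those "factors" a bit more concrete: for file-based scans, open-source Spark packs many small files into fewer read partitions, and the target split size is derived from a couple of session settings. A rough sketch of what to inspect (Databricks runtimes may tune these values or use their own packing logic, which could explain the 15 vs. 10 difference):

# Settings that control how Spark packs small parquet files into scan partitions.
# Values noted are the open-source defaults; a Databricks runtime may override them.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # default 128 MB
print(spark.conf.get("spark.sql.files.openCostInBytes"))    # default 4 MB

# Simplified sizing used by open-source Spark's FilePartition logic:
#   bytesPerCore  = (total file bytes + num_files * openCostInBytes) / default parallelism
#   maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))
# Files smaller than maxSplitBytes are packed together, which is why 31 small
# daily files can end up in 15 or 10 partitions depending on cluster and runtime.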