Different number of partitions after spark.read & filter depending on Databricks runtime
Question
I have parquet files saved in Delta Lake with the following layout:
data/
├─ _delta_log/
├─ Year=2018/
│ ├─ Month=1/
│ │ ├─ Day=1/
│ │ │ ├─ part-01797-bd37cd91-cea6-421c-b9bb-0578796bc909.c000.snappy.parquet
│ │ ├─ ...
│ │ ├─ Day=31/
│ │ │ ├─ part-01269-3b9691cf-311a-4d26-8348-c0ced0e06bf0.c000.snappy.parquet
│ ├─ Month=2/
├─ Year=2019/
When I read all the data, filter for a given month, and show the number of partitions:
from pyspark.sql import functions as F

df = spark.read.format('delta').load("/data")
df = df.filter((F.col("Year") == 2018) & (F.col("Month") == 1))
df.rdd.getNumPartitions()
I get different results depending on which Databricks Runtime I use:
- DBR 10.4: 15 partitions
- DBR 12.2: 10 partitions
The total number of parquet files for January 2018 in file storage is 31 (one per day, as the ASCII tree above indicates).
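One way to double-check that count, independently of getNumPartitions(), is to count the distinct source files behind the filtered DataFrame. A minimal sketch using the df from the snippet above (the src_file column name is just illustrative):

from pyspark.sql import functions as F

# Count the distinct parquet files that back the filtered DataFrame.
# For January 2018 this should come out to 31 (one file per day), even though
# df.rdd.getNumPartitions() reports fewer scan partitions.
n_files = (
    df.withColumn("src_file", F.input_file_name())
      .select("src_file")
      .distinct()
      .count()
)
print(n_files)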
My questions are:
- Why, in both cases, is the number of partitions reduced from 31 (to 15 or 10) after filtering?
- Why the difference between DBR 10.4 and DBR 12.2? Since there is a TimeStamp column in the DataFrame, could it have something to do with Ingestion Time Clustering, which was released in DBR 11.2?
Answer 1
Score: 1
- Partitions are not tied to the number of records in your DataFrame.
Link: https://sparkbyexamples.com/spark/spark-partitioning-understanding/
> Spark by default partitions data based on a number of factors, and the factors differ depending on where you run your job and in what mode.
- Spark tries to balance utilizing the entire cluster while reducing the amount of disk shuffling (which takes a lot of network compute).
Link: https://medium.com/@dipayandev/everything-you-need-to-understand-data-partitioning-in-spark-487d4be63b9c
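To make those "factors" a bit more concrete: for file-based scans, open-source Spark packs many small files into fewer read partitions, and the target split size is derived from a couple of session settings. A rough sketch of what to inspect (Databricks runtimes may tune these values or use their own packing logic, which could explain the 15 vs. 10 difference):

# Settings that control how Spark packs small parquet files into scan partitions.
# Values noted are the open-source defaults; a Databricks runtime may override them.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # default 128 MB
print(spark.conf.get("spark.sql.files.openCostInBytes"))    # default 4 MB

# Simplified sizing used by open-source Spark's FilePartition logic:
#   bytesPerCore  = (total file bytes + num_files * openCostInBytes) / default parallelism
#   maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))
# Files smaller than maxSplitBytes are packed together, which is why 31 small
# daily files can end up in 15 or 10 partitions depending on cluster and runtime.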