Different number of partitions after spark.read & filter depending on Databricks runtime

Question

I have parquet files saved in the following delta lake format:

data/
├─ _delta_log/
├─ Year=2018/
│  ├─ Month=1/
│  │  ├─ Day=1/
│  │  │  ├─ part-01797-bd37cd91-cea6-421c-b9bb-0578796bc909.c000.snappy.parquet
│  │  ├─ ...
│  │  ├─ Day=31/
│  │  │  ├─ part-01269-3b9691cf-311a-4d26-8348-c0ced0e06bf0.c000.snappy.parquet
│  ├─ Month=2/
├─ Year=2019/

When I read all the data, filter for a given month, and show the number of partitions:

from pyspark.sql import functions as F

df = spark.read.format('delta').load("/data")
df = df.filter((F.col("Year") == 2018) & (F.col("Month") == 1))
df.rdd.getNumPartitions()

I get different results depending on which Databricks Runtime I use:

  • DBR 10.4: 15 partitions
  • DBR 12.2: 10 partitions

The total number of parquet files for January 2018 in file storage is 31 (one per day, as the ASCII tree above indicates).

My questions are:

  1. Why, in both cases, is the number of partitions reduced from 31 to only 15 or 10 after filtering?
  2. Why the difference between DBR 10.4 and DBR 12.2? Since there is a TimeStamp column in the DataFrame, could it have something to do with Ingestion Time Clustering, which was released in DBR 11.2?

Answer 1

Score: 1

  1. Partitions are not tied to the number of records in your DataFrame (illustrated in the sketch below).

Link: https://sparkbyexamples.com/spark/spark-partitioning-understanding/

> Spark by default partitions data based on a number of factors, and the factors differ depending on where you run your job and in what mode.
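
As a rough illustration of the first point (this sketch is mine, not the answerer's, and assumes standard open-source Spark behaviour, which DBR builds on and may further optimize): when Spark scans files it packs them into input partitions according to file sizes and the settings spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, not according to row counts, which is a plausible reason 31 small daily files collapse into 15 or 10 scan partitions. The path and filter are reused from the question; the 32 MB value is an arbitrary example.

from pyspark.sql import functions as F

# spark is the SparkSession that Databricks provides in a notebook.
# Open-source defaults: 128 MB per scan partition, 4 MB assumed cost per extra file.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
print(spark.conf.get("spark.sql.files.openCostInBytes"))

jan = (spark.read.format("delta").load("/data")
       .filter((F.col("Year") == 2018) & (F.col("Month") == 1)))
print(jan.rdd.getNumPartitions())   # driven by file sizes, e.g. 10 or 15

# Asking for smaller scan partitions produces more of them, even though the
# records being read are exactly the same.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
jan_small = (spark.read.format("delta").load("/data")
             .filter((F.col("Year") == 2018) & (F.col("Month") == 1)))
print(jan_small.rdd.getNumPartitions())

If the two runtimes produced files of different sizes when the table was written (write-side features can change that), the same read could also report different partition counts.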

  2. Spark tries to balance utilizing the entire cluster against the amount of shuffling it does, since shuffles are expensive in network and disk I/O (see the sketch below).

Link: https://medium.com/@dipayandev/everything-you-need-to-understand-data-partitioning-in-spark-487d4be63b9c
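
To make that trade-off concrete, here is a minimal sketch of the two standard knobs for overriding the default partitioning, again reusing the question's DataFrame; the target partition counts (4 and 31) are arbitrary examples, not recommendations.

from pyspark.sql import functions as F

jan = (spark.read.format("delta").load("/data")
       .filter((F.col("Year") == 2018) & (F.col("Month") == 1)))

# coalesce only merges existing partitions, so it avoids a full shuffle
# but can leave the remaining partitions unevenly sized.
merged = jan.coalesce(4)

# repartition performs a full shuffle (network and disk I/O) and lets you
# pick the partition count and key -- here 31 partitions hashed on Day.
spread = jan.repartition(31, "Day")

print(merged.rdd.getNumPartitions(), spread.rdd.getNumPartitions())

coalesce is usually the cheaper choice when you only need fewer partitions; repartition is worth its shuffle when downstream work needs more parallelism or evenly sized partitions.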
