Is combining executeCompaction() and executeZOrderBy() in Databricks sensible?
Question
I'm wondering whether or not it makes sense to combine the two operations.
Let's say I have a delta_table built from some .parquet files, which I read using DeltaTable.forPath(spark, path).
Now I want to do some optimization on the dataset.
I know about the following two optimizations:
- delta_table.optimize().executeCompaction()
- delta_table.optimize().executeZOrderBy(some_cols)
My questions:
- Does it make sense to combine the operations, in the sense that my SQL queries will perform better?
- Does the order matter?
If there is no clear answer, I will just perform some test queries and time them.
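For reference, a rough timing harness along the following lines is what I would try; the path and the event_date filter column are hypothetical placeholders:

```python
import time

from delta.tables import DeltaTable

path = "/tmp/delta/my_table"  # placeholder path
# Assumes an active SparkSession named `spark` (e.g. a Databricks notebook).
delta_table = DeltaTable.forPath(spark, path)

def time_query() -> float:
    # Clear cached data so the comparison isn't skewed by caching.
    spark.catalog.clearCache()
    start = time.perf_counter()
    # A representative filtered query; count() forces full evaluation.
    spark.read.format("delta").load(path) \
        .where("event_date >= '2023-01-01'") \
        .count()
    return time.perf_counter() - start

before = time_query()
delta_table.optimize().executeCompaction()
delta_table.optimize().executeZOrderBy("event_date")
after = time_query()
print(f"compaction + ZORDER: {before:.2f}s -> {after:.2f}s")
```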
Thanks in advance.
Answer 1
Score: 1
From the manuals:
executeCompaction()
Delta Lake on Databricks can improve the speed of read queries from a table - the well-known small files problem. One way to improve this speed is to coalesce small files into larger ones, which is simply faster for Spark to read. This operation is independent of Z-ordering as such.
Caveats:
If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate via a WHERE clause to reduce the scope. I find this of limited use unless partitions are added in time order.
For Unity Catalog managed tables, Databricks tunes most of these configurations automatically if you're using a SQL endpoint or Databricks Runtime 11.3 LTS or above. So it's partly manual, partly automatic, depending on your environment.
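For concreteness, a minimal sketch of a plain and a scoped compaction run; the path and the event_date partition column are hypothetical:

```python
from delta.tables import DeltaTable

# Assumes an active SparkSession named `spark` and a Delta table
# partitioned by a (hypothetical) event_date column.
delta_table = DeltaTable.forPath(spark, "/tmp/delta/my_table")

# Compact the whole table into fewer, larger files ...
delta_table.optimize().executeCompaction()

# ... or narrow the scope with a partition predicate, per the caveat
# above; the filter may only reference partition columns.
delta_table.optimize() \
    .where("event_date >= '2023-01-01'") \
    .executeCompaction()
```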
executeZOrderBy()
Used for 'enhanced' data skipping (WHERE clauses). 'Standard' data skipping is already applied by Databricks / the delta API and relies on file-level statistics. ZORDER goes further for high-cardinality columns by physically co-locating related data, as opposed to just storing stats.
Caveats: I seem to remember from somewhere that the benefit depends on the type of query and on selecting a small range of the data. It is an expensive operation.
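A corresponding sketch for Z-ordering; the column names are hypothetical, and the usual advice is to pick a few high-cardinality, non-partition columns that appear in your WHERE clauses:

```python
from delta.tables import DeltaTable

# Assumes an active SparkSession named `spark`.
delta_table = DeltaTable.forPath(spark, "/tmp/delta/my_table")

# Physically co-locate rows by high-cardinality, non-partition columns
# used in filters; effectiveness drops with each extra column.
delta_table.optimize().executeZOrderBy("user_id", "session_id")
```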
Finally, the order is not an issue, because:
- ZORDER works by reducing files anyway.
- If doing both on a subset of the data, it does not matter as long as you apply them to non-overlapping parts of the data. If the data overlaps, then in my experience on recent projects you would do one or the other.
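As a practical consequence (and as far as I understand it, the Z-order rewrite bin-packs small files as part of the same pass), a single executeZOrderBy() is usually sufficient, rather than compacting first and Z-ordering second:

```python
from delta.tables import DeltaTable

# Assumes an active SparkSession named `spark`.
delta_table = DeltaTable.forPath(spark, "/tmp/delta/my_table")

# One pass: the Z-order rewrite also compacts the files it touches,
# so a preceding executeCompaction() would largely be rewritten again.
delta_table.optimize().executeZOrderBy("user_id")
```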