2023年5月24日 23:34:19go评论132阅读模式

英文:

Is combining executeCompaction() and executeZOrderBy() in Databricks sensible?

问题

I'm wondering whether or not it makes sense to combine the two operations.

Let's say I have a delta_table from some .parquet files. I read them using DeltaTable.forPath(spark, path).

Now I want to do some optimization on the dataset.

I know about the following two optimizations:

delta_table.optimize().executeCompaction()
delta_table.optimize().executeZOrderBy(some_cols)

My questions:

Does it make sense to combine the operations, in the sense that will my SQL perform better on it?
Does the order matter?

If there is no clear answer, I will just perform some test queries and time them.

Thanks in advance.

英文:

I'm wondering whether or not it makes sense to combine the two operations.

Let's say I have a delta_table from some .parquet files. I read them using DeltaTable.forPath(spark, path).

Now I want to do some optimization on the dataset.

I know about the following two optimizations:

delta_table.optimize().executeCompaction()
delta_table.optimize().executeZOrderBy(some_cols)

My questions:

Does it make sense to combine the operations, in the sense that will my SQL perform better on it?
Does the order matter?

If there is no clear answer, I will just perform some test queries and time them.

Thanks in advance.

答案1

得分: 1

executeCompaction()
执行executeCompaction()函数可以改善从表中执行的读取查询速度，特别是在处理小文件时。提高速度的一种方法是将小文件合并为较大的文件，这对于Spark来说更快速。这个操作与ZORDER无关。

Caveats：
注意：
如果您有大量数据，只想优化其中的一部分，那么可以通过使用WHERE子句指定可选的分区谓词来缩小范围。除非分区是按时间顺序添加的，否则在这方面可能会遇到一些困难。

对于Unity Catalog管理的表，如果您使用SQL端点或Databricks Runtime 11.3 LTS或更高版本，则Databricks会自动调整大多数这些配置，具体取决于您的环境。因此，根据您的环境，有些是手动配置的，有些是自动配置的。

executeZOrderBy
用于增强数据跳过（WHERE子句）功能，标准数据跳过已经由Databricks / delta API标准应用并依赖于统计信息。但您可以通过使用ZORDER来为高基数数据物理上合并数据，而不仅仅是存储统计信息。

Caveats：
注意：我好像从某处记得这取决于查询类型和一小部分数据。这是一个昂贵的操作。

最后，顺序并不是问题，因为：

ZORDER通过减少文件来工作，无论如何都有效果。
如果在数据的子集上同时执行两者，那么是否应用于数据的非重叠部分并不重要。如果数据有重叠部分，那么根据我最近完成的项目经验，您可以选择执行其中一个操作。

英文:

From the manuals:

executeCompaction()
Delta Lake on Databricks can improve the speed of read queries from a table - the well known small files problem. One way to improve this speed is to coalesce small files into larger ones. Simply faster for Spark. This is operation independent of ZORDERbying as such.

Caveats:
If you have a large amount of data and only want to optimize a subset of it, then you can specify an optional partition predicate via WHERE clause so as to reduce the scope. I have a little difficulty on this one, unless partitions are added in order time-wise.

For Unity Catalog managed tables, Databricks tunes most of these configurations automatically, if you’re using a SQL endpoint or Databricks Runtime 11.3 LTS or above. So, some manual, some automatic depending on your environment.

executeZOrderBy
Used for 'enhanced' data skipping (WHERE clause), for which 'standard' data skipping which is standardly applied by Databricks / delta API already and relying on statistics. But you can embellish this all bu ZORDER for high acrdianlity that physically colocates data as opposed to just storing stats.

Caveats: I seem to remember from somewhere it depends on type of query and small range of data. Expensive operation.

Finally Order of is not an issue as 1) ZORDER works by reducing files, anyway. 2) If doing both on a subset of data, it does not matter if you were applying to non-overlapping parts of the data. If the data overlaps, then you would do one or the other is my experience on projects done recently.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Is combining executeCompaction() and executeZOrderBy() in Databricks sensible?

问题

答案1

从Databricks笔记本中删除Azure SQL表

Cplex优化程序返回结果为零。

如何使用正则表达式解决这个Pyspark代码块

如何使用Pyspark获取跨多个文件排序的Parquet行组统计信息？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论