The dbt select * idiom and Snowflake
Question
As described in the "How we structure our dbt project" guide, a commonly suggested dbt idiom / best practice is to begin by defining CTEs around the contributing tables, using select * statements. For example:
with
orders as (
select * from {{ ref('stg_jaffle_shop__orders') }}
),
order_payments as (
select * from {{ ref('int_payments_pivoted_to_orders') }}
),
orders_and_order_payments_joined as (
select
. . . .
from orders
left join order_payments on orders.order_id = order_payments.order_id
)
select * from orders_and_order_payments_joined
This is routinely shown in dbt sample code, even when not all the columns from the contributing tables are ultimately used downstream. The rationale we've found described is that this makes for cleaner, more object-like code.
We are running on Snowflake. I have seen assertions in dbt groups that the select * idiom does not impact performance on that platform, due to under-the-hood query optimization etc., but I'd like to know for sure. We have had trouble benchmarking it ourselves due to caching. Do you know if this is the case, or should we avoid select * here and specify the columns we really want from the very beginning?
Answer 1
Score: 2
The select * idiom in CTEs can cause less-than-optimal plans when a CTE is referenced more than once in a query.
The reason is that Snowflake will always "materialize" a CTE when it is referenced more than once. This means that predicates from the outer query are not pushed down into the CTE.
In contrast to Snowflake, Postgres lets you control this kind of Common Table Expression materialization via its NOT MATERIALIZED clause. The PostgreSQL documentation describes the trade-off:
> A useful property of WITH queries is that they are normally evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work.
> However, the other side of this coin is that the optimizer is not able to push restrictions from the parent query down into a multiply-referenced WITH query, since that might affect all uses of the WITH query's output when it should affect only one.
> You can override that decision by specifying MATERIALIZED to force separate calculation of the WITH query, or by specifying NOT MATERIALIZED to force it to be merged into the parent query. The latter choice risks duplicate computation of the WITH query, but it can still give a net savings if each usage of the WITH query needs only a small part of the WITH query's full output.
So, until/unless Snowflake supports something like:
with
orders as NOT MATERIALIZED (
select * from {{ ref('stg_jaffle_shop__orders') }}
)
...
(or, better, until the Snowflake optimizer automatically decides when to "materialize" CTEs), if the rest of your query uses orders more than once and filters it, you should consider repeating the select * in separate CTEs, one per use, or manually moving the filter(s) into the CTE.
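As a rough sketch of both workarounds combined: each copy of the select * CTE is referenced exactly once and already carries its own filter. Only the ref() name comes from the question; the customer_id and status columns, the status values, and the CTE names are assumptions for illustration.

```sql
-- Sketch only: customer_id, status, and the status values are assumed columns/values.
with

-- One copy of the import CTE per use, with the filter moved inside it by hand.
completed_orders as (
    select * from {{ ref('stg_jaffle_shop__orders') }}
    where status = 'completed'
),

returned_orders as (
    select * from {{ ref('stg_jaffle_shop__orders') }}
    where status = 'returned'
),

completed_per_customer as (
    select customer_id, count(*) as completed_order_count
    from completed_orders
    group by customer_id
),

returned_per_customer as (
    select customer_id, count(*) as returned_order_count
    from returned_orders
    group by customer_id
)

select
    c.customer_id,
    c.completed_order_count,
    coalesce(r.returned_order_count, 0) as returned_order_count
from completed_per_customer as c
left join returned_per_customer as r
    on c.customer_id = r.customer_id
```

The cost is some repetition in the model; the benefit is that no CTE is referenced twice, so the filter does not depend on the optimizer pushing it down.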
P.S. Snowflake will always prune away unneeded columns in a CTE even if it is referenced more than once. The query plan currently gives the incorrect impression that column pruning does not happen, even though it does occur at execution time.
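If you want to check the pruning behaviour yourself despite the caching trouble mentioned in the question, something along these lines may help. It turns off the result cache for the session, tags the test queries, and then compares bytes scanned. The schema, table, and column names and the tag value are placeholders, not taken from the question.

```sql
-- Sketch only: analytics.stg_jaffle_shop__orders, order_id, order_date and the tag are placeholders.
alter session set use_cached_result = false;      -- do not serve results from the result cache
alter session set query_tag = 'select_star_test'; -- label the test queries for lookup below

-- Variant A: select * in the import CTE, only two columns used downstream.
with orders as (
    select * from analytics.stg_jaffle_shop__orders
)
select order_id, order_date from orders;

-- Variant B: explicit column list in the import CTE.
with orders as (
    select order_id, order_date from analytics.stg_jaffle_shop__orders
)
select order_id, order_date from orders;

-- Compare what each variant actually read.
select query_text, bytes_scanned, total_elapsed_time
from table(information_schema.query_history())
where query_tag = 'select_star_test'
order by start_time;
```

The warehouse's local disk cache can still influence elapsed time, so bytes scanned is probably the steadier signal here. If both variants scan roughly the same number of bytes, that is consistent with column pruning happening at execution time.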