英文:
Snowflake delete query scanning all partitions
问题
我有一个ETL过程,它正在使用一个类似于:1801b08dd8731d35bb561943e708f7e3 的唯一哈希替代键,从一个包含180亿行的表中删除几十万行。
delete from CUSTOMER_CONFORM_PROD.c360.engagement
where (
engagement_surrogate_key) in (
select (engagement_surrogate_key)
from CUSTOMER_CONFORM_PROD.c360.engagement__dbt_tmp
);
每次在一个小型数据仓库上需要4到6分钟。我已经在engagement_surrogate_key
上添加了一个聚簇键,但由于它具有高基数且是唯一的,没有帮助。我还启用了搜索优化服务,但也没有帮助,它仍然在扫描所有分区。如何加速删除操作?
英文:
I have an ETL process that it's deleting a couple hundred thousand rows from a table with 18 billion rows using a unique hashed surrogate key like: 1801b08dd8731d35bb561943e708f7e3
delete from CUSTOMER_CONFORM_PROD.c360.engagement
where (
engagement_surrogate_key) in (
select (engagement_surrogate_key)
from CUSTOMER_CONFORM_PROD.c360.engagement__dbt_tmp
);
This is taking from 4 to 6 minutes each time on a Small warehouse. I have added a clustering key on the engagement_surrogate_key but since it's unique with high cardinality it didn't help. I have also enabled search optimization service but that also didn't help and it's still scanning all partitions. How can I speed up the deletion?
答案1
得分: 0
以下是您要翻译的部分:
"deletion can be speed up limiting the scan on the destination table by adding a date range, for example, filtering for only the past month worth of data: loaded_date>=dateadd(MM, -1, current_date)
. If you are using dbt they have implemented that functionality using this macro:
{% macro default__get_incremental_merge_sql(arg_dict) %}
{% do return(get_merge_sql(arg_dict["target_relation"], arg_dict["temp_relation"], arg_dict["unique_key"], arg_dict["dest_columns"], arg_dict["predicates"])) %}
{% endmacro %}
So you can add the predicate to the dbt incremental model config like this:
{{ config(materialized= 'incremental', unique_key='engagement_surrogate_key', predicates=['loaded_date>=dateadd(M, -1, current_date)'])}}
When you run your model, the code generated will be this:
delete from CUSTOMER_CONFORM_PROD.c360.engagement
where (
engagement_surrogate_key) in (
select (engagement_surrogate_key)
from CUSTOMER_CONFORM_PROD.c360.engagement__dbt_tmp
)
and loaded_date>=dateadd(MM, -1, current_date);"
希望这对您有所帮助。
英文:
The deletion can be speed up limiting the scan on the destination table by adding a date range, for example, filtering for only the past month worth of data: loaded_date>=dateadd(MM, -1, current_date)
. If you are using dbt they have implemented that functionality using this macro:
{% macro default__get_incremental_merge_sql(arg_dict) %}
{% do return(get_merge_sql(arg_dict["target_relation"], arg_dict["temp_relation"], arg_dict["unique_key"], arg_dict["dest_columns"], arg_dict["predicates"])) %}
{% endmacro %}
So you can add the predicate to the dbt incremental model config like this:
{{ config(materialized= 'incremental', unique_key='engagement_surrogate_key', predicates=['loaded_date>=dateadd(M, -1, current_date)'])}}
When you run your model, the code generated will be this:
delete from CUSTOMER_CONFORM_PROD.c360.engagement
where (
engagement_surrogate_key) in (
select (engagement_surrogate_key)
from CUSTOMER_CONFORM_PROD.c360.engagement__dbt_tmp
)
and loaded_date>=dateadd(MM, -1, current_date);
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论