英文:
Generating a flag with respect to a change in consecutive records within partition
问题
I've translated the provided text. Here it is:
我有一个名为 original_table 的 snowflake 表,包括以下字段:case_num、code、project_name、sp_id、updated_date。对于每个 case_num,code、project_name、sp_id 的组合将是唯一的。
还有另一个表,名为 additional_table,只包含 case_num 和 timestamp 列。它不包括像 original_table 中的其他关键列。它是每日截断和加载的表。
我应该检查两个条件来标记 D 标志(软删除):
- 如果当前行与前一行相比,针对特定 case_num,code、project_name 或 sp_id 任一发生更改,则应使用 d 标志软删除前一行。(并且)
- 如果 case_num 今天未在 additional_table 中到达,则我需要将记录标记为软删除(D)。
我面临的问题是准确标记删除,因为 additional_table 没有其他关键列 code、project_name、sp_id。因此,我手动尝试使用以下逻辑标记已删除的记录。
在下面的查询中,第 3、6 和 8 行对于特定的 case_num 具有更改的关键值(任意一个/全部)。因此,更新的值/当前行之前的第 2、5 和 7 行应该被标记为 D。请参考输出。
【以下表格未能保留,若需请重新提供】
根据上述逻辑,我得到了将 D 记录到具有更改值的记录行,但没有到其之前的行的记录。
请问是否可以帮助修改这个 Snowflake 查询或提供其他方法以获得预期的输出。谢谢。
【更新】:
输入场景:
【以下表格未能保留,若需请重新提供】
在上述场景中,在某些 updated_date 上插入了 2 条记录。因此,在 3 个不同日期上共有 3 对记录。根据条件,不应有任何删除的记录,所有记录应为 0。但是输出如下,其中一个最新日期的记录中包括一个删除的记录。
【以下表格未能保留,若需请重新提供】
英文:
I have a snowflake table (called as original_table) with following fields case_num, code, project_name, sp_id, updated_date. For each case_num the combination of code, project_name, sp_id will be unique.
There is another table (called as additional_table) that has only the case_num, timestamp column. It doesnt have the other key columns like original_table. Its a truncate and loaded table daily.
I should check two conditions to mark the D flag (soft delete):
-
If there is any change in either of code OR project_name OR sp_id of
the current_row vs previous_row for a particular case_num then the previous_row should be soft
deleted with d flag. (AND) -
If the case_num did not arrive in additional_table for today's load,
then I will need to mark the record as soft delete (D).
The problem I face is that it is hard to mark the deletes accurately since the additional_table doesnt have the other key columns code, project_name, sp_id .
So, I manually tried to mark the deleted record using below logic.
WITH src AS (
SELECT *,
LAG(code) OVER (PARTITION BY case_num ORDER BY updated_date) AS prev_code,
LAG(project_name) OVER (PARTITION BY case_num ORDER BY updated_date) AS prev_project_name,
LAG(sp_id) OVER (PARTITION BY case_num ORDER BY updated_date) AS prev_sp_id,
LEAD(code) OVER (PARTITION BY case_num ORDER BY updated_date) AS next_code,
LEAD(project_name) OVER (PARTITION BY case_num ORDER BY updated_date) AS next_project_name,
LEAD(sp_id) OVER (PARTITION BY case_num ORDER BY updated_date) AS next_sp_id
FROM original_table
),
act AS (
SELECT case_num, MAX(updated_date) AS max_updated_date
FROM additional_table
GROUP BY case_num
)
SELECT src.*,
CASE WHEN (
(
(prev_code IS NOT NULL AND (code <> prev_code OR code <> next_code))
OR (prev_project_name IS NOT NULL AND (project_name <> prev_project_name OR project_name <> next_project_name))
OR (prev_sp_id IS NOT NULL AND (sp_id <> prev_sp_id OR sp_id <> next_sp_id))
)
AND updated_date <> act.max_updated_date
) OR (ACT.case_num IS NULL)
THEN LAG('D') OVER (PARTITION BY case_num ORDER BY updated_date) ELSE 'N' END AS soft_delete_flag
FROM src
LEFT JOIN act ON src.case_num = act.case_num;
In the below, the 3rd, 6th and 8th rows have changed key values (either one/all) for a particular case_num. So, the 2nd, 5th and 7th row that is previous to the updated valur/current row should be marked as D. Please refer output.
case_num | code | project_name | sp_id | updated_date |
---|---|---|---|---|
1234 | NULL | ABC | 123 | 2023-01-01 |
1234 | NULL | ABC | 123 | 2023-01-02 |
1234 | Value | ABC | 123 | 2023-01-03 |
2345 | NULL | NULL | 456 | 2023-01-01 |
2345 | NULL | ABC | 456 | 2023-01-02 |
2345 | NULL | DEF | 456 | 2023-01-03 |
7890 | NULL | NULL | NULL | 2023-01-01 |
7890 | New_value | BBB | 678 | 2023-01-02 |
"Expected Output":
case_num | code | project_name | sp_id | updated_date | soft_delete_flag |
---|---|---|---|---|---|
1234 | NULL | ABC | 123 | 2023-01-01 | N |
1234 | NULL | ABC | 123 | 2023-01-02 | D |
1234 | Value | ABC | 123 | 2023-01-03 | N |
2345 | NULL | NULL | 456 | 2023-01-01 | N |
2345 | NULL | ABC | 456 | 2023-01-02 | D |
2345 | NULL | DEF | 456 | 2023-01-03 | N |
7890 | NULL | NULL | NULL | 2023-01-01 | D |
7890 | New_value | BBB | 678 | 2023-01-02 | N |
"Current Output":
case_num | code | project_name | sp_id | updated_date | soft_delete_flag |
---|---|---|---|---|---|
1234 | NULL | ABC | 123 | 2023-01-01 | N |
1234 | NULL | ABC | 123 | 2023-01-02 | N |
1234 | Value | ABC | 123 | 2023-01-03 | D |
2345 | NULL | NULL | 456 | 2023-01-01 | N |
2345 | NULL | NULL | 456 | 2023-01-02 | N |
2345 | NULL | DEF | 456 | 2023-01-03 | D |
7890 | NULL | NULL | NULL | 2023-01-01 | D |
7890 | New_value | BBB | 678 | 2023-01-02 | D |
As per above logic, I am getting the D records to the record row that has changed values, but not to the row previous to it.
Please could anyone help to modify this SNOWFLAKE query or any other approach to get the expected output. TIA.
**
- Updated
**:
Input scenario:
case_num | code | project_name | sp_id | updated_date |
---|---|---|---|---|
1234 | PGK123 | ABC | NO_VALUE | 2023-01-01 |
1234 | PGK456 | ABC | NO_VALUE | 2023-01-01 |
1234 | PGK123 | ABC | NO_VALUE | 2023-01-05 |
1234 | PGK456 | ABC | NO_VALUE | 2023-01-05 |
1234 | PGK123 | ABC | NO_VALUE | 2023-01-08 |
1234 | PGK456 | ABC | NO_VALUE | 2023-01-08 |
In the above scenario, 2 records are inserted on certain updated_date. So total of 3 pairs on 3 different dates. As per the conditions, there should not be any Deleted record and all of them should be 0. But the output shows like below with a deleted record with one of the record of the latest date
case_num | code | project_name | sp_id | updated_date | soft_delete_flag |
---|---|---|---|---|---|
1234 | PGK123 | ABC | NO_VALUE | 2023-01-01 | 0 |
1234 | PGK456 | ABC | NO_VALUE | 2023-01-01 | 0 |
1234 | PGK123 | ABC | NO_VALUE | 2023-01-05 | 0 |
1234 | PGK456 | ABC | NO_VALUE | 2023-01-05 | 0 |
1234 | PGK123 | ABC | NO_VALUE | 2023-01-08 | 1 |
1234 | PGK456 | ABC | NO_VALUE | 2023-01-08 | 0 |
答案1
得分: 1
问题在于,当连续的空值出现时,您的NULL值应该被视为相等。您可以采取的一种方法是将您的空值合并为临时字符串"NULL"(对于字符串)和整数"-1"(假设-1是列"sp_id"不可能具有的值)。
COALESCE( code, 'NULL') AS coal_c,
COALESCE(project_name, 'NULL') AS coal_pn,
COALESCE( sp_id, -1) AS coal_spid
然后,根据下一行的值计算值。
LEAD(COALESCE( code, 'NULL'))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
LEAD(COALESCE(project_name, 'NULL'))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
LEAD(COALESCE( sp_id, -1))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid
完成后,您可以使用CASE
表达式检查您的三条记录的连续值之间是否有差异。
CASE WHEN NOT coal_c = next_coal_c
OR NOT coal_pn = next_coal_pn
OR NOT coal_spid = next_coal_spid
THEN 1 ELSE 0 END DESC
但是我们只想将"soft_delete_flag"列的最新更改设置为1,因此我们使用ROW_NUMBER
根据以下顺序对您的记录进行排序:
- 先前生成的标志降序排列(值为1的排在前面)
- 日期降序排列
ROW_NUMBER() OVER(
PARTITION BY case_num
ORDER BY CASE WHEN ...
THEN 1 ELSE 0 END DESC,
updated_date DESC
)
当行号等于1时,我们将设置为1,否则设置为0,使用CASE
表达式。
CASE WHEN ROW_NUMBER() OVER(
) = 1
THEN 'D'
ELSE 'N'
END AS soft_delete_flag
以下是完整的代码:
WITH cte AS (
SELECT *,
COALESCE( code, 'NULL') AS coal_c,
COALESCE(project_name, 'NULL') AS coal_pn,
COALESCE( sp_id, -1) AS coal_spid,
LEAD(COALESCE( code, 'NULL'))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
LEAD(COALESCE(project_name, 'NULL'))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
LEAD(COALESCE( sp_id, -1))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid
FROM tab
)
SELECT case_num,
code,
project_name,
sp_id,
updated_date,
CASE WHEN ROW_NUMBER() OVER(
PARTITION BY case_num
ORDER BY CASE WHEN NOT coal_c = next_coal_c
OR NOT coal_pn = next_coal_pn
OR NOT coal_spid = next_coal_spid
THEN 1 ELSE 0 END DESC,
updated_date DESC
) = 1
THEN 'D'
ELSE 'N'
END AS soft_delete_flag
FROM cte
英文:
The problem here is that your NULL values shall be matched as equal when consecutive nulls are found. One thing you can do to approach this issue is to coalesce your null values to temporary "NULL" for strings and "-1" for integers (assuming -1 is a value that the column "sp_id" cannot have).
COALESCE( code, 'NULL') AS coal_c,
COALESCE(project_name, 'NULL') AS coal_pn,
COALESCE( sp_id, -1) AS coal_spid
And compute the values from the next rows accordingly.
LEAD(COALESCE( code, 'NULL'))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
LEAD(COALESCE(project_name, 'NULL'))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
LEAD(COALESCE( sp_id, -1))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid
Once you've done that, you can check when there's a difference between consecutive values for your three records, with a CASE
expression.
CASE WHEN NOT coal_c = next_coal_c
OR NOT coal_pn = next_coal_pn
OR NOT coal_spid = next_coal_spid
THEN 1 ELSE 0 END DESC
But we want to set only the latest change to 1 for the "soft_delete_flag" column, hence we use a ROW_NUMBER
to order your records according to:
- the previous generated flag descendently (values with 1 come first)
- the date descendently
ROW_NUMBER() OVER(
PARTITION BY case_num
ORDER BY CASE WHEN ...
THEN 1 ELSE 0 END DESC,
updated_date DESC
)
And when row number will be equal to 1, we will set 1, otherwise 0, with a CASE
expression.
CASE WHEN ROW_NUMBER() OVER(
) = 1
THEN 'D'
ELSE 'N'
END AS soft_delete_flag
Here's the full code:
WITH cte AS (
SELECT *,
COALESCE( code, 'NULL') AS coal_c,
COALESCE(project_name, 'NULL') AS coal_pn,
COALESCE( sp_id, -1) AS coal_spid,
LEAD(COALESCE( code, 'NULL'))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
LEAD(COALESCE(project_name, 'NULL'))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
LEAD(COALESCE( sp_id, -1))
OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid
FROM tab
)
SELECT case_num,
code,
project_name,
sp_id,
updated_date,
CASE WHEN ROW_NUMBER() OVER(
PARTITION BY case_num
ORDER BY CASE WHEN NOT coal_c = next_coal_c
OR NOT coal_pn = next_coal_pn
OR NOT coal_spid = next_coal_spid
THEN 1 ELSE 0 END DESC,
updated_date DESC
) = 1
THEN 'D'
ELSE 'N'
END AS soft_delete_flag
FROM cte
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论