2023年6月6日 11:56:31go评论170阅读模式

英文:

Generating a flag with respect to a change in consecutive records within partition

问题

I've translated the provided text. Here it is:

我有一个名为 original_table 的 snowflake 表，包括以下字段：case_num、code、project_name、sp_id、updated_date。对于每个 case_num，code、project_name、sp_id 的组合将是唯一的。

还有另一个表，名为 additional_table，只包含 case_num 和 timestamp 列。它不包括像 original_table 中的其他关键列。它是每日截断和加载的表。

我应该检查两个条件来标记 D 标志（软删除）：

- 如果当前行与前一行相比，针对特定 case_num，code、project_name 或 sp_id 任一发生更改，则应使用 d 标志软删除前一行。(并且)
  
- 如果 case_num 今天未在 additional_table 中到达，则我需要将记录标记为软删除（D）。

我面临的问题是准确标记删除，因为 additional_table 没有其他关键列 code、project_name、sp_id。因此，我手动尝试使用以下逻辑标记已删除的记录。

在下面的查询中，第 3、6 和 8 行对于特定的 case_num 具有更改的关键值（任意一个/全部）。因此，更新的值/当前行之前的第 2、5 和 7 行应该被标记为 D。请参考输出。

【以下表格未能保留，若需请重新提供】

根据上述逻辑，我得到了将 D 记录到具有更改值的记录行，但没有到其之前的行的记录。

请问是否可以帮助修改这个 Snowflake 查询或提供其他方法以获得预期的输出。谢谢。

【更新】：

输入场景：

【以下表格未能保留，若需请重新提供】

在上述场景中，在某些 updated_date 上插入了 2 条记录。因此，在 3 个不同日期上共有 3 对记录。根据条件，不应有任何删除的记录，所有记录应为 0。但是输出如下，其中一个最新日期的记录中包括一个删除的记录。

【以下表格未能保留，若需请重新提供】

英文:

I have a snowflake table (called as original_table) with following fields case_num, code, project_name, sp_id, updated_date. For each case_num the combination of code, project_name, sp_id will be unique.

There is another table (called as additional_table) that has only the case_num, timestamp column. It doesnt have the other key columns like original_table. Its a truncate and loaded table daily.

I should check two conditions to mark the D flag (soft delete):

If there is any change in either of code OR project_name OR sp_id of
the current_row vs previous_row for a particular case_num then the previous_row should be soft
deleted with d flag. (AND)
If the case_num did not arrive in additional_table for today's load,
then I will need to mark the record as soft delete (D).

The problem I face is that it is hard to mark the deletes accurately since the additional_table doesnt have the other key columns code, project_name, sp_id .
So, I manually tried to mark the deleted record using below logic.

WITH src AS (
  SELECT *,
         LAG(code) OVER (PARTITION BY case_num ORDER BY updated_date) AS prev_code,
         LAG(project_name) OVER (PARTITION BY case_num ORDER BY updated_date) AS prev_project_name,
         LAG(sp_id) OVER (PARTITION BY case_num ORDER BY updated_date) AS prev_sp_id,
         LEAD(code) OVER (PARTITION BY case_num ORDER BY updated_date) AS next_code,
         LEAD(project_name) OVER (PARTITION BY case_num ORDER BY updated_date) AS next_project_name,
         LEAD(sp_id) OVER (PARTITION BY case_num ORDER BY updated_date) AS next_sp_id
  FROM original_table
),
act AS (
  SELECT case_num, MAX(updated_date) AS max_updated_date
  FROM additional_table
  GROUP BY case_num
)
SELECT src.*,
       CASE WHEN (
            (
              (prev_code IS NOT NULL AND (code &lt;&gt; prev_code OR code &lt;&gt; next_code))
              OR (prev_project_name IS NOT NULL AND (project_name &lt;&gt; prev_project_name OR project_name &lt;&gt; next_project_name))
              OR (prev_sp_id IS NOT NULL AND (sp_id &lt;&gt; prev_sp_id OR sp_id &lt;&gt; next_sp_id))
            )
            AND updated_date &lt;&gt; act.max_updated_date
			
           ) OR (ACT.case_num IS NULL)
           THEN LAG(&#39;D&#39;) OVER (PARTITION BY case_num ORDER BY updated_date) ELSE &#39;N&#39; END AS soft_delete_flag
FROM src
LEFT JOIN act ON src.case_num = act.case_num;

In the below, the 3rd, 6th and 8th rows have changed key values (either one/all) for a particular case_num. So, the 2nd, 5th and 7th row that is previous to the updated valur/current row should be marked as D. Please refer output.

case_num	code	project_name	sp_id	updated_date
1234	NULL	ABC	123	2023-01-01
1234	NULL	ABC	123	2023-01-02
1234	Value	ABC	123	2023-01-03
2345	NULL	NULL	456	2023-01-01
2345	NULL	ABC	456	2023-01-02
2345	NULL	DEF	456	2023-01-03
7890	NULL	NULL	NULL	2023-01-01
7890	New_value	BBB	678	2023-01-02

"Expected Output":

case_num	code	project_name	sp_id	updated_date	soft_delete_flag
1234	NULL	ABC	123	2023-01-01	N
1234	NULL	ABC	123	2023-01-02	D
1234	Value	ABC	123	2023-01-03	N
2345	NULL	NULL	456	2023-01-01	N
2345	NULL	ABC	456	2023-01-02	D
2345	NULL	DEF	456	2023-01-03	N
7890	NULL	NULL	NULL	2023-01-01	D
7890	New_value	BBB	678	2023-01-02	N

"Current Output":

case_num	code	project_name	sp_id	updated_date	soft_delete_flag
1234	NULL	ABC	123	2023-01-01	N
1234	NULL	ABC	123	2023-01-02	N
1234	Value	ABC	123	2023-01-03	D
2345	NULL	NULL	456	2023-01-01	N
2345	NULL	NULL	456	2023-01-02	N
2345	NULL	DEF	456	2023-01-03	D
7890	NULL	NULL	NULL	2023-01-01	D
7890	New_value	BBB	678	2023-01-02	D

As per above logic, I am getting the D records to the record row that has changed values, but not to the row previous to it.

Please could anyone help to modify this SNOWFLAKE query or any other approach to get the expected output. TIA.

Updated

**:
Input scenario:

case_num	code	project_name	sp_id	updated_date
1234	PGK123	ABC	NO_VALUE	2023-01-01
1234	PGK456	ABC	NO_VALUE	2023-01-01
1234	PGK123	ABC	NO_VALUE	2023-01-05
1234	PGK456	ABC	NO_VALUE	2023-01-05
1234	PGK123	ABC	NO_VALUE	2023-01-08
1234	PGK456	ABC	NO_VALUE	2023-01-08

In the above scenario, 2 records are inserted on certain updated_date. So total of 3 pairs on 3 different dates. As per the conditions, there should not be any Deleted record and all of them should be 0. But the output shows like below with a deleted record with one of the record of the latest date

case_num	code	project_name	sp_id	updated_date	soft_delete_flag
1234	PGK123	ABC	NO_VALUE	2023-01-01	0
1234	PGK456	ABC	NO_VALUE	2023-01-01	0
1234	PGK123	ABC	NO_VALUE	2023-01-05	0
1234	PGK456	ABC	NO_VALUE	2023-01-05	0
1234	PGK123	ABC	NO_VALUE	2023-01-08	1
1234	PGK456	ABC	NO_VALUE	2023-01-08	0

答案1

得分: 1

问题在于，当连续的空值出现时，您的NULL值应该被视为相等。您可以采取的一种方法是将您的空值合并为临时字符串"NULL"（对于字符串）和整数"-1"（假设-1是列"sp_id"不可能具有的值）。

COALESCE(        code, 'NULL') AS coal_c,
COALESCE(project_name, 'NULL') AS coal_pn,
COALESCE(       sp_id,     -1) AS coal_spid

然后，根据下一行的值计算值。

LEAD(COALESCE(        code, 'NULL')) 
     OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
LEAD(COALESCE(project_name, 'NULL')) 
     OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
LEAD(COALESCE(       sp_id,     -1)) 
     OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid

完成后，您可以使用CASE表达式检查您的三条记录的连续值之间是否有差异。

CASE WHEN NOT coal_c = next_coal_c 
       OR NOT coal_pn = next_coal_pn 
       OR NOT coal_spid = next_coal_spid
     THEN 1 ELSE 0 END DESC

但是我们只想将"soft_delete_flag"列的最新更改设置为1，因此我们使用ROW_NUMBER根据以下顺序对您的记录进行排序：

先前生成的标志降序排列（值为1的排在前面）
日期降序排列

ROW_NUMBER() OVER(
    PARTITION BY case_num 
    ORDER BY CASE WHEN ...
                  THEN 1 ELSE 0 END DESC, 
             updated_date DESC
)

当行号等于1时，我们将设置为1，否则设置为0，使用CASE表达式。

CASE WHEN ROW_NUMBER() OVER(
          ) = 1
     THEN 'D' 
     ELSE 'N' 
END AS soft_delete_flag

以下是完整的代码：

WITH cte AS (
    SELECT *, 
           COALESCE(        code, 'NULL') AS coal_c,
           COALESCE(project_name, 'NULL') AS coal_pn,
           COALESCE(       sp_id,     -1) AS coal_spid,
           LEAD(COALESCE(        code, 'NULL')) 
               OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
           LEAD(COALESCE(project_name, 'NULL')) 
               OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
           LEAD(COALESCE(       sp_id,     -1)) 
               OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid
    FROM tab
)
SELECT case_num,
       code,
       project_name,
       sp_id,
       updated_date,
       CASE WHEN ROW_NUMBER() OVER(
                     PARTITION BY case_num 
                     ORDER BY CASE WHEN NOT coal_c = next_coal_c 
                                     OR NOT coal_pn = next_coal_pn 
                                     OR NOT coal_spid = next_coal_spid
                                   THEN 1 ELSE 0 END DESC, 
                              updated_date DESC
                 ) = 1
            THEN 'D' 
            ELSE 'N' 
       END AS soft_delete_flag
FROM cte

英文:

The problem here is that your NULL values shall be matched as equal when consecutive nulls are found. One thing you can do to approach this issue is to coalesce your null values to temporary "NULL" for strings and "-1" for integers (assuming -1 is a value that the column "sp_id" cannot have).

COALESCE(        code, &#39;NULL&#39;) AS coal_c,
COALESCE(project_name, &#39;NULL&#39;) AS coal_pn,
COALESCE(       sp_id,     -1) AS coal_spid

And compute the values from the next rows accordingly.

LEAD(COALESCE(        code, &#39;NULL&#39;)) 
     OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
LEAD(COALESCE(project_name, &#39;NULL&#39;)) 
     OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
LEAD(COALESCE(       sp_id,     -1)) 
     OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid

Once you've done that, you can check when there's a difference between consecutive values for your three records, with a CASE expression.

CASE WHEN NOT coal_c = next_coal_c 
       OR NOT coal_pn = next_coal_pn 
       OR NOT coal_spid = next_coal_spid
     THEN 1 ELSE 0 END DESC

But we want to set only the latest change to 1 for the "soft_delete_flag" column, hence we use a ROW_NUMBER to order your records according to:

the previous generated flag descendently (values with 1 come first)
the date descendently

ROW_NUMBER() OVER(
    PARTITION BY case_num 
    ORDER BY CASE WHEN ...
                  THEN 1 ELSE 0 END DESC, 
             updated_date DESC
)

And when row number will be equal to 1, we will set 1, otherwise 0, with a CASE expression.

CASE WHEN ROW_NUMBER() OVER(
          ) = 1
     THEN &#39;D&#39; 
     ELSE &#39;N&#39; 
END AS soft_delete_flag

Here's the full code:

WITH cte AS (
    SELECT *, 
           COALESCE(        code, &#39;NULL&#39;) AS coal_c,
           COALESCE(project_name, &#39;NULL&#39;) AS coal_pn,
           COALESCE(       sp_id,     -1) AS coal_spid,
           LEAD(COALESCE(        code, &#39;NULL&#39;)) 
               OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
           LEAD(COALESCE(project_name, &#39;NULL&#39;)) 
               OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
           LEAD(COALESCE(       sp_id,     -1)) 
               OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid
    FROM tab
)
SELECT case_num,
       code,
       project_name,
       sp_id,
       updated_date,
       CASE WHEN ROW_NUMBER() OVER(
                     PARTITION BY case_num 
                     ORDER BY CASE WHEN NOT coal_c = next_coal_c 
                                     OR NOT coal_pn = next_coal_pn 
                                     OR NOT coal_spid = next_coal_spid
                                   THEN 1 ELSE 0 END DESC, 
                              updated_date DESC
                 ) = 1
            THEN &#39;D&#39; 
            ELSE &#39;N&#39; 
       END AS soft_delete_flag
FROM cte

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

生成一个标志，以反映分区内连续记录的变化。

问题

答案1

在Spring中，规范中对于”like”的使用不起作用。

如何在Snowflake中每个ID只显示1行，而不是2行？

冗余数据在规范化的数据库结构中是否可接受？

如何在Python的if语句中将PostgreSQL布尔值用作参数？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论