Azure Databricks Delta Live Table stored as SCD 2 is creating new records when no data changes

huangapple go评论89阅读模式
英文:

Azure Databricks Delta Live Table stored as SCD 2 is creating new records when no data changes

问题

我有一个流水线,从数据湖中摄取 JSON 文件。这些文件定期被倒入那里。大多数情况下,文件包含重复的数据,但偶尔会有变化。
我正在尝试使用 Delta Live 表流水线的 STORED AS SCD 2 选项将这些文件处理到数据仓库中。
一切看起来都运行正常,但我在 scd 表中得到了重复的行,尽管没有数据发生变化。

在流水线集群配置中,我添加了 `pipelines.enableTrackHistory True`
[pipeline 配置](https://i.stack.imgur.com/THOaR.png)

这是 SCD 表的 SQL:

CREATE OR REFRESH STREAMING LIVE TABLE currStudents_SCD;

APPLY CHANGES INTO
live.currStudents_SCD
FROM
stream(live.currStudents_ingest)
KEYS
(id)
SEQUENCE BY
file_modification_time
STORED AS
SCD TYPE 2
TRACK HISTORY ON * EXCEPT (file_modification_time)
;


为什么会出现这种情况以及如何阻止这些重复行,有什么想法吗?

我在流水线集群配置中添加了 `pipelines.enableTrackHistory True`
[pipeline 配置](https://i.stack.imgur.com/THOaR.png)

这是 SCD 表的 SQL:

CREATE OR REFRESH STREAMING LIVE TABLE currStudents_SCD;

APPLY CHANGES INTO
live.currStudents_SCD
FROM
stream(live.currStudents_ingest)
KEYS
(id)
SEQUENCE BY
file_modification_time
STORED AS
SCD TYPE 2
TRACK HISTORY ON * EXCEPT (file_modification_time)
;


为什么会出现这种情况以及如何阻止这些重复行,有什么想法吗?

我无法在文档中找到任何有关此问题的信息。除非可能的模糊建议,即在 SCD 2 中,当没有列发生变化时会有新的记录。但跟踪历史文档似乎表明,只有对受监视列的更改才会导致新的记录....

------------------------
@Shivam Shukla 下面是正确的。然而,对于那些想要解决这个问题的人,Databricks 能够帮助我。我现在在流式传输和 SCD2 表之间添加了一个新的阶段。这个阶段使用了选项:

    df.dropDuplicates([col for col in df.columns if col != "file_modification_time"])

以确保只传递发生更改的内容到 SCD2 跟踪,而不包括我的 file_modification_time 列在重复项评估中。
英文:

I have a streaming pipeline that ingests json files from a data lake. These files are dumped there periodically. Mostly the files contain duplicate data, but there are occasional changes.
I am trying to process these files into a data warehouse using the STORED AS SCD 2 option of the Delta Live tables pipeline.
Everything looks to be working fine, but I am getting duplicate rows in the scd table - even though no data is changing.

In the pipeline cluster configuration, I have added pipelines.enableTrackHistory True
pipeline config

Here is the SQL for the SCD table:

CREATE OR REFRESH STREAMING LIVE TABLE currStudents_SCD;

APPLY CHANGES INTO
  live.currStudents_SCD
FROM
  stream(live.currStudents_ingest)
KEYS
  (id)
SEQUENCE BY
  file_modification_time
STORED AS
  SCD TYPE 2
TRACK HISTORY ON * EXCEPT (file_modification_time)
;

Any ideas why I get this and how I can stop these duplicates?

I can't find any information in the documentation to help with this. Except possibly, the vague suggestion that, in SCD 2, there will be new records when no columns change. But the Track History documentation seems to indicate that only changes to monitored columns will result in a new record....


@Shivam Shukla below is correct. However, for those who would like to address this issue, Databricks were able to help me out. I have now added a new stage between the ingestion of the stream and the SCD2 table. This stage uses the option:

df.dropDuplicates([col for col in df.columns if col != "file_modification_time"])

option to ensure that only changes are past to the SCD2 tracking, without including my file_modification_time column in the assessment of duplicates

答案1

得分: 0

这是 Databricks Delta Live Table 预期的行为。如果使用 SCD 2 模式,不管记录发生了什么变化,都会保留记录的历史。跟踪历史可以帮助您指定要用于跟踪更新的列。如果不指定任何列,将报告所有历史记录。如果提供了跟踪列,只有在这些列发生更改时才会创建历史记录,否则它将简单地覆盖表,就像 SCD 1 模式一样。

英文:

This is a expected behaviour of databricks delta live table. If you use SCD 2 pattern irrespective of the changes the history of the records are maintained. Track history helps you with specifying which columns to consider for tracking updates. If you don't specify any columns all history is reported. If tracking columns are provided this will basically only create history if there are changes in this columns else it will simply overwrite the table like in SCD 1 pattern.

huangapple
  • 本文由 发表于 2023年3月9日 22:56:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/75686288.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定