当使用 “>” 和 “<=" 时,加载增量值的适当逻辑是什么?

huangapple go评论60阅读模式
英文:

What is the appropriate logic to load delta values when using > and <=?

问题

我正在编写一个用于初始完整加载和后续增量加载的ETL。

在水印表中,我存储表名和日期。每次ETL运行时,我将这个值加载到变量(@last_run)中。

从源表中,我选择最大的日期时间值并存储在SQL变量(@current_max_date)中。

这两者都是日期时间数据类型。

我使用以下逻辑来加载增量值:

INSERT INTO SINK_TBL 
SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column > @last_run and TIMESTAMP_Column <= @current_max_date
INSERT INTO watermark(tablename,dt) values ('table_name',@current_max_date)

这个方法运行良好,但是我越想越觉得,是否可能出现这样的情况,即在获取增量数据(上面的查询)的同时,会发生插入操作,导致查询可能错过那个记录。

我正在考虑的备选方案有:

  1. @current_max_date减去1秒
  2. 或者将查询更改为SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column >= @last_run and TIMESTAMP_Column < @current_max_date

哪种安全选项可以确保增量加载不会错过任何数据记录。

英文:

I am writing an ETL for initial full load and subsequent delta loads.

In the watermark table I am storing the table name, date. At each ETL run I load this value into variable (@last_run).

From the source table I select the max datetime value into a SQL variable (@current_max_date).

Both these are datetime data type.

I use the following logic to load delta values:

INSERT INTO SINK_TBL 
SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column &gt; @last_run and TIMESTAMP_Column &lt;= @current_max_date
INSERT INTO watermark(tablename,dt) values (&#39;table_name&#39;,@current_max_date)

This works fine, but the more I think about this, I am wondering whether there can be a situation where-in while the delta data (above query) is fetched, an insert might occur during that split second and the query may miss that record.

The alternate options I am thinking of are:

  1. Subtract 1 second from @current_max_date
  2. Or change the query to SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column &gt;= @last_run and TIMESTAMP_Column &lt; @current_max_date

What is the safe option to ensure delta load doesn't miss out on any data records.

答案1

得分: 1

如果您希望绝对保证不会错过任何记录,理论上最好的解决方案是:

  1. 读取数据库日志文件,而不是表格,以获取更改。然而,如果您的数据库管理系统不提供这样的机制,这可能需要使用商业CDC工具。
  2. 每次提取完整的表格 - 尽管如果数据量很大,这可能不切实际/具有成本效益。

实际上,最佳解决方案是使用您列出的选项1。您应该向后移动窗口的起始时间(1秒、1分钟、1小时)取决于您对系统时间同步的信心以及写入TIMESTAMP_Column的逻辑(即,在记录写入时是否将其设置为sysdate,还是在(可能运行时间很长的)事务的开始时定义该值)。

英文:

If you want an absolute guarantee that no records are missed then the best solutions, in theory, would be:

  1. Read the DB log files, rather than the tables, to get changes. However this is likely to require a commercial CDC tool if your DBMS doesn't provide a mechanism for doing this
  2. Extract the full table each time - though if the data volumes are of any significant size this is unlikely to be practicable/cost-effective

In reality, the best solution is to use the option 1 you listed. How far back you move the start of the window (1 second, 1 minute, 1 hour) depends on how confident you are that your systems are in sync time-wise and what the logic is for writing to TIMESTAMP_Column (i.e. does it get set to sysdate at the point the record is written or is the value being defined at the start of a (potentially long-running) transaction)

huangapple
  • 本文由 发表于 2023年7月27日 18:28:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/76778839.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定