英文:
What is the appropriate logic to load delta values when using > and <=?
问题
我正在编写一个用于初始完整加载和后续增量加载的ETL。
在水印表中,我存储表名和日期。每次ETL运行时,我将这个值加载到变量(@last_run)中。
从源表中,我选择最大的日期时间值并存储在SQL变量(@current_max_date)中。
这两者都是日期时间数据类型。
我使用以下逻辑来加载增量值:
INSERT INTO SINK_TBL
SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column > @last_run and TIMESTAMP_Column <= @current_max_date
INSERT INTO watermark(tablename,dt) values ('table_name',@current_max_date)
这个方法运行良好,但是我越想越觉得,是否可能出现这样的情况,即在获取增量数据(上面的查询)的同时,会发生插入操作,导致查询可能错过那个记录。
我正在考虑的备选方案有:
- 从
@current_max_date
减去1秒 - 或者将查询更改为
SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column >= @last_run and TIMESTAMP_Column < @current_max_date
哪种安全选项可以确保增量加载不会错过任何数据记录。
英文:
I am writing an ETL for initial full load and subsequent delta loads.
In the watermark table I am storing the table name, date. At each ETL run I load this value into variable (@last_run).
From the source table I select the max datetime value into a SQL variable (@current_max_date).
Both these are datetime data type.
I use the following logic to load delta values:
INSERT INTO SINK_TBL
SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column > @last_run and TIMESTAMP_Column <= @current_max_date
INSERT INTO watermark(tablename,dt) values ('table_name',@current_max_date)
This works fine, but the more I think about this, I am wondering whether there can be a situation where-in while the delta data (above query) is fetched, an insert might occur during that split second and the query may miss that record.
The alternate options I am thinking of are:
- Subtract 1 second from
@current_max_date
- Or change the query to
SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column >= @last_run and TIMESTAMP_Column < @current_max_date
What is the safe option to ensure delta load doesn't miss out on any data records.
答案1
得分: 1
如果您希望绝对保证不会错过任何记录,理论上最好的解决方案是:
- 读取数据库日志文件,而不是表格,以获取更改。然而,如果您的数据库管理系统不提供这样的机制,这可能需要使用商业CDC工具。
- 每次提取完整的表格 - 尽管如果数据量很大,这可能不切实际/具有成本效益。
实际上,最佳解决方案是使用您列出的选项1。您应该向后移动窗口的起始时间(1秒、1分钟、1小时)取决于您对系统时间同步的信心以及写入TIMESTAMP_Column的逻辑(即,在记录写入时是否将其设置为sysdate,还是在(可能运行时间很长的)事务的开始时定义该值)。
英文:
If you want an absolute guarantee that no records are missed then the best solutions, in theory, would be:
- Read the DB log files, rather than the tables, to get changes. However this is likely to require a commercial CDC tool if your DBMS doesn't provide a mechanism for doing this
- Extract the full table each time - though if the data volumes are of any significant size this is unlikely to be practicable/cost-effective
In reality, the best solution is to use the option 1 you listed. How far back you move the start of the window (1 second, 1 minute, 1 hour) depends on how confident you are that your systems are in sync time-wise and what the logic is for writing to TIMESTAMP_Column (i.e. does it get set to sysdate at the point the record is written or is the value being defined at the start of a (potentially long-running) transaction)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论