Posted: 2023-07-27 18:28:57

What is the appropriate logic to load delta values when using > and <=?

Question

I am writing an ETL for initial full load and subsequent delta loads.

In the watermark table I store the table name and a date. At each ETL run I load that date into a variable (@last_run).

From the source table I select the max datetime value into a SQL variable (@current_max_date).

Both are of the datetime data type.

I use the following logic to load delta values:

INSERT INTO SINK_TBL
SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column > @last_run AND TIMESTAMP_Column <= @current_max_date;
INSERT INTO watermark (tablename, dt) VALUES ('table_name', @current_max_date);

This works fine, but the more I think about it, the more I wonder whether an insert could occur in the split second while the delta data (the query above) is being fetched, so that the query misses that record.

The alternate options I am thinking of are:

1. Subtract 1 second from @current_max_date
2. Or change the query to SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column >= @last_run AND TIMESTAMP_Column < @current_max_date

Which option is safe, in the sense that the delta load never misses a data record?

Answer 1

Score: 1

If you want an absolute guarantee that no records are missed, then the best solutions, in theory, would be:

Read the DB log files, rather than the tables, to get changes. However, this is likely to require a commercial CDC tool if your DBMS doesn't provide a mechanism for doing this.
Extract the full table each time, though if the data volumes are of any significant size this is unlikely to be practicable or cost-effective.

In reality, the best solution is to use option 1 from your list. How far back you move the start of the window (1 second, 1 minute, 1 hour) depends on how confident you are that your systems are in sync time-wise, and on the logic that writes TIMESTAMP_Column: is it set to sysdate at the point the record is written, or is the value fixed at the start of a (potentially long-running) transaction?
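
The overlapping-window idea can be sketched as follows. This is a minimal illustration, not the poster's actual ETL: it uses Python's built-in sqlite3 as a stand-in database, and the table names (source, sink), the columns (id, ts), the 60-second OVERLAP and the delta_load helper are all assumptions made for the example. It also assumes the sink has a unique key, so that re-reading the overlap does not create duplicates.

```python
import sqlite3
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
OVERLAP = timedelta(seconds=60)  # assumed safety margin; tune it for your system


def delta_load(conn, last_run, overlap=OVERLAP):
    """Copy rows newer than (last_run - overlap) into sink; return the new watermark."""
    cur = conn.cursor()
    # Read the upper bound once so the load and the stored watermark agree.
    (current_max,) = cur.execute("SELECT MAX(ts) FROM source").fetchone()
    if current_max is None:
        return last_run  # empty source: keep the old watermark
    window_start = (datetime.strptime(last_run, FMT) - overlap).strftime(FMT)
    # Rows in the overlap are read again; the primary key on sink.id makes
    # INSERT OR REPLACE overwrite them instead of duplicating them.
    cur.execute(
        "INSERT OR REPLACE INTO sink (id, ts) "
        "SELECT id, ts FROM source WHERE ts > ? AND ts <= ?",
        (window_start, current_max),
    )
    conn.commit()
    return current_max


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (id INTEGER PRIMARY KEY, ts TEXT);
    CREATE TABLE sink (id INTEGER PRIMARY KEY, ts TEXT);
    """
)
conn.execute("INSERT INTO source VALUES (1, '2023-07-27 10:00:00')")
conn.execute("INSERT INTO source VALUES (2, '2023-07-27 10:00:30')")
watermark = delta_load(conn, "2023-07-27 09:00:00")

# A row stamped before the watermark but committed only after the first load,
# e.g. because its timestamp was taken at the start of a long transaction.
conn.execute("INSERT INTO source VALUES (3, '2023-07-27 10:00:10')")
watermark = delta_load(conn, watermark)

loaded_ids = [row[0] for row in conn.execute("SELECT id FROM sink ORDER BY id")]
print(loaded_ids, watermark)
```

With a plain `ts > @last_run` lower bound the second run would skip row 3, because its timestamp is older than the stored watermark. The overlap picks it up at the cost of re-reading a few rows. In another DBMS, a MERGE or a delete-then-insert on the key could probably play the role of INSERT OR REPLACE.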
