Efficient way to compare millions of rows between 2 tables that have the same row count and data types in SQL Server
Question
I have two tables with the same column names and data types, and even the same number of rows. They are in the same database, and each holds millions of rows. Is there an efficient way to compare the data between these two tables?
When I tested with a small subset of the data, I noticed differences in the values within the columns. For example, field1 in table 1 stores NULL while the same field1 in table 2 stores a blank value. I need to find differences like these, that is, find exactly what differs between the two tables so that we can fix it.
I tried the EXCEPT and UNION ALL operators, but the query takes forever to execute.
Answer 1
Score: 1
I posted a comment, but I figured an answer would be good too.
EXCEPT is the operator you can use to find differences.
SELECT * FROM YourTable1
EXCEPT
SELECT * FROM YourTable2
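This query returns the distinct rows from the first query that have no exact match in the second query. To see the rows that exist only in YourTable2, run it again with the two SELECT statements swapped.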
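To make the direction of EXCEPT concrete, here is a minimal sketch that runs it both ways on a few sample rows and tags each result with the side it came from. The table variables @Table1 and @Table2, their columns, and the sample values are assumptions made up for illustration; they stand in for the two real tables.

```sql
-- Hypothetical sample data; names and values are illustrative only.
DECLARE @Table1 TABLE (Id int PRIMARY KEY, field1 varchar(20) NULL);
DECLARE @Table2 TABLE (Id int PRIMARY KEY, field1 varchar(20) NULL);

INSERT @Table1 (Id, field1) VALUES (1, 'same'), (2, NULL), (3, NULL);
INSERT @Table2 (Id, field1) VALUES (1, 'same'), (2, ''),   (3, NULL);

-- Rows found only in @Table1, followed by rows found only in @Table2,
-- each tagged with the table it came from.
SELECT 'Table1 only' AS [source], d.Id, d.field1
FROM (SELECT Id, field1 FROM @Table1
      EXCEPT
      SELECT Id, field1 FROM @Table2) AS d
UNION ALL
SELECT 'Table2 only' AS [source], d.Id, d.field1
FROM (SELECT Id, field1 FROM @Table2
      EXCEPT
      SELECT Id, field1 FROM @Table1) AS d
ORDER BY Id, [source];
```

Row 2 should come back once per side, because NULL and an empty string are different values. Row 3 should not come back at all, because EXCEPT treats two NULLs as equal when comparing rows.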
Answer 2
Score: 0
I believe @Doug-Coats' answer will probably suit your use case as I understand it, but your scenario could be more complex.
Consider that the approach can fail where identity columns are used and the same data exists in each table but in a row with a different ID.
Notice how, in an example where the only word in one table but not both is "an", EXCEPT still brings other rows back because the ID column (which I defined as the PK and an identity) differs. You have the same fundamental data, just at a different position in one table relative to the other, and once the ID and/or identity column is included in the comparison, EXCEPT treats those rows as different.
So if your use case has identity/ID columns and the data is in a different order between the tables, EXCEPT may not give the results you are looking for.
There are two ways around this. The first is to SELECT your table data without the PK and/or identity column when using EXCEPT, but this might cost you the ability to trace back to your source data.
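The effect of the identity column can be sketched with a small made-up data set. The table variables, the Word column, and the sample words below are assumptions chosen to mirror the "an" scenario described above, not the original poster's data.

```sql
-- Hypothetical sample data: same words, inserted in a different order,
-- so the identity values do not line up between the two tables.
DECLARE @Table1 TABLE (Id int IDENTITY(1,1) PRIMARY KEY, Word varchar(20) NULL);
DECLARE @Table2 TABLE (Id int IDENTITY(1,1) PRIMARY KEY, Word varchar(20) NULL);

INSERT @Table1 (Word) VALUES ('this'), ('is'), ('an'), ('example');
INSERT @Table2 (Word) VALUES ('example'), ('this'), ('is');

-- Id included: rows are reported as different merely because
-- the same word sits at a different Id in the other table.
SELECT Id, Word FROM @Table1
EXCEPT
SELECT Id, Word FROM @Table2;

-- Id excluded: only the content is compared.
SELECT Word FROM @Table1
EXCEPT
SELECT Word FROM @Table2;
```

The first query should return all four rows of @Table1, while the second should return only 'an'. Keep in mind that EXCEPT returns distinct rows, so once the key is dropped, a row that occurs twice in one table and once in the other is not reported.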
For the other...
You should consider what makes your data unique from your point of view. Use the columns that are not a PK/identity and that could be used to generate a checksum. These are the columns you would manually inspect (if you really had to) to determine whether rows contain different content, regardless of the PK/identity value on the row.
You can compute a checksum value using BINARY_CHECKSUM(<column-list>) AS [chksum].
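As a rough illustration of a checksum over the non-key columns only, the sketch below uses a made-up @Rows table variable; the column names and values are assumptions, not part of the original schema.

```sql
-- Hypothetical sample data; Id plays the role of the PK/identity column.
DECLARE @Rows TABLE (Id int PRIMARY KEY, FirstName varchar(20) NULL, City varchar(20) NULL);

INSERT @Rows (Id, FirstName, City)
VALUES (1, 'Ann', 'Oslo'),
       (7, 'Ann', 'Oslo'),
       (8, 'Ann', 'Rome');

-- Id is deliberately left out of the checksum's column list,
-- so rows with the same content get the same value whatever their Id.
SELECT Id, FirstName, City,
       BINARY_CHECKSUM(FirstName, City) AS [chksum]
FROM @Rows;
```

Rows 1 and 7 get the same checksum because their non-key columns match. Row 8 will usually get a different one, but BINARY_CHECKSUM returns a 32-bit int, so two different rows can occasionally collide. It is safer to treat a matching checksum as a hint rather than proof of equality, and to check on a small sample that the specific differences you care about (such as NULL versus blank) actually change the value.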
You can SELECT the data together with its checksum into a suitable table. You are then looking for checksums that exist in Table_1 but not in Table_2 and vice versa, along with a way to identify which table each row came from when the result is output.
I personally prefer stored procedures for such things. The procedure would declare two table variables that match your table schemas, plus an additional 'chksum' column for the checksum. I would then create a third table variable with a 'source' column to hold the results.
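The three table variables described above might be declared along these lines. The columns Id, Col1, and Col2 and the index names are placeholders I have assumed for illustration; the real declarations need to mirror the columns of dbo.Table_1 and dbo.Table_2. The inline INDEX clause assumes SQL Server 2014 or later.

```sql
-- Hypothetical schema: replace Id, Col1, Col2 with the real column list.
DECLARE @Table1WithChksum TABLE (
    Id       int         NOT NULL,
    Col1     varchar(50) NULL,
    Col2     int         NULL,
    [chksum] int         NULL,
    INDEX IX_T1_chksum NONCLUSTERED ([chksum])  -- non-unique index for the JOIN
);

DECLARE @Table2WithChksum TABLE (
    Id       int         NOT NULL,
    Col1     varchar(50) NULL,
    Col2     int         NULL,
    [chksum] int         NULL,
    INDEX IX_T2_chksum NONCLUSTERED ([chksum])
);

-- Same columns again, plus a label naming the table each row came from.
DECLARE @FilteredResults TABLE (
    Id       int         NOT NULL,
    Col1     varchar(50) NULL,
    Col2     int         NULL,
    [chksum] int         NULL,
    [source] varchar(30) NOT NULL
);
```

The column order matters here: the statements that fill these variables insert without a column list, so the source columns, then [chksum], then [source] must line up with the SELECT lists.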
I would then do something like this:
-- Load each table together with a checksum over its non-key columns
INSERT @Table1WithChksum
SELECT *, BINARY_CHECKSUM(<column-list>) AS [chksum] FROM dbo.Table_1; -- replace * and <column-list> with your actual column list
INSERT @Table2WithChksum
SELECT *, BINARY_CHECKSUM(<column-list>) AS [chksum] FROM dbo.Table_2; -- replace * and <column-list> with your actual column list
-- Rows whose checksum exists in Table_1 but not in Table_2
INSERT @FilteredResults
SELECT t1.*, 'In Table_1, not Table_2' AS [source] -- t1.* rather than *, so only Table_1's columns are inserted
FROM @Table1WithChksum t1
LEFT JOIN @Table2WithChksum t2
ON t1.[chksum] = t2.[chksum]
WHERE t2.[chksum] IS NULL;
-- Rows whose checksum exists in Table_2 but not in Table_1
INSERT @FilteredResults
SELECT t2.*, 'In Table_2, not Table_1' AS [source] -- t2.* rather than *, so only Table_2's columns are inserted
FROM @Table2WithChksum t2
LEFT JOIN @Table1WithChksum t1
ON t2.[chksum] = t1.[chksum]
WHERE t1.[chksum] IS NULL;
-- then do a final select
SELECT * FROM @FilteredResults; -- replace * with your column list
You then know which 'unique' data is in either table but not both, regardless of differing PK or identity values, or of row order within the table. Rows whose 'source' is 'In Table_1, not Table_2' hold values found only in Table_1, and vice versa.
If checksums in a table are unlikely to be duplicated, it is probably best to give the [chksum] column of each table variable a non-unique index to speed up the JOIN queries.