How to find duplicate records by full name and similar DOB?
Question
I'm looking for ideas on how best to find duplicate entries in the database for records that may differ only in Date of Birth (DOB). It would be great to find all duplicate entries that share the same Full Name and whose DOBs are within a year of one another.
Example:
Tom Johnson | 1990-12-01
Tom Johnson | 1991-12-01
Ted Janson | 1992-01-01
Tom Johnson | 2000-02-02
Bob Burke | 2002-06-12
I would like to identify the first two rows as potential duplicates so we can further investigate, given their DOB is so similar and their full name matches. We are assuming that some subset of entries in our data could be duplicates based on common typos and mistakes end users make.
What's the best way to identify and group these records?
Edit: I am not looking for other ways to find duplicates, I am inquiring as to whether or not there's a straightforward way to accomplish identifying records with similar birthdates as mentioned above.
It's very common to use something like this when looking for identical matches:
SELECT Name, DOB
FROM table
GROUP BY Name, DOB
HAVING COUNT(*) > 1
But this solution is not what I am looking for here. We already employ this technique.
Thanks!
Answer 1
Score: 1
> find duplicate entries in the database for records that may differ only in Date of Birth (DOB)
Here is one way to do it with a self-join:
select t1.name name1, t1.dob dob1, t2.name name2, t2.dob dob2
from mytable t1
inner join mytable t2
on t2.name = t1.name
and t2.dob > t1.dob
and t2.dob <= date_add(t1.dob, interval 1 year)
This returns one row per pair of records that have the same name and DOBs at most one year apart.
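To see what the self-join returns on the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. It is not the answer's exact query: the in-memory table and the SQLite spelling date(dob, '+1 year') are assumptions standing in for date_add(..., interval 1 year), and DOBs are stored as ISO-8601 text so that string comparison matches date order.

```python
import sqlite3

# Sample rows from the question; DOBs are ISO-8601 strings (an assumption),
# so comparing them as text gives the same order as comparing dates.
rows = [
    ("Tom Johnson", "1990-12-01"),
    ("Tom Johnson", "1991-12-01"),
    ("Ted Janson", "1992-01-01"),
    ("Tom Johnson", "2000-02-02"),
    ("Bob Burke", "2002-06-12"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (name TEXT, dob TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)

# Same shape as the self-join above; date(x, '+1 year') is SQLite's way
# of adding one year to a date.
pairs = conn.execute(
    """
    SELECT t1.name, t1.dob, t2.dob
    FROM mytable t1
    INNER JOIN mytable t2
        ON t2.name = t1.name
        AND t2.dob > t1.dob
        AND t2.dob <= date(t1.dob, '+1 year')
    ORDER BY t1.name, t1.dob
    """
).fetchall()

print(pairs)  # [('Tom Johnson', '1990-12-01', '1991-12-01')]
```

Because the join uses a strict t2.dob > t1.dob, each pair shows up once, and rows with exactly the same DOB are left to the GROUP BY query from the question.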
You could also use a window function with a range frame, which might be more efficient. This returns every row for which another row with the same name exists with a DOB within the previous or next year (actually, 365 days):
select *
from (
select t.*,
count(*) over(
partition by name
order by unix_date(dob)
range between 365 preceding and 365 following
) cnt
from mytable t
) t
where cnt > 1
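The logic of the window query can also be sketched in plain Python, which may help when checking results outside the database. This is only an illustration of the same idea, not part of the answer: the function name flag_near_duplicates and the max_days parameter are hypothetical. After sorting each name's DOBs, a row has another row within 365 days exactly when one of its immediate neighbours does.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the question.
records = [
    ("Tom Johnson", date(1990, 12, 1)),
    ("Tom Johnson", date(1991, 12, 1)),
    ("Ted Janson", date(1992, 1, 1)),
    ("Tom Johnson", date(2000, 2, 2)),
    ("Bob Burke", date(2002, 6, 12)),
]


def flag_near_duplicates(records, max_days=365):
    """Return (name, dob) rows that have another row with the same name
    and a DOB at most max_days away (hypothetical helper)."""
    by_name = defaultdict(list)
    for name, dob in records:
        by_name[name].append(dob)

    flagged = []
    for name, dobs in by_name.items():
        dobs.sort()
        for i, dob in enumerate(dobs):
            # In sorted order, the closest other DOB is always a neighbour.
            near_prev = i > 0 and (dob - dobs[i - 1]).days <= max_days
            near_next = i + 1 < len(dobs) and (dobs[i + 1] - dob).days <= max_days
            if near_prev or near_next:
                flagged.append((name, dob))
    return flagged


flagged = flag_near_duplicates(records)
print(flagged)
```

On the sample data this flags the two Tom Johnson rows from 1990 and 1991 and leaves the 2000 row alone. Like the window query, it also flags rows whose DOBs are identical.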
Answer 2
Score: -1
Table name = My_table
select full_name, count(*) as duplicate_dob
from My_table
group by full_name
having count(*) > 1