How to cluster rows in a postgresql table that match an input value or match a value from any of the other matching rows?
Question
I have a table that looks like this in my PostgreSQL database:
How can I bring back a cluster of contacts where each contact in the cluster shares either the contact_id_a or contact_id_b value (or both) with another contact in the cluster?
In the example in the screenshot image above, rows 1-6 would be in the same cluster and row 8 would belong to no cluster.
How can this be achieved using either a SQL query or a SQL query in combination with Java code?
For context, this table lists all potential duplicate contacts in a list of contacts. We want to present to the list owner all of the contacts that are potential duplicates so that the user can manually manage these duplicates.
Here is my starting code:
DuplicateCandidate firstDuplicate = db.sql("select * from duplicates where list_id = " + list_id
        + " and ignore_duplicate is not true").first(DuplicateCandidate.class);

String sql = "select * from duplicates where list_id = " + list_id + " and ignore_duplicate is not true "
        + "and (contact_id_a = ? or contact_id_b = ? or contact_id_a = ? or contact_id_b = ?)";
List<DuplicateCandidate> groupOfDuplicates = db.sql(sql, firstDuplicate.contact_id_a, firstDuplicate.contact_id_a,
        firstDuplicate.contact_id_b, firstDuplicate.contact_id_b).results(DuplicateCandidate.class);
This will bring back the first row and any other rows containing 16247096 or 16247097, but not other essential rows matching the contact_ids from the second query's results.
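In other words, what seems to be missing is a transitive expansion that keeps following newly discovered contact ids until nothing new turns up. Here is a rough, untested sketch of what I have in mind, reusing the db helper shown above and assuming the DuplicateCandidate id fields are Long:

import java.util.*;

// Untested sketch: expand outward from one duplicate row until no new
// contact ids appear. Assumes the db helper and DuplicateCandidate
// (with Long contact_id_a / contact_id_b fields) shown above.
List<DuplicateCandidate> findCluster(long listId, DuplicateCandidate seed) {
    Set<Long> seenIds = new LinkedHashSet<>(Arrays.asList(seed.contact_id_a, seed.contact_id_b));
    Deque<Long> toVisit = new ArrayDeque<>(seenIds);
    Map<String, DuplicateCandidate> cluster = new LinkedHashMap<>();

    while (!toVisit.isEmpty()) {
        Long id = toVisit.poll();
        List<DuplicateCandidate> matches = db.sql(
                "select * from duplicates where list_id = ? and ignore_duplicate is not true"
                        + " and (contact_id_a = ? or contact_id_b = ?)",
                listId, id, id).results(DuplicateCandidate.class);
        for (DuplicateCandidate d : matches) {
            cluster.put(d.contact_id_a + ":" + d.contact_id_b, d);        // keep each row once
            if (seenIds.add(d.contact_id_a)) toVisit.add(d.contact_id_a); // new id -> follow it
            if (seenIds.add(d.contact_id_b)) toVisit.add(d.contact_id_b);
        }
    }
    return new ArrayList<>(cluster.values());
}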
Cheers.
Answer 1
Score: 2
You can use a recursive CTE. This walks the graph and then assigns the minimum identifier in the graph for each row. Note that your data does not have a unique identifier for each row so this starts by generating one:
with recursive d as (
    -- the source rows have no unique key, so generate one per row
    select row_number() over (order by contact_id_a, contact_id_b) as id, d.*
    from duplicates d
),
cte (id, contact_id_a, contact_id_b, min_id, ids, lev) as (
    select id, contact_id_a, contact_id_b, id as min_id, array[id] as ids, 1 as lev
    from d
    union all
    -- walk to any row that shares contact_id_a or contact_id_b,
    -- carrying the smallest id seen so far and the list of visited ids
    select d.id, d.contact_id_a, d.contact_id_b, least(d.id, cte.min_id), ids || d.id, lev + 1
    from cte join
         d
         on cte.contact_id_a = d.contact_id_a or cte.contact_id_b = d.contact_id_b
    where d.id <> ALL (cte.ids)
)
select distinct on (id) cte.*
from cte
order by id, min_id;
The min_id column contains the grouping you want.
Here is a db<>fiddle illustrating the code.
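If the grouping then needs to be consumed from Java (as the question suggests), a minimal sketch over plain JDBC could look like the following; jdbcUrl is a placeholder and recursiveCteSql is assumed to hold the query above verbatim:

import java.sql.*;
import java.util.*;

// Minimal JDBC sketch (assumptions: jdbcUrl is a placeholder connection
// string, and recursiveCteSql holds the recursive query above verbatim).
static Map<Long, List<long[]>> loadClusters(String jdbcUrl, String recursiveCteSql) throws SQLException {
    Map<Long, List<long[]>> clusters = new LinkedHashMap<>();
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(recursiveCteSql)) {
        while (rs.next()) {
            // rows sharing the same min_id belong to the same cluster
            clusters.computeIfAbsent(rs.getLong("min_id"), k -> new ArrayList<>())
                    .add(new long[] { rs.getLong("contact_id_a"), rs.getLong("contact_id_b") });
        }
    }
    return clusters;
}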
Answer 2
Score: 0
Clustering like this is an iterative process with an unknown number of steps. I have never found a solution that can be done within a recursive query.
I have not worked on CRM in over six years, but the following function is similar to how we used to generate match groups. Doing this row-by-row did not perform well enough for our workload, and accomplishing it in a host language using e.g. Java HashMap() and HashSet() plus inverted indexing creates very messy code.
Assuming this schema:
\d contact_info
Table "public.contact_info"
Column | Type | Collation | Nullable | Default
------------------+---------+-----------+----------+---------
contact_id_a | bigint | | |
contact_id_b | bigint | | |
ignore_duplicate | boolean | | | false
list_id | integer | | | 496
select * from contact_info ;
contact_id_a | contact_id_b | ignore_duplicate | list_id
--------------+--------------+------------------+---------
16247096 | 16247097 | f | 496
16247096 | 16247098 | f | 496
16247096 | 16247099 | f | 496
16247097 | 16247098 | f | 496
16247097 | 16247099 | f | 496
16247098 | 16247099 | f | 496
16247094 | 16247095 | f | 496
(7 rows)
This function creates two temp tables to hold intermediate clusters and then returns the result once there is no more clustering possible.
create or replace function cluster_contact()
returns table (clust_id bigint, contact_id bigint)
language plpgsql as $$
declare
  last_count bigint := 1;
  this_count bigint := 0;
begin
  -- working table holding the current cluster assignment per contact
  create temp table contact_match (clust_id bigint, contact_id bigint) on commit drop;
  create index cm_1 on contact_match (contact_id, clust_id);
  create index cm_2 on contact_match using hash (clust_id);
  -- scratch table used to de-duplicate contact_match between iterations
  create temp table contact_hold (clust_id bigint, contact_id bigint) on commit drop;
  -- seed the working table from the normalized pairs, including each
  -- contact's relationship to itself
  with dedup as (
    select distinct least(ci.contact_id_a, ci.contact_id_b) as clust_id,
                    greatest(ci.contact_id_a, ci.contact_id_b) as contact_id
      from contact_info ci
     where not ci.ignore_duplicate
  )
  insert into contact_match
  select d.clust_id, d.clust_id from dedup d
  union
  select d.clust_id, d.contact_id from dedup d;
  -- keep relabeling until the number of distinct clusters stops shrinking
  while last_count > this_count loop
    if this_count = 0 then
      select count(distinct cm.clust_id) into last_count from contact_match cm;
    else
      last_count := this_count;
    end if;

    -- for every contact, adopt the smallest cluster id it is linked to
    with new_cid as (
      select cm.contact_id as clust_id_old,
             min(cm.clust_id) as clust_id_new
        from contact_match cm
       group by cm.contact_id
    )
    update contact_match
       set clust_id = nc.clust_id_new
      from new_cid nc
     where contact_match.clust_id = nc.clust_id_old;

    -- collapse duplicate rows produced by the relabeling
    truncate table contact_hold;
    insert into contact_hold
    select distinct * from contact_match;
    truncate table contact_match;
    insert into contact_match
    select * from contact_hold;

    select count(distinct cm.clust_id) into this_count from contact_match cm;
  end loop;

  return query select * from contact_match order by clust_id, contact_id;
end $$;
One of the biggest mental blocks I have seen developers face is neglecting to include the relationship of a contact_id to itself. This leads to disjoint handling and a mental model needlessly complicated by a left side and a right side.
select * from cluster_contact();
clust_id | contact_id
----------+------------
16247094 | 16247094
16247094 | 16247095
16247096 | 16247096
16247096 | 16247097
16247096 | 16247098
16247096 | 16247099
(6 rows)
Please comment if you need clarification on any of the steps in this solution or if it does not work for you.
Also, know that Levenshtein is available in the fuzzystrmatch extension, and it works well.
If you would rather have sequential clust_id values starting at 1, change the return query in the function to this:
return query
select dense_rank() over (order by cm.clust_id) as clust_id,
cm.contact_id
from contact_match cm
order by clust_id, contact_id;
It will yield:
select * from cluster_contact();
clust_id | contact_id
----------+------------
1 | 16247094
1 | 16247095
2 | 16247096
2 | 16247097
2 | 16247098
2 | 16247099
(6 rows)
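Since the question also asks about combining SQL with Java code, here is a minimal, hedged JDBC sketch of calling the function and grouping its output per cluster; the connection details are placeholders:

import java.sql.*;
import java.util.*;

// Sketch only: connection details are placeholders. Calls the function above
// and groups its output into clust_id -> list of contact_ids.
static Map<Long, List<Long>> fetchClusters(String jdbcUrl, String user, String password) throws SQLException {
    Map<Long, List<Long>> clusters = new LinkedHashMap<>();
    try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
         PreparedStatement ps = conn.prepareStatement("select clust_id, contact_id from cluster_contact()");
         ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            clusters.computeIfAbsent(rs.getLong("clust_id"), k -> new ArrayList<>())
                    .add(rs.getLong("contact_id"));
        }
    }
    return clusters;
}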