How to cluster rows in a postgresql table that match an input value or match a value from any of the other matching rows?
Question
I have a table that looks like this in my PostgreSQL database:
How can I bring back a cluster of contacts where each contact in the cluster shares either the contact_id_a or contact_id_b value (or both) with another contact in the cluster?
In the example in the screenshot image above, rows 1-6 would be in the same cluster and row 8 would belong to no cluster.
How can this be achieved using either a SQL query or a SQL query in combination with Java code?
For context, this table lists all potential duplicate contacts in a list of contacts. We want to present to the list owner all of the contacts that are potential duplicates so that the user can manually manage these duplicates.
Here is my starting code:
DuplicateCandidate firstDuplicate = db.sql("select * from duplicates where list_id = " + list_id
        + " and ignore_duplicate is not true").first(DuplicateCandidate.class);

String sql = "select * from duplicates where list_id = " + list_id + " and ignore_duplicate is not true "
        + "and (contact_id_a = ? or contact_id_b = ? or contact_id_a = ? or contact_id_b = ?)";
List<DuplicateCandidate> groupOfDuplicates = db.sql(sql, firstDuplicate.contact_id_a, firstDuplicate.contact_id_a,
        firstDuplicate.contact_id_b, firstDuplicate.contact_id_b).results(DuplicateCandidate.class);
This will bring back the first row and any other rows containing 16247096 or 16247097, but not other essential rows matching the contact_ids from the second query's results.
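In other words, what seems to be missing is a transitive expansion that keeps following newly discovered contact ids until nothing new turns up. Here is a rough, untested sketch of what I have in mind, reusing the db helper shown above and assuming the DuplicateCandidate id fields are Long:

import java.util.*;

// Untested sketch: expand outward from one duplicate row until no new
// contact ids appear. Assumes the db helper and DuplicateCandidate
// (with Long contact_id_a / contact_id_b fields) shown above.
List<DuplicateCandidate> findCluster(long listId, DuplicateCandidate seed) {
    Set<Long> seenIds = new LinkedHashSet<>(Arrays.asList(seed.contact_id_a, seed.contact_id_b));
    Deque<Long> toVisit = new ArrayDeque<>(seenIds);
    Map<String, DuplicateCandidate> cluster = new LinkedHashMap<>();

    while (!toVisit.isEmpty()) {
        Long id = toVisit.poll();
        List<DuplicateCandidate> matches = db.sql(
                "select * from duplicates where list_id = ? and ignore_duplicate is not true"
                        + " and (contact_id_a = ? or contact_id_b = ?)",
                listId, id, id).results(DuplicateCandidate.class);
        for (DuplicateCandidate d : matches) {
            cluster.put(d.contact_id_a + ":" + d.contact_id_b, d);        // keep each row once
            if (seenIds.add(d.contact_id_a)) toVisit.add(d.contact_id_a); // new id -> follow it
            if (seenIds.add(d.contact_id_b)) toVisit.add(d.contact_id_b);
        }
    }
    return new ArrayList<>(cluster.values());
}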
Cheers.
Answer 1
Score: 2
You can use a recursive CTE. This walks the graph and then assigns the minimum identifier in the graph for each row. Note that your data does not have a unique identifier for each row so this starts by generating one:
with recursive d as (
    -- the source rows have no unique key, so generate one per row
    select row_number() over (order by contact_id_a, contact_id_b) as id, d.*
    from duplicates d
),
cte (id, contact_id_a, contact_id_b, min_id, ids, lev) as (
    select id, contact_id_a, contact_id_b, id as min_id, array[id] as ids, 1 as lev
    from d
    union all
    -- walk to any row that shares contact_id_a or contact_id_b,
    -- carrying the smallest id seen so far and the list of visited ids
    select d.id, d.contact_id_a, d.contact_id_b, least(d.id, cte.min_id), ids || d.id, lev + 1
    from cte join
         d
         on cte.contact_id_a = d.contact_id_a or cte.contact_id_b = d.contact_id_b
    where d.id <> ALL (cte.ids)
)
select distinct on (id) cte.*
from cte
order by id, min_id;
The min_id column contains the grouping you want.
Here is a db<>fiddle illustrating the code.
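If the grouping then needs to be consumed from Java (as the question suggests), a minimal sketch over plain JDBC could look like the following; jdbcUrl is a placeholder and recursiveCteSql is assumed to hold the query above verbatim:

import java.sql.*;
import java.util.*;

// Minimal JDBC sketch (assumptions: jdbcUrl is a placeholder connection
// string, and recursiveCteSql holds the recursive query above verbatim).
static Map<Long, List<long[]>> loadClusters(String jdbcUrl, String recursiveCteSql) throws SQLException {
    Map<Long, List<long[]>> clusters = new LinkedHashMap<>();
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(recursiveCteSql)) {
        while (rs.next()) {
            // rows sharing the same min_id belong to the same cluster
            clusters.computeIfAbsent(rs.getLong("min_id"), k -> new ArrayList<>())
                    .add(new long[] { rs.getLong("contact_id_a"), rs.getLong("contact_id_b") });
        }
    }
    return clusters;
}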
Answer 2
Score: 0
Clustering like this is an iterative process with an unknown number of steps. I have never found a solution that can be done within a recursive query.
I have not worked on CRM in over six years, but the following function is similar to how we used to generate match groups. Doing this row-by-row did not perform well enough for our workload, and accomplishing it in a host language using e.g. Java HashMap() and HashSet() plus inverted indexing creates very messy code.
Assuming this schema:
\d contact_info
Table "public.contact_info"
Column | Type | Collation | Nullable | Default
------------------+---------+-----------+----------+---------
contact_id_a | bigint | | |
contact_id_b | bigint | | |
ignore_duplicate | boolean | | | false
list_id | integer | | | 496
select * from contact_info ;
contact_id_a | contact_id_b | ignore_duplicate | list_id
--------------+--------------+------------------+---------
16247096 | 16247097 | f | 496
16247096 | 16247098 | f | 496
16247096 | 16247099 | f | 496
16247097 | 16247098 | f | 496
16247097 | 16247099 | f | 496
16247098 | 16247099 | f | 496
16247094 | 16247095 | f | 496
(7 rows)
This function creates two temp tables to hold intermediate clusters and then returns the result once there is no more clustering possible.
create or replace function cluster_contact()
returns table (clust_id bigint, contact_id bigint)
language plpgsql as $$
declare
  last_count bigint := 1;
  this_count bigint := 0;
begin
  -- working table holding the current cluster assignment per contact
  create temp table contact_match (clust_id bigint, contact_id bigint) on commit drop;
  create index cm_1 on contact_match (contact_id, clust_id);
  create index cm_2 on contact_match using hash (clust_id);
  -- scratch table used to de-duplicate contact_match between iterations
  create temp table contact_hold (clust_id bigint, contact_id bigint) on commit drop;
  -- seed the working table from the normalized pairs, including each
  -- contact's relationship to itself
  with dedup as (
    select distinct least(ci.contact_id_a, ci.contact_id_b) as clust_id,
                    greatest(ci.contact_id_a, ci.contact_id_b) as contact_id
      from contact_info ci
     where not ci.ignore_duplicate
  )
  insert into contact_match
  select d.clust_id, d.clust_id from dedup d
  union
  select d.clust_id, d.contact_id from dedup d;
  -- keep relabeling until the number of distinct clusters stops shrinking
  while last_count > this_count loop
    if this_count = 0 then
      select count(distinct cm.clust_id) into last_count from contact_match cm;
    else
      last_count := this_count;
    end if;

    -- for every contact, adopt the smallest cluster id it is linked to
    with new_cid as (
      select cm.contact_id as clust_id_old,
             min(cm.clust_id) as clust_id_new
        from contact_match cm
       group by cm.contact_id
    )
    update contact_match
       set clust_id = nc.clust_id_new
      from new_cid nc
     where contact_match.clust_id = nc.clust_id_old;

    -- collapse duplicate rows produced by the relabeling
    truncate table contact_hold;
    insert into contact_hold
    select distinct * from contact_match;
    truncate table contact_match;
    insert into contact_match
    select * from contact_hold;

    select count(distinct cm.clust_id) into this_count from contact_match cm;
  end loop;

  return query select * from contact_match order by clust_id, contact_id;
end $$;
One of the biggest mental blocks I have seen developers face is neglecting to include the relationship of a contact_id to itself. This leads to disjoint handling and a mental model needlessly complicated by a left side and a right side.
select * from cluster_contact();
clust_id | contact_id
----------+------------
16247094 | 16247094
16247094 | 16247095
16247096 | 16247096
16247096 | 16247097
16247096 | 16247098
16247096 | 16247099
(6 rows)
Please comment if you need clarification on any of the steps in this solution or if it does not work for you.
Also, know that Levenshtein is available in the fuzzystrmatch extension, and it works well.
If you would rather have sequential clust_id values starting at 1, change the return query in the function to this:
return query
select dense_rank() over (order by cm.clust_id) as clust_id,
cm.contact_id
from contact_match cm
order by clust_id, contact_id;
It will yield:
select * from cluster_contact();
clust_id | contact_id
----------+------------
1 | 16247094
1 | 16247095
2 | 16247096
2 | 16247097
2 | 16247098
2 | 16247099
(6 rows)
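Since the question also asks about combining SQL with Java code, here is a minimal, hedged JDBC sketch of calling the function and grouping its output per cluster; the connection details are placeholders:

import java.sql.*;
import java.util.*;

// Sketch only: connection details are placeholders. Calls the function above
// and groups its output into clust_id -> list of contact_ids.
static Map<Long, List<Long>> fetchClusters(String jdbcUrl, String user, String password) throws SQLException {
    Map<Long, List<Long>> clusters = new LinkedHashMap<>();
    try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
         PreparedStatement ps = conn.prepareStatement("select clust_id, contact_id from cluster_contact()");
         ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            clusters.computeIfAbsent(rs.getLong("clust_id"), k -> new ArrayList<>())
                    .add(rs.getLong("contact_id"));
        }
    }
    return clusters;
}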