算法化的方法来合并同一联系人的不同联系电话和电子邮件

huangapple go评论74阅读模式
英文:

Algorithmic way to combine different contact number and emails for same contact

问题

我有以下的tibble,

contact <- tribble(
  ~name, ~phone, ~email,
  'John', 123, 'john_abc@gmail.com',
  'John', 456, 'john_abc@gmail.com',
  'John', 456, 'john_xyz@gmail.com',
  'John', 789, 'john_pqr@gmail.com'
)

我想要如果电话号码或电子邮件相同,将电话号码和电子邮件合并在一起,期望的输出如下,

contact_combined <- tribble(
  ~name, ~phone, ~email,
  'John', '123;456', 'john_abc@gmail.com;john_xyz@gmail.com',
  'John', '789', 'john_pqr@gmail.com'
)

我尝试首先按姓名和电话分组,然后按姓名和电子邮件分组,但这没有给我期望的结果。我陷在寻找解决这个问题的算法方法上,有人可以给我建议吗?

注意:这里不涉及列值的合并问题,问题是选择要合并的记录。

英文:

I have the following tibble,

contact <- tribble(
  ~name, ~phone, ~email,
  'John', 123, 'john_abc@gmail.com',
  'John', 456, 'john_abc@gmail.com',
  'John', 456, 'john_xyz@gmail.com',
  'John', 789, 'john_pqr@gmail.com'
)

I'd like to combine the phone numbers and emails if phone or email are the same, the desired output is the following,

contact_combined <- tribble(
  ~name, ~phone, ~email,
  'John', '123;456', 'john_abc@gmail.com;john_xyz@gmail.com',
  'John', '789', 'john_pqr@gmail.com'
)

I've tried grouping it first by name and phone and then by name and emails but it's not giving me the expected results. I'm stuck on finding an algorithmic way to solve this problem, could someone please give me an advice?

Note: The collapsing of the values in a column is not the question here. It's about selecting the records for the collapsing.

答案1

得分: 10

图表可以帮助解决这个问题。

library(igraph)

# 创建一个矩阵,告诉我们向量元素对是否相等
equal_mat <- function(x) {
  
  outer(x, x, '==')
}

m.adj <- equal_mat(contact$phone) | equal_mat(contact$email)
g <- graph_from_adjacency_matrix(m.adj, mode='undir')

t(sapply(split(contact, components(g)$membership), function(group)
  sapply(group, function(column)
    paste(sort(unique(column)), collapse=';')))) %>%
  as_tibble()

# # A tibble: 2 × 3
#   name  phone   email                                 
#   <chr> <chr>   <chr>                                 
# 1 John  123;456 john_abc@gmail.com;john_xyz@gmail.com
# 2 John  789     john_pqr@gmail.com                   

你可以将原始联系方式视为一个图,即contact中的每一行都对应一个顶点,如果两个联系方式有相同的电话号码或电子邮件,则它们之间有边相连。在你的情况下,这个图看起来像这样,plot(g):
算法化的方法来合并同一联系人的不同联系电话和电子邮件

联系方式1-3形成一个连接组件,而没有连接的联系方式4是另一个组件。最终输出中应将每个这样的组件合并为一个联系方式。

我们从一个邻接矩阵m.adj创建图,该矩阵告诉我们哪些顶点(节点)相连,图的组件使用以下代码进行识别:

components(g)$membership
[1] 1 1 1 2

这告诉我们,正如上面所见,联系方式1-3形成组件1,联系方式4是组件2。现在我们可以在每个组件内部合并值。

英文:

Graphs can help with this.

library(igraph)

# creates a matrix which tells whether pairs of vector elements are equal or not
equal_mat &lt;- function(x) {
  
  outer(x, x, &#39;==&#39;)
}

m.adj &lt;- equal_mat(contact$phone) | equal_mat(contact$email)
g &lt;- graph_from_adjacency_matrix(m.adj, mode=&#39;undir&#39;)

t(sapply(split(contact, components(g)$membership), function(group)
  sapply(group, function(column)
    paste(sort(unique(column)), collapse=&#39;;&#39;)))) %&gt;%
  as_tibble()

# # A tibble: 2 &#215; 3
#   name  phone   email                                
#   &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;                                
# 1 John  123;456 john_abc@gmail.com;john_xyz@gmail.com
# 2 John  789     john_pqr@gmail.com                   

You can think of your original contacts as a graph, i.e. a set of vertices, one for each row in contact, which are connected by edges if two contacts have the same phone number or email. In your case the graph looks like this, plot(g):
算法化的方法来合并同一联系人的不同联系电话和电子邮件

Contacts 1–3 form one connected component, while the contact number 4 which has no connections is another component. Each such component should be merged into one contact in the final output.

We create the graph from an adjacency matrix m.adj that tells which vertices (nodes) are connected and the graph components are identified using

components(g)$membership
[1] 1 1 1 2

which tells us exactly what we saw above: contacts 1–3 form component one, contact number 4 is component 2. Now we can just collapse the values within each components.

答案2

得分: 7

我认为使用igraph库会是一个不错的开始(通过它你可以使用decompose来对连接的子群进行聚类)

contact %>%
  select(c(2, 3, 1)) %>%
  graph_from_data_frame() %>%
  decompose() %>%
  lapply(function(x) {
    aggregate(
      . ~ name, get.data.frame(x),
      function(v) toString(unique(v))
    )
  }) %>%
  bind_rows() %>%
  setNames(names(contact))

这将得到以下结果:

  name    phone                                  email
1 John 123, 456 john_abc@gmail.com, john_xyz@gmail.com
2 John      789                     john_pqr@gmail.com

更加符合"tidyverse"风格的方法(感谢@akrun的评论)

contact %>%
  relocate(name, .after = last_col()) %>%
  graph_from_data_frame() %>%
  decompose() %>%
  map(~ .x %>%
    get.data.frame() %>%
    reframe(across(everything(), ~ str_c(unique(.x), collapse = ";")), .by = "name")) %>%
  list_rbind() %>%
  setNames(names(contact))
英文:

I guess igraph would be a good start (by which you can use decompose to cluster connected subgroups)

contact %&gt;%
  select(c(2, 3, 1)) %&gt;%
  graph_from_data_frame() %&gt;%
  decompose() %&gt;%
  lapply(function(x) {
    aggregate(
      . ~ name, get.data.frame(x),
      function(v) toString(unique(v))
    )
  }) %&gt;%
  bind_rows() %&gt;%
  setNames(names(contact))

which gives

  name    phone                                  email
1 John 123, 456 john_abc@gmail.com, john_xyz@gmail.com
2 John      789                     john_pqr@gmail.com

A more tidyverse way (thank @akrun's comment)

contact %&gt;%
  relocate(name, .after = last_col()) %&gt;%
  graph_from_data_frame() %&gt;%
  decompose() %&gt;%
  map(~ .x %&gt;%
    get.data.frame() %&gt;%
    reframe(across(everything(), ~ str_c(unique(.x), collapse = &quot;;&quot;)), .by = &quot;name&quot;)) %&gt;%
  list_rbind() %&gt;%
  setNames(names(contact))

答案3

得分: 4

这是一个使用data.table的方法。

setDT(contact)
# 设置键
setkey(contact, name, phone, email)
# 对每个唯一键进行自连接,在操作过程中进行筛选和汇总
ans <- contact[contact, c("phone2", "email2") := {
  temp <- contact[name == i.name & 
                  (phone %in% contact[name == i.name & email == i.email, ]$phone | 
                   email %in% contact[name == i.name & phone == i.phone, ]$email), ]
  email_temp <- paste0(unique(temp$email), collapse = ";")
  phone_temp <- paste0(unique(temp$phone), collapse = ";")
  list(phone_temp, email_temp)
}, by = .EACHI]
# 最后一步
unique(ans, by = c("name", "phone2", "email2"))[, .(name, phone = phone2, email = email2)]

解释

对于第一行,变量'temp'的计算如下:

contact[name == 'John' &
        (phone %in% contact[name == 'John' & email == 'john_abc@gmail.com', ]$phone | 
         email %in% contact[name == 'John' & phone == 123, ]$email), ]

然后,使用以下方法将唯一的电子邮件放在一个字符串中:

email_temp <- paste0(unique(temp$email), collapse = ";")

并使用相同的方法将电话号码放在一个字符串中:

phone_temp <- paste0(unique(temp$phone), collapse = ";")

然后将这两个字符串返回到"phone2"和"email2"列。

为每个唯一的键组合重复此操作(.EACHI)。

英文:

here is s data.table approach

setDT(contact)
# set keys
setkey(contact, name, phone, email)
# self join on each unique key, filter and summarise on the fly 
ans &lt;- contact[contact, c(&quot;phone2&quot;, &quot;email2&quot;) := {
  temp &lt;- contact[ name == i.name &amp; 
                     (phone %in% contact[name == i.name &amp; email == i.email, ]$phone | 
                        email %in% contact[name == i.name &amp; phone == i.phone, ]$email), ]
  email_temp &lt;- paste0(unique(temp$email), collapse = &quot;;&quot;)
  phone_temp &lt;- paste0(unique(temp$phone), collapse = &quot;;&quot;)
  list(phone_temp, email_temp)
}, by = .EACHI]
# final step
unique(ans, by = c(&quot;name&quot;, &quot;phone2&quot;, &quot;email2&quot;))[, .(name, phone = phone2, email = email2)]
#    name   phone                                 email
# 1: John 123;456 john_abc@gmail.com;john_xyz@gmail.com
# 2: John     789                    john_pqr@gmail.com

explanation

# so, for the first row, the variable &#39;temp&#39; is calculated as follows
contact[ name == &#39;John&#39; &amp;
          (phone %in% contact[name == &#39;John&#39; &amp; email == &#39;john_abc@gmail.com&#39;, ]$phone | 
           email %in% contact[name == &#39;John&#39; &amp; phone == 123, ]$email), ]
#    name phone              email
# 1: John   123 john_abc@gmail.com
# 2: John   456 john_abc@gmail.com
# 3: John   456 john_xyz@gmail.com

# then, put the unique emails together in a string using
#     email_temp &lt;- paste0(unique(temp$email), collapse = &quot;;&quot;)
# and do the same for the phones using 
#     phone_temp &lt;- paste0(unique(temp$phone), collapse = &quot;;&quot;)

# and return there two strings to the columns &quot;phone2&quot; ans &quot;email2&quot;

#repeat for each unique key-combination (.EACHI)

答案4

得分: 4

以下是代码的翻译部分:

使用 `powerjoin` 包的不同方法:

    contact <- tribble(
      ~name, ~phone, ~email,
      "John", 123, "john_abc@gmail.com",
      "John", 456, "john_abc@gmail.com",
      "John", 456, "john_xyz@gmail.com",
      "John", 789, "john_pqr@gmail.com") |> 
      mutate(row_id = row_number())


    library(powerjoin)
    library(dplyr)
    # 检查电话列中的重复条目
    phone_check <- contact |> 
      power_right_join(filter(contact, duplicated(phone)),
                       by = c("name", "phone"),
                       conflict = ~ paste(.x, .y, sep = ";")
      ) |> 
      group_by(phone) |> 
      slice(1) |> 
      tidyr::separate_rows(row_id) |> 
      ungroup() |> 
      select(name, email, row_id)
    
    
    # 检查电子邮件列中的重复条目
    email_check <- contact |> 
      power_right_join(filter(contact, duplicated(email)),
                       by = c("name", "email"),
                       conflict = ~ paste(.x, .y, sep = ";") 
      ) |> 
      group_by(email) |> 
      slice(1) |> 
      tidyr::separate_rows(row_id) |> 
      ungroup() |> 
      select(name, phone, row_id)
    
    
    email_check |> select(name, phone, row_id) |> 
      inner_join(phone_check, by = c("name", "row_id")) |> 
      bind_rows(
        contact |> 
          mutate(phone = as.character(phone), 
                 row_id = as.character(row_id)) |> 
          filter(!row_id %in% c(phone_check$row_id, email_check$row_id))
      ) |> 
      select(-row_id)


    # 一个 tibble:2 × 3
      name  phone   email                                
      <chr> <chr>   <chr>                                
    1 John  123;456 john_abc@gmail.com;john_xyz@gmail.com
    2 John  789     john_pqr@gmail.com 
英文:

A different approach using the powerjoin package:

contact &lt;- tribble(
~name, ~phone, ~email,
&quot;John&quot;, 123, &quot;john_abc@gmail.com&quot;,
&quot;John&quot;, 456, &quot;john_abc@gmail.com&quot;,
&quot;John&quot;, 456, &quot;john_xyz@gmail.com&quot;,
&quot;John&quot;, 789, &quot;john_pqr@gmail.com&quot;) |&gt; 
mutate(row_id = row_number())
library(powerjoin)
library(dplyr)
# check duplicated entries in phone column
phone_check &lt;- contact |&gt;
power_right_join(filter(contact, duplicated(phone)),
by = c(&quot;name&quot;, &quot;phone&quot;),
conflict = ~ paste(.x, .y, sep = &quot;;&quot;)
) |&gt;
group_by(phone) |&gt;
slice(1) |&gt;
tidyr::separate_rows(row_id) |&gt; 
ungroup() |&gt; 
select(name, email, row_id)
# check duplicated entries in email column
email_check &lt;- contact |&gt;
power_right_join(filter(contact, duplicated(email)),
by = c(&quot;name&quot;, &quot;email&quot;),
conflict = ~ paste(.x, .y, sep = &quot;;&quot;) 
) |&gt;
group_by(email) |&gt;
slice(1) |&gt;
tidyr::separate_rows(row_id) |&gt; 
ungroup() |&gt; 
select(name, phone, row_id)
email_check |&gt; select(name, phone, row_id) |&gt; 
inner_join(phone_check, by = c(&quot;name&quot;, &quot;row_id&quot;)) |&gt; 
bind_rows(
contact |&gt; 
mutate(phone = as.character(phone), 
row_id = as.character(row_id)) |&gt; 
filter(!row_id %in% c(phone_check$row_id, email_check$row_id))
) |&gt; 
select(-row_id)
# A tibble: 2 &#215; 3
name  phone   email                                
&lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;                                
1 John  123;456 john_abc@gmail.com;john_xyz@gmail.com
2 John  789     john_pqr@gmail.com                   

huangapple
  • 本文由 发表于 2023年2月18日 01:31:20
  • 转载请务必保留本文链接:https://go.coder-hub.com/75487491.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定