英文:
Algorithmic way to combine different contact number and emails for same contact
问题
我有以下的tibble,
contact <- tribble(
~name, ~phone, ~email,
'John', 123, 'john_abc@gmail.com',
'John', 456, 'john_abc@gmail.com',
'John', 456, 'john_xyz@gmail.com',
'John', 789, 'john_pqr@gmail.com'
)
我想要如果电话号码或电子邮件相同,将电话号码和电子邮件合并在一起,期望的输出如下,
contact_combined <- tribble(
~name, ~phone, ~email,
'John', '123;456', 'john_abc@gmail.com;john_xyz@gmail.com',
'John', '789', 'john_pqr@gmail.com'
)
我尝试首先按姓名和电话分组,然后按姓名和电子邮件分组,但这没有给我期望的结果。我陷在寻找解决这个问题的算法方法上,有人可以给我建议吗?
注意:这里不涉及列值的合并问题,问题是选择要合并的记录。
英文:
I have the following tibble,
contact <- tribble(
~name, ~phone, ~email,
'John', 123, 'john_abc@gmail.com',
'John', 456, 'john_abc@gmail.com',
'John', 456, 'john_xyz@gmail.com',
'John', 789, 'john_pqr@gmail.com'
)
I'd like to combine the phone numbers and emails if phone or email are the same, the desired output is the following,
contact_combined <- tribble(
~name, ~phone, ~email,
'John', '123;456', 'john_abc@gmail.com;john_xyz@gmail.com',
'John', '789', 'john_pqr@gmail.com'
)
I've tried grouping it first by name and phone and then by name and emails but it's not giving me the expected results. I'm stuck on finding an algorithmic way to solve this problem, could someone please give me an advice?
Note: The collapsing of the values in a column is not the question here. It's about selecting the records for the collapsing.
答案1
得分: 10
图表可以帮助解决这个问题。
library(igraph)
# 创建一个矩阵,告诉我们向量元素对是否相等
equal_mat <- function(x) {
outer(x, x, '==')
}
m.adj <- equal_mat(contact$phone) | equal_mat(contact$email)
g <- graph_from_adjacency_matrix(m.adj, mode='undir')
t(sapply(split(contact, components(g)$membership), function(group)
sapply(group, function(column)
paste(sort(unique(column)), collapse=';')))) %>%
as_tibble()
# # A tibble: 2 × 3
# name phone email
# <chr> <chr> <chr>
# 1 John 123;456 john_abc@gmail.com;john_xyz@gmail.com
# 2 John 789 john_pqr@gmail.com
你可以将原始联系方式视为一个图,即contact
中的每一行都对应一个顶点,如果两个联系方式有相同的电话号码或电子邮件,则它们之间有边相连。在你的情况下,这个图看起来像这样,plot(g)
:
联系方式1-3形成一个连接组件,而没有连接的联系方式4是另一个组件。最终输出中应将每个这样的组件合并为一个联系方式。
我们从一个邻接矩阵m.adj
创建图,该矩阵告诉我们哪些顶点(节点)相连,图的组件使用以下代码进行识别:
components(g)$membership
[1] 1 1 1 2
这告诉我们,正如上面所见,联系方式1-3形成组件1,联系方式4是组件2。现在我们可以在每个组件内部合并值。
英文:
Graphs can help with this.
library(igraph)
# creates a matrix which tells whether pairs of vector elements are equal or not
equal_mat <- function(x) {
outer(x, x, '==')
}
m.adj <- equal_mat(contact$phone) | equal_mat(contact$email)
g <- graph_from_adjacency_matrix(m.adj, mode='undir')
t(sapply(split(contact, components(g)$membership), function(group)
sapply(group, function(column)
paste(sort(unique(column)), collapse=';')))) %>%
as_tibble()
# # A tibble: 2 × 3
# name phone email
# <chr> <chr> <chr>
# 1 John 123;456 john_abc@gmail.com;john_xyz@gmail.com
# 2 John 789 john_pqr@gmail.com
You can think of your original contacts as a graph, i.e. a set of vertices, one for each row in contact
, which are connected by edges if two contacts have the same phone number or email. In your case the graph looks like this, plot(g)
:
Contacts 1–3 form one connected component, while the contact number 4 which has no connections is another component. Each such component should be merged into one contact in the final output.
We create the graph from an adjacency matrix m.adj
that tells which vertices (nodes) are connected and the graph components are identified using
components(g)$membership
[1] 1 1 1 2
which tells us exactly what we saw above: contacts 1–3 form component one, contact number 4 is component 2. Now we can just collapse the values within each components.
答案2
得分: 7
我认为使用igraph
库会是一个不错的开始(通过它你可以使用decompose
来对连接的子群进行聚类)
contact %>%
select(c(2, 3, 1)) %>%
graph_from_data_frame() %>%
decompose() %>%
lapply(function(x) {
aggregate(
. ~ name, get.data.frame(x),
function(v) toString(unique(v))
)
}) %>%
bind_rows() %>%
setNames(names(contact))
这将得到以下结果:
name phone email
1 John 123, 456 john_abc@gmail.com, john_xyz@gmail.com
2 John 789 john_pqr@gmail.com
更加符合"tidyverse"风格的方法(感谢@akrun的评论)
contact %>%
relocate(name, .after = last_col()) %>%
graph_from_data_frame() %>%
decompose() %>%
map(~ .x %>%
get.data.frame() %>%
reframe(across(everything(), ~ str_c(unique(.x), collapse = ";")), .by = "name")) %>%
list_rbind() %>%
setNames(names(contact))
英文:
I guess igraph
would be a good start (by which you can use decompose
to cluster connected subgroups)
contact %>%
select(c(2, 3, 1)) %>%
graph_from_data_frame() %>%
decompose() %>%
lapply(function(x) {
aggregate(
. ~ name, get.data.frame(x),
function(v) toString(unique(v))
)
}) %>%
bind_rows() %>%
setNames(names(contact))
which gives
name phone email
1 John 123, 456 john_abc@gmail.com, john_xyz@gmail.com
2 John 789 john_pqr@gmail.com
A more tidyverse
way (thank @akrun's comment)
contact %>%
relocate(name, .after = last_col()) %>%
graph_from_data_frame() %>%
decompose() %>%
map(~ .x %>%
get.data.frame() %>%
reframe(across(everything(), ~ str_c(unique(.x), collapse = ";")), .by = "name")) %>%
list_rbind() %>%
setNames(names(contact))
答案3
得分: 4
这是一个使用data.table
的方法。
setDT(contact)
# 设置键
setkey(contact, name, phone, email)
# 对每个唯一键进行自连接,在操作过程中进行筛选和汇总
ans <- contact[contact, c("phone2", "email2") := {
temp <- contact[name == i.name &
(phone %in% contact[name == i.name & email == i.email, ]$phone |
email %in% contact[name == i.name & phone == i.phone, ]$email), ]
email_temp <- paste0(unique(temp$email), collapse = ";")
phone_temp <- paste0(unique(temp$phone), collapse = ";")
list(phone_temp, email_temp)
}, by = .EACHI]
# 最后一步
unique(ans, by = c("name", "phone2", "email2"))[, .(name, phone = phone2, email = email2)]
解释
对于第一行,变量'temp'的计算如下:
contact[name == 'John' &
(phone %in% contact[name == 'John' & email == 'john_abc@gmail.com', ]$phone |
email %in% contact[name == 'John' & phone == 123, ]$email), ]
然后,使用以下方法将唯一的电子邮件放在一个字符串中:
email_temp <- paste0(unique(temp$email), collapse = ";")
并使用相同的方法将电话号码放在一个字符串中:
phone_temp <- paste0(unique(temp$phone), collapse = ";")
然后将这两个字符串返回到"phone2"和"email2"列。
为每个唯一的键组合重复此操作(.EACHI)。
英文:
here is s data.table
approach
setDT(contact)
# set keys
setkey(contact, name, phone, email)
# self join on each unique key, filter and summarise on the fly
ans <- contact[contact, c("phone2", "email2") := {
temp <- contact[ name == i.name &
(phone %in% contact[name == i.name & email == i.email, ]$phone |
email %in% contact[name == i.name & phone == i.phone, ]$email), ]
email_temp <- paste0(unique(temp$email), collapse = ";")
phone_temp <- paste0(unique(temp$phone), collapse = ";")
list(phone_temp, email_temp)
}, by = .EACHI]
# final step
unique(ans, by = c("name", "phone2", "email2"))[, .(name, phone = phone2, email = email2)]
# name phone email
# 1: John 123;456 john_abc@gmail.com;john_xyz@gmail.com
# 2: John 789 john_pqr@gmail.com
explanation
# so, for the first row, the variable 'temp' is calculated as follows
contact[ name == 'John' &
(phone %in% contact[name == 'John' & email == 'john_abc@gmail.com', ]$phone |
email %in% contact[name == 'John' & phone == 123, ]$email), ]
# name phone email
# 1: John 123 john_abc@gmail.com
# 2: John 456 john_abc@gmail.com
# 3: John 456 john_xyz@gmail.com
# then, put the unique emails together in a string using
# email_temp <- paste0(unique(temp$email), collapse = ";")
# and do the same for the phones using
# phone_temp <- paste0(unique(temp$phone), collapse = ";")
# and return there two strings to the columns "phone2" ans "email2"
#repeat for each unique key-combination (.EACHI)
答案4
得分: 4
以下是代码的翻译部分:
使用 `powerjoin` 包的不同方法:
contact <- tribble(
~name, ~phone, ~email,
"John", 123, "john_abc@gmail.com",
"John", 456, "john_abc@gmail.com",
"John", 456, "john_xyz@gmail.com",
"John", 789, "john_pqr@gmail.com") |>
mutate(row_id = row_number())
library(powerjoin)
library(dplyr)
# 检查电话列中的重复条目
phone_check <- contact |>
power_right_join(filter(contact, duplicated(phone)),
by = c("name", "phone"),
conflict = ~ paste(.x, .y, sep = ";")
) |>
group_by(phone) |>
slice(1) |>
tidyr::separate_rows(row_id) |>
ungroup() |>
select(name, email, row_id)
# 检查电子邮件列中的重复条目
email_check <- contact |>
power_right_join(filter(contact, duplicated(email)),
by = c("name", "email"),
conflict = ~ paste(.x, .y, sep = ";")
) |>
group_by(email) |>
slice(1) |>
tidyr::separate_rows(row_id) |>
ungroup() |>
select(name, phone, row_id)
email_check |> select(name, phone, row_id) |>
inner_join(phone_check, by = c("name", "row_id")) |>
bind_rows(
contact |>
mutate(phone = as.character(phone),
row_id = as.character(row_id)) |>
filter(!row_id %in% c(phone_check$row_id, email_check$row_id))
) |>
select(-row_id)
# 一个 tibble:2 × 3
name phone email
<chr> <chr> <chr>
1 John 123;456 john_abc@gmail.com;john_xyz@gmail.com
2 John 789 john_pqr@gmail.com
英文:
A different approach using the powerjoin
package:
contact <- tribble(
~name, ~phone, ~email,
"John", 123, "john_abc@gmail.com",
"John", 456, "john_abc@gmail.com",
"John", 456, "john_xyz@gmail.com",
"John", 789, "john_pqr@gmail.com") |>
mutate(row_id = row_number())
library(powerjoin)
library(dplyr)
# check duplicated entries in phone column
phone_check <- contact |>
power_right_join(filter(contact, duplicated(phone)),
by = c("name", "phone"),
conflict = ~ paste(.x, .y, sep = ";")
) |>
group_by(phone) |>
slice(1) |>
tidyr::separate_rows(row_id) |>
ungroup() |>
select(name, email, row_id)
# check duplicated entries in email column
email_check <- contact |>
power_right_join(filter(contact, duplicated(email)),
by = c("name", "email"),
conflict = ~ paste(.x, .y, sep = ";")
) |>
group_by(email) |>
slice(1) |>
tidyr::separate_rows(row_id) |>
ungroup() |>
select(name, phone, row_id)
email_check |> select(name, phone, row_id) |>
inner_join(phone_check, by = c("name", "row_id")) |>
bind_rows(
contact |>
mutate(phone = as.character(phone),
row_id = as.character(row_id)) |>
filter(!row_id %in% c(phone_check$row_id, email_check$row_id))
) |>
select(-row_id)
# A tibble: 2 × 3
name phone email
<chr> <chr> <chr>
1 John 123;456 john_abc@gmail.com;john_xyz@gmail.com
2 John 789 john_pqr@gmail.com
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论