英文:
R: Within a group, retain entries based on multiple criteria
问题
以下是您要翻译的部分:
我有一个如下所示的表格:
| title | full_name | id |
| -------- | ---------------- | ----- |
|blublu | some title | dk29D |
|blabla | title **blabla1**| **dRAQx**|
|blabla | title **blabla1**| **Yprst**|
|blabla | title **blabla2**| **dRAQx**|
|blabla | title **blabla2**| **Yprst**|
|pips | some other title | qW45m |
表格应该最终如下所示:
| title | full_name | id |
| -------- | ---------------- | ----- |
|blublu | some title | dk29D |
|blabla | title **blabla1** | **dRAQx** |
|blabla | title **blabla2** | **Yprst** |
|pips | some other title | qW45m |
因此,我不在乎blabla1是否与dRAQx或Yprst相关(blabla2也一样)。关键是要保留标题的一个实例和id的一个实例。
sampledata <- data.frame(
title = c("blublu", rep("blabla", 4), "pips"),
full_name = c("some title", rep("title blabla1", 2), rep("title blabla2", 2), "some other title"),
id = c("dk29D", rep(c("dRAQx","Yprst"), 2), "qW45m")
)
我尝试了很多方法,但最佳结果是:
sampledata %>%
group_by(title) %>%
mutate(id = sample(id, length(id), replace = FALSE)) %>%
distinct(full_name, .keep_all = TRUE) %>%
mutate(same_id = any(duplicated(id)))
问题在于它是随机的,有时保留的ID相同(由same_id变量跟踪)。我希望我的问题能够理解。
期待您的回复。提前感谢您。
英文:
I have a table such as outlined below:
title | full_name | id |
---|---|---|
blublu | some title | dk29D |
blabla | title blabla1 | dRAQx |
blabla | title blabla1 | Yprst |
blabla | title blabla2 | dRAQx |
blabla | title blabla2 | Yprst |
pips | some other title | qW45m |
The table should end up looking as follows:
title | full_name | id |
---|---|---|
blublu | some title | dk29D |
blabla | title blabla1 | dRAQx |
blabla | title blabla2 | Yprst |
pips | some other title | qW45m |
Thereby, I dont care if blabla1 is associated with dRAQx or Yprst (same for blabla2). What is crucial is that one instance of the title is maintained and one instance of the id.
sampledata <- data.frame(
title = c("blublu", rep("blabla", 4), "pips"),
full_name = c("some title", rep("title blabla1", 2), rep("title blabla2", 2), "some other title"),
id = c("dk29D", rep(c("dRAQx","Yprst"), 2), "qW45m")
)
I've tried many things but the best result was:
sampledata %>%
group_by(title) %>%
mutate(id = sample(id, length(id), replace = FALSE)) %>%
distinct(full_name, .keep_all = TRUE) %>%
mutate(same_id = any(duplicated(id)))
The issue here is that it's random and sometimes the IDs that are retained are the same (as kept track of by the same_id variable).
I hope my issue is understandable.
I'm looking forward to your responses. Thank you in advance.
答案1
得分: 0
以下是翻译好的内容:
我们可以这样做:
关键是使用full_name
中的数字作为应该被筛选的row_number
:
library(dplyr)
library(parse_number)
sampledata %>%
group_by(title, full_name) %>%
mutate(x = coalesce(parse_number(full_name), 1)) %>%
filter(x == row_number()) %>%
ungroup() %>%
select(-x)
title full_name id
1 blublu some title dk29D
2 blabla title blabla1 dRAQx
3 blabla title blabla2 Yprst
4 pips some other title qW45m
警告信息:
在`mutate()`中有2个警告。
第一个警告是:
ℹ 在参数: `x = coalesce(parse_number(full_name), 1)`。
ℹ 在第3组:`title = "blublu"`, `full_name = "some title"。
由警告引起:
! 1个解析失败。
行 列 期望 实际
1 -- 一个数字 some title
ℹ 运行dplyr::last_dplyr_warnings()以查看剩下的1个警告。
英文:
We could do it this way:
The trick is to use the number in full_name
as the row_number that should be filtered:
library(dplyr)
library(parse_number)
sampledata %>%
group_by(title, full_name) %>%
mutate(x = coalesce(parse_number(full_name), 1)) %>%
filter(x == row_number()) %>%
ungroup() %>%
select(-x)
title full_name id
<chr> <chr> <chr>
1 blublu some title dk29D
2 blabla title blabla1 dRAQx
3 blabla title blabla2 Yprst
4 pips some other title qW45m
Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `x = coalesce(parse_number(full_name), 1)`.
ℹ In group 3: `title = "blublu"`, `full_name = "some title"`.
Caused by warning:
! 1 parsing failure.
row col expected actual
1 -- a number some title
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.
答案2
得分: 0
自我发布问题后,我想到了一种可能的解决方案,虽然不够优雅,但却能完成任务。也许对将来的其他人有所帮助:
sampledata <- sampledata %>%
group_by(title) %>%
mutate(id_count = n_distinct(id)) %>%
ungroup() %>%
arrange(title, id) %>%
group_by(title) %>%
group_modify(~ {
x <- .$id_count[1]
if (x != 1) {
x <- x+1
}
slice(., seq(1, nrow(.), x[1]))
}) %>%
ungroup() %>%
select(-id_count)
这段代码首先计算每个标题组中有多少个ID。然后,它根据标题和ID对表进行排序,以便在下一个步骤中选择正确的行。
group_modify保留每个行,如果标题只关联一个ID。如果标题关联两个ID,它仅保留每组的第1行和第4行(=1+2+1)。如果每组有三个ID,它保留第1行、第5行(1+3+1)和第9行(5+3+1)等。这样,所有标题都被保留,所有ID,每组的行数都是最大不同的(因为首先对表进行排序)。
随时使用2和3次重复进行测试:
sampledata <- data.frame(
title = c("blublu", rep("blabla", 4), "pips"),
full_name = c("some title", rep("title blabla1", 2), rep("title blabla2", 2), "some other title"),
id = c("dk29D", rep(c("dRAQx","Yprst"), 2), "qW45m")
)
sampledata <- data.frame(
title = c("blublu", rep("blabla", 9), "pips"),
full_name = c("some title", rep("title blabla1", 3), rep("title blabla2", 3), rep("title blabla3", 3), "some other title"),
id = c("dk29D", rep(c("dRAQx","Yprst","Aw3r8"), 3), "qW45m")
)
以上是你提供的代码和示例数据的翻译。
英文:
Since I posted my question I thought of a possible solution that is not elegant but does its job. Maybe it'll be helpful for others in future:
sampledata <- sampledata %>%
group_by(title) %>%
mutate(id_count = n_distinct(id)) %>%
ungroup() %>%
arrange(title, id) %>%
group_by(title) %>%
group_modify(~ {
x <- .$id_count[1]
if (x != 1) {
x <- x+1
}
slice(., seq(1, nrow(.), x[1]))
}) %>%
ungroup() %>%
select(-id_count)
This code first counts, how many IDs are found within each title-group. Then it orders the table based on title and id so that in the text step, the correct rows will be chosen.
group_modify keeps every row if there is only 1 ID associated with a title. If there are two IDs associated with the title, it keeps only rows 1 and 4 (=1+2+1) per group. If there are three IDs per group, it retains rows 1, 5 (1+3+1) and 9 (5+3+1) etc. This way, all titles are retained and all IDs, the rows per group are maximally different (due to ordering the table first).
Feel free to test with 2 and 3 repetitions:
sampledata <- data.frame(
title = c("blublu", rep("blabla", 4), "pips"),
full_name = c("some title", rep("title blabla1", 2), rep("title blabla2", 2), "some other title"),
id = c("dk29D", rep(c("dRAQx","Yprst"), 2), "qW45m")
)
sampledata <- data.frame(
title = c("blublu", rep("blabla", 9), "pips"),
full_name = c("some title", rep("title blabla1", 3), rep("title blabla2", 3), rep("title blabla3", 3), "some other title"),
id = c("dk29D", rep(c("dRAQx","Yprst","Aw3r8"), 3), "qW45m")
)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论