R: 在一个群组中,根据多个条件保留条目

huangapple go评论61阅读模式
英文:

R: Within a group, retain entries based on multiple criteria

问题

以下是您要翻译的部分:

我有一个如下所示的表格:

| title     | full_name        | id    |
| --------  | ---------------- | ----- |
|blublu | some title       | dk29D |
|blabla | title **blabla1**| **dRAQx**|
|blabla | title **blabla1**| **Yprst**|
|blabla | title **blabla2**| **dRAQx**|
|blabla | title **blabla2**| **Yprst**|
|pips   | some other title | qW45m |

表格应该最终如下所示:

| title     | full_name        | id    |
| --------  | ---------------- | ----- |
|blublu | some title       | dk29D |
|blabla | title **blabla1**    | **dRAQx** |
|blabla | title **blabla2**    | **Yprst** |
|pips   | some other title | qW45m |

因此,我不在乎blabla1是否与dRAQx或Yprst相关(blabla2也一样)。关键是要保留标题的一个实例和id的一个实例。
sampledata <- data.frame(
  title = c("blublu", rep("blabla", 4), "pips"),
  full_name = c("some title", rep("title blabla1", 2), rep("title blabla2", 2), "some other title"),
  id = c("dk29D", rep(c("dRAQx","Yprst"), 2), "qW45m")
)

我尝试了很多方法,但最佳结果是:

sampledata %>% 
      group_by(title) %>% 
      mutate(id = sample(id, length(id), replace = FALSE)) %>%
      distinct(full_name, .keep_all = TRUE) %>%
      mutate(same_id = any(duplicated(id)))

问题在于它是随机的,有时保留的ID相同(由same_id变量跟踪)。我希望我的问题能够理解。

期待您的回复。提前感谢您。

英文:

I have a table such as outlined below:

title full_name id
blublu some title dk29D
blabla title blabla1 dRAQx
blabla title blabla1 Yprst
blabla title blabla2 dRAQx
blabla title blabla2 Yprst
pips some other title qW45m

The table should end up looking as follows:

title full_name id
blublu some title dk29D
blabla title blabla1 dRAQx
blabla title blabla2 Yprst
pips some other title qW45m

Thereby, I dont care if blabla1 is associated with dRAQx or Yprst (same for blabla2). What is crucial is that one instance of the title is maintained and one instance of the id.

sampledata &lt;- data.frame(
  title = c(&quot;blublu&quot;, rep(&quot;blabla&quot;, 4), &quot;pips&quot;),
  full_name = c(&quot;some title&quot;, rep(&quot;title blabla1&quot;, 2), rep(&quot;title blabla2&quot;, 2), &quot;some other title&quot;),
  id = c(&quot;dk29D&quot;, rep(c(&quot;dRAQx&quot;,&quot;Yprst&quot;), 2), &quot;qW45m&quot;)
)

I've tried many things but the best result was:

sampledata %&gt;% 
      group_by(title) %&gt;% 
      mutate(id = sample(id, length(id), replace = FALSE)) %&gt;%
      distinct(full_name, .keep_all = TRUE) %&gt;%
      mutate(same_id = any(duplicated(id)))

The issue here is that it's random and sometimes the IDs that are retained are the same (as kept track of by the same_id variable).

I hope my issue is understandable.

I'm looking forward to your responses. Thank you in advance.

答案1

得分: 0

以下是翻译好的内容:

我们可以这样做:

关键是使用full_name中的数字作为应该被筛选的row_number

library(dplyr)
library(parse_number)

sampledata %>%
  group_by(title, full_name) %>%
  mutate(x = coalesce(parse_number(full_name), 1)) %>%
  filter(x == row_number()) %>%
  ungroup() %>%
  select(-x)
  title   full_name        id
1 blublu  some title      dk29D
2 blabla  title blabla1   dRAQx
3 blabla  title blabla2   Yprst
4 pips    some other title qW45m
警告信息:
`mutate()`中有2个警告。
第一个警告是:
ℹ 在参数: `x = coalesce(parse_number(full_name), 1)`。
ℹ 在第3组:`title = "blublu"`, `full_name = "some title"由警告引起:
! 1个解析失败。
行 列  期望      实际
  1   --   一个数字  some title
ℹ 运行dplyr::last_dplyr_warnings()以查看剩下的1个警告。
英文:

We could do it this way:

The trick is to use the number in full_name as the row_number that should be filtered:

library(dplyr)
library(parse_number)

sampledata %&gt;%  
  group_by(title, full_name) %&gt;% 
  mutate(x = coalesce(parse_number(full_name), 1)) %&gt;% 
  filter(x == row_number()) %&gt;% 
  ungroup() %&gt;% 
  select(-x)

  title  full_name        id   
  &lt;chr&gt;  &lt;chr&gt;            &lt;chr&gt;
1 blublu some title       dk29D
2 blabla title blabla1    dRAQx
3 blabla title blabla2    Yprst
4 pips   some other title qW45m
Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `x = coalesce(parse_number(full_name), 1)`.
ℹ In group 3: `title = &quot;blublu&quot;`, `full_name = &quot;some title&quot;`.
Caused by warning:
! 1 parsing failure.
row col expected     actual
  1  -- a number some title
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.

答案2

得分: 0

自我发布问题后,我想到了一种可能的解决方案,虽然不够优雅,但却能完成任务。也许对将来的其他人有所帮助:

sampledata <- sampledata %>%
  group_by(title) %>%
  mutate(id_count = n_distinct(id)) %>%
  ungroup() %>%
  arrange(title, id) %>%
  group_by(title) %>%
  group_modify(~ {
    x <- .$id_count[1]
    if (x != 1) {
      x <- x+1
    }
    slice(., seq(1, nrow(.), x[1]))
  }) %>%
  ungroup() %>%
  select(-id_count)

这段代码首先计算每个标题组中有多少个ID。然后,它根据标题和ID对表进行排序,以便在下一个步骤中选择正确的行。

group_modify保留每个行,如果标题只关联一个ID。如果标题关联两个ID,它仅保留每组的第1行和第4行(=1+2+1)。如果每组有三个ID,它保留第1行、第5行(1+3+1)和第9行(5+3+1)等。这样,所有标题都被保留,所有ID,每组的行数都是最大不同的(因为首先对表进行排序)。

随时使用2和3次重复进行测试:

sampledata <- data.frame(
  title = c("blublu", rep("blabla", 4), "pips"),
  full_name = c("some title", rep("title blabla1", 2), rep("title blabla2", 2), "some other title"),
  id = c("dk29D", rep(c("dRAQx","Yprst"), 2), "qW45m")
)

sampledata <- data.frame(
  title = c("blublu", rep("blabla", 9), "pips"),
  full_name = c("some title", rep("title blabla1", 3), rep("title blabla2", 3), rep("title blabla3", 3), "some other title"),
  id = c("dk29D", rep(c("dRAQx","Yprst","Aw3r8"), 3), "qW45m")
)

以上是你提供的代码和示例数据的翻译。

英文:

Since I posted my question I thought of a possible solution that is not elegant but does its job. Maybe it'll be helpful for others in future:

sampledata &lt;- sampledata %&gt;% 
  group_by(title) %&gt;% 
  mutate(id_count = n_distinct(id)) %&gt;%
  ungroup() %&gt;%
  arrange(title, id) %&gt;%
  group_by(title) %&gt;%
  group_modify(~ {
    x &lt;- .$id_count[1]
    if (x != 1) {
      x &lt;- x+1
    }
    slice(., seq(1, nrow(.), x[1]))
  }) %&gt;%
  ungroup() %&gt;%
  select(-id_count)

This code first counts, how many IDs are found within each title-group. Then it orders the table based on title and id so that in the text step, the correct rows will be chosen.

group_modify keeps every row if there is only 1 ID associated with a title. If there are two IDs associated with the title, it keeps only rows 1 and 4 (=1+2+1) per group. If there are three IDs per group, it retains rows 1, 5 (1+3+1) and 9 (5+3+1) etc. This way, all titles are retained and all IDs, the rows per group are maximally different (due to ordering the table first).

Feel free to test with 2 and 3 repetitions:

sampledata &lt;- data.frame(
  title = c(&quot;blublu&quot;, rep(&quot;blabla&quot;, 4), &quot;pips&quot;),
  full_name = c(&quot;some title&quot;, rep(&quot;title blabla1&quot;, 2), rep(&quot;title blabla2&quot;, 2), &quot;some other title&quot;),
  id = c(&quot;dk29D&quot;, rep(c(&quot;dRAQx&quot;,&quot;Yprst&quot;), 2), &quot;qW45m&quot;)
)

sampledata &lt;- data.frame(
  title = c(&quot;blublu&quot;, rep(&quot;blabla&quot;, 9), &quot;pips&quot;),
  full_name = c(&quot;some title&quot;, rep(&quot;title blabla1&quot;, 3), rep(&quot;title blabla2&quot;, 3), rep(&quot;title blabla3&quot;, 3), &quot;some other title&quot;),
  id = c(&quot;dk29D&quot;, rep(c(&quot;dRAQx&quot;,&quot;Yprst&quot;,&quot;Aw3r8&quot;), 3), &quot;qW45m&quot;)
)

huangapple
  • 本文由 发表于 2023年3月31日 22:56:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/75899957.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定