2023年3月31日 22:56:08go评论87阅读模式

英文:

R: Within a group, retain entries based on multiple criteria

问题

以下是您要翻译的部分：

我有一个如下所示的表格：
| title     | full_name        | id    |
| --------  | ---------------- | ----- |
|blublu | some title       | dk29D |
|blabla | title **blabla1**| **dRAQx**|
|blabla | title **blabla1**| **Yprst**|
|blabla | title **blabla2**| **dRAQx**|
|blabla | title **blabla2**| **Yprst**|
|pips   | some other title | qW45m |
表格应该最终如下所示：
| title     | full_name        | id    |
| --------  | ---------------- | ----- |
|blublu | some title       | dk29D |
|blabla | title **blabla1**    | **dRAQx** |
|blabla | title **blabla2**    | **Yprst** |
|pips   | some other title | qW45m |
因此，我不在乎blabla1是否与dRAQx或Yprst相关（blabla2也一样）。关键是要保留标题的一个实例和id的一个实例。

sampledata <- data.frame(
  title = c("blublu", rep("blabla", 4), "pips"),
  full_name = c("some title", rep("title blabla1", 2), rep("title blabla2", 2), "some other title"),
  id = c("dk29D", rep(c("dRAQx","Yprst"), 2), "qW45m")
)

我尝试了很多方法，但最佳结果是：

sampledata %>% 
      group_by(title) %>% 
      mutate(id = sample(id, length(id), replace = FALSE)) %>%
      distinct(full_name, .keep_all = TRUE) %>%
      mutate(same_id = any(duplicated(id)))

问题在于它是随机的，有时保留的ID相同（由same_id变量跟踪）。我希望我的问题能够理解。

期待您的回复。提前感谢您。

英文:

I have a table such as outlined below:

title	full_name	id
blublu	some title	dk29D
blabla	title blabla1	dRAQx
blabla	title blabla1	Yprst
blabla	title blabla2	dRAQx
blabla	title blabla2	Yprst
pips	some other title	qW45m

The table should end up looking as follows:

title	full_name	id
blublu	some title	dk29D
blabla	title blabla1	dRAQx
blabla	title blabla2	Yprst
pips	some other title	qW45m

Thereby, I dont care if blabla1 is associated with dRAQx or Yprst (same for blabla2). What is crucial is that one instance of the title is maintained and one instance of the id.

sampledata &lt;- data.frame(
  title = c(&quot;blublu&quot;, rep(&quot;blabla&quot;, 4), &quot;pips&quot;),
  full_name = c(&quot;some title&quot;, rep(&quot;title blabla1&quot;, 2), rep(&quot;title blabla2&quot;, 2), &quot;some other title&quot;),
  id = c(&quot;dk29D&quot;, rep(c(&quot;dRAQx&quot;,&quot;Yprst&quot;), 2), &quot;qW45m&quot;)
)

I've tried many things but the best result was:

sampledata %&gt;% 
      group_by(title) %&gt;% 
      mutate(id = sample(id, length(id), replace = FALSE)) %&gt;%
      distinct(full_name, .keep_all = TRUE) %&gt;%
      mutate(same_id = any(duplicated(id)))

The issue here is that it's random and sometimes the IDs that are retained are the same (as kept track of by the same_id variable).

I hope my issue is understandable.

I'm looking forward to your responses. Thank you in advance.

答案1

得分: 0

以下是翻译好的内容：

我们可以这样做：

关键是使用full_name中的数字作为应该被筛选的row_number：

library(dplyr)
library(parse_number)
sampledata %>%
  group_by(title, full_name) %>%
  mutate(x = coalesce(parse_number(full_name), 1)) %>%
  filter(x == row_number()) %>%
  ungroup() %>%
  select(-x)

  title   full_name        id
1 blublu  some title      dk29D
2 blabla  title blabla1   dRAQx
3 blabla  title blabla2   Yprst
4 pips    some other title qW45m
警告信息：
在`mutate()`中有2个警告。
第一个警告是：
ℹ 在参数: `x = coalesce(parse_number(full_name), 1)`。
ℹ 在第3组：`title = "blublu"`, `full_name = "some title"。
由警告引起：
! 1个解析失败。
行 列  期望      实际
  1   --   一个数字  some title
ℹ 运行dplyr::last_dplyr_warnings()以查看剩下的1个警告。

英文:

We could do it this way:

The trick is to use the number in full_name as the row_number that should be filtered:

library(dplyr)
library(parse_number)
sampledata %&gt;%  
  group_by(title, full_name) %&gt;% 
  mutate(x = coalesce(parse_number(full_name), 1)) %&gt;% 
  filter(x == row_number()) %&gt;% 
  ungroup() %&gt;% 
  select(-x)

  title  full_name        id   
  &lt;chr&gt;  &lt;chr&gt;            &lt;chr&gt;
1 blublu some title       dk29D
2 blabla title blabla1    dRAQx
3 blabla title blabla2    Yprst
4 pips   some other title qW45m
Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `x = coalesce(parse_number(full_name), 1)`.
ℹ In group 3: `title = &quot;blublu&quot;`, `full_name = &quot;some title&quot;`.
Caused by warning:
! 1 parsing failure.
row col expected     actual
  1  -- a number some title
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.

答案2

得分: 0

自我发布问题后，我想到了一种可能的解决方案，虽然不够优雅，但却能完成任务。也许对将来的其他人有所帮助：

sampledata <- sampledata %>%
  group_by(title) %>%
  mutate(id_count = n_distinct(id)) %>%
  ungroup() %>%
  arrange(title, id) %>%
  group_by(title) %>%
  group_modify(~ {
    x <- .$id_count[1]
    if (x != 1) {
      x <- x+1
    }
    slice(., seq(1, nrow(.), x[1]))
  }) %>%
  ungroup() %>%
  select(-id_count)

这段代码首先计算每个标题组中有多少个ID。然后，它根据标题和ID对表进行排序，以便在下一个步骤中选择正确的行。

group_modify保留每个行，如果标题只关联一个ID。如果标题关联两个ID，它仅保留每组的第1行和第4行（=1+2+1）。如果每组有三个ID，它保留第1行、第5行（1+3+1）和第9行（5+3+1）等。这样，所有标题都被保留，所有ID，每组的行数都是最大不同的（因为首先对表进行排序）。

随时使用2和3次重复进行测试：

sampledata <- data.frame(
  title = c("blublu", rep("blabla", 4), "pips"),
  full_name = c("some title", rep("title blabla1", 2), rep("title blabla2", 2), "some other title"),
  id = c("dk29D", rep(c("dRAQx","Yprst"), 2), "qW45m")
)
sampledata <- data.frame(
  title = c("blublu", rep("blabla", 9), "pips"),
  full_name = c("some title", rep("title blabla1", 3), rep("title blabla2", 3), rep("title blabla3", 3), "some other title"),
  id = c("dk29D", rep(c("dRAQx","Yprst","Aw3r8"), 3), "qW45m")
)

以上是你提供的代码和示例数据的翻译。

英文:

Since I posted my question I thought of a possible solution that is not elegant but does its job. Maybe it'll be helpful for others in future:

sampledata &lt;- sampledata %&gt;% 
  group_by(title) %&gt;% 
  mutate(id_count = n_distinct(id)) %&gt;%
  ungroup() %&gt;%
  arrange(title, id) %&gt;%
  group_by(title) %&gt;%
  group_modify(~ {
    x &lt;- .$id_count[1]
    if (x != 1) {
      x &lt;- x+1
    }
    slice(., seq(1, nrow(.), x[1]))
  }) %&gt;%
  ungroup() %&gt;%
  select(-id_count)

This code first counts, how many IDs are found within each title-group. Then it orders the table based on title and id so that in the text step, the correct rows will be chosen.

group_modify keeps every row if there is only 1 ID associated with a title. If there are two IDs associated with the title, it keeps only rows 1 and 4 (=1+2+1) per group. If there are three IDs per group, it retains rows 1, 5 (1+3+1) and 9 (5+3+1) etc. This way, all titles are retained and all IDs, the rows per group are maximally different (due to ordering the table first).

Feel free to test with 2 and 3 repetitions:

sampledata &lt;- data.frame(
  title = c(&quot;blublu&quot;, rep(&quot;blabla&quot;, 4), &quot;pips&quot;),
  full_name = c(&quot;some title&quot;, rep(&quot;title blabla1&quot;, 2), rep(&quot;title blabla2&quot;, 2), &quot;some other title&quot;),
  id = c(&quot;dk29D&quot;, rep(c(&quot;dRAQx&quot;,&quot;Yprst&quot;), 2), &quot;qW45m&quot;)
)
sampledata &lt;- data.frame(
  title = c(&quot;blublu&quot;, rep(&quot;blabla&quot;, 9), &quot;pips&quot;),
  full_name = c(&quot;some title&quot;, rep(&quot;title blabla1&quot;, 3), rep(&quot;title blabla2&quot;, 3), rep(&quot;title blabla3&quot;, 3), &quot;some other title&quot;),
  id = c(&quot;dk29D&quot;, rep(c(&quot;dRAQx&quot;,&quot;Yprst&quot;,&quot;Aw3r8&quot;), 3), &quot;qW45m&quot;)
)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

R: 在一个群组中，根据多个条件保留条目

问题

答案1

答案2

无法编译 ‘sass’，当尝试使用R安装包时。

随机抽样一个数据框，直到检测到所有个体。

如何在R中创建一个类似Excel中具有多个标题行和列的数据透视表？

cor_auto giving different results for missing = 'listwise' vs 'pairwise' for correlation with two variables

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。