2023年5月22日 05:41:18go评论101阅读模式

英文:

Filter only duplicated rows based on condition

问题

Sure, here's the dynamically generated code to achieve your desired output using the tidyverse approach:

library(dplyr)
df %>%
  group_by(ID) %>%
  filter(all(c("NoPlan", "Plan") %in% Q40) | n() == 1) %>%
  ungroup()

This code groups the data by the "ID" column, then filters the groups to keep rows where either all unique values in the "Q40" column are "NoPlan" and "Plan" or there is only one row in the group. Finally, it ungroups the data to get the desired output.

英文:

In the data below, I want to filter the duplicated participants (Ana, Paul) based on their answer to Q40. Participants should have answered only "NoPlan" or "Plan", but sometimes they've answered both. I want to keep only the rows in each the duplicated participants have answered "Plan"

The data is:

  ID     Q40   
  &lt;chr&gt;  &lt;chr&gt; 
1 Maria  Plan  
2 Joseph Plan  
3 Cyntia NoPlan
4 Ana    NoPlan   ## first noPlan
5 Paul   Plan  ## first Plan
6 Ana    Plan  ## then, Plan 
7 Paul   NoPlan ## then, noPlan

Desired Output:

  ID     Q40   
  &lt;chr&gt;  &lt;chr&gt; 
1 Maria  Plan  
2 Joseph Plan  
3 Cyntia NoPlan
5 Paul   Plan  
6 Ana    Plan

I was doing it manually, like this:

df %&gt;% 
  filter(Q40 %in% c(&#39;noPlan&#39;, &#39;Plan&#39;)) %&gt;% 
  ungroup() %&gt;% 
  distinct()
then I was removing manually the duplicated ones with (slice()) and their corresponding position.

QUESTION

1: How can I dinamically remove the rows only of the duplicated participants in ID (who answered "noPlan" and "plan" at the same time), keeping only their "Plan" row? I'm sure this is easy to do with tidyverse , but I'm struggling with it. I really need a tidyverse approach.

Important note:

1 I don't want to filter all "NoPlan" rows, I only want to filter the "Plan" rows of the participants who have answered both NoPlan AND Plan. Only the duplicated ones, not the whole sample.

2 I don't want to keep only the first row such as with distinc() , I want to be able to set the condition myself (in this case, to filter "NoPlan" from the duplicated participants

3 I cannot use distinct(starwars, pick(contains("color"))) because this would remove "NoPlan" from all participants and I need to filter only the duplicated ones

data:

structure(list(ID = c(&quot;Joseph&quot;, &quot;Cyntia&quot;, &quot;Paul&quot;, &quot;Ana&quot;, &quot;Paul&quot;, 
&quot;Maria&quot;, &quot;Ana&quot;), Q40 = c(&quot;Plan&quot;, &quot;NoPlan&quot;, &quot;Plan&quot;, &quot;Plan&quot;, &quot;NoPlan&quot;, 
&quot;Plan&quot;, &quot;NoPlan&quot;)), row.names = c(NA, -7L), class = c(&quot;tbl_df&quot;, 
&quot;tbl&quot;, &quot;data.frame&quot;))

答案1

得分: 2

?duplicated应该在这里有帮助：

dat %>% 
    filter(!((duplicated(ID) | duplicated(ID, fromLast=TRUE)) & Q40 == "NoPlan"))

或者用基本的R语言：

dat[with(dat, !((duplicated(ID) | duplicated(ID, fromLast=TRUE)) & Q40 == "NoPlan")),]

英文:

?duplicated should aid you here:

dat %&gt;%
    filter(!((duplicated(ID) | duplicated(ID, fromLast=TRUE)) &amp; Q40 == &quot;NoPlan&quot;))

Or in base R speak:

dat[with(dat, !((duplicated(ID) | duplicated(ID, fromLast=TRUE)) &amp; Q40 == &quot;NoPlan&quot;)),]

答案2

得分: 1

这是你想要的翻译：

听起来你更喜欢使用 `tidyverse` 的方法，但这里提供了一个使用基本的 R 语言实现的分割-操作-合并的解决方案。你通过 `ID` 进行分割；在具有多行的子集中，选择包含 `Q40` 中“Plan”的行；然后使用 `do.call("rbind", ...)` 将其余的行重新组合。
dat <- structure(list(ID = c("Joseph", "Cyntia", "Paul", "Ana", "Paul", "Maria", "Ana"), Q40 = c("Plan", "NoPlan", "Plan", "Plan", "NoPlan", "Plan", "NoPlan")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
do.call("rbind", lapply(split(dat, dat$ID), function(x) {
  if(nrow(x) > 1) {
    x[x$Q40 == "Plan",]
  } else {
    x
  }
}))

以下是其产生的结果：

  ID     Q40   
* <chr>  <chr> 
1 Ana    Plan  
2 Cyntia NoPlan
3 Joseph Plan  
4 Maria  Plan  
5 Paul   Plan

英文:

It sounds like you'd prefer a tidyverse answer, but here's a base R solution using the split-operate-recombine approach. You split by ID; in subsets with multiple rows, select the rows with "Plan" in Q40; then use do.call("rbind", ...) to put the remaining rows back together.

dat &lt;- structure(list(ID = c(&quot;Joseph&quot;, &quot;Cyntia&quot;, &quot;Paul&quot;, &quot;Ana&quot;, &quot;Paul&quot;, 
&quot;Maria&quot;, &quot;Ana&quot;), Q40 = c(&quot;Plan&quot;, &quot;NoPlan&quot;, &quot;Plan&quot;, &quot;Plan&quot;, &quot;NoPlan&quot;, 
&quot;Plan&quot;, &quot;NoPlan&quot;)), row.names = c(NA, -7L), class = c(&quot;tbl_df&quot;, 
&quot;tbl&quot;, &quot;data.frame&quot;))
do.call(&quot;rbind&quot;, lapply(split(dat, dat$ID), function(x) {
  if(nrow(x) &gt; 1) {
    x[x$Q40 == &quot;Plan&quot;,]
  } else {
    x
  }
}))

And here's what that produces:

  ID     Q40   
* &lt;chr&gt;  &lt;chr&gt; 
1 Ana    Plan  
2 Cyntia NoPlan
3 Joseph Plan  
4 Maria  Plan  
5 Paul   Plan

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

仅基于条件筛选重复的行。

问题

答案1

答案2

运行 sapply 函数，其中有两个输入（变量和数据框）。

R: 在JSON中识别地理坐标

R中用于“连接”/“拼接”单词列表的函数。

条形图和点图，带有重复数据。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。