仅基于条件筛选重复的行。

huangapple go评论71阅读模式
英文:

Filter only duplicated rows based on condition

问题

Sure, here's the dynamically generated code to achieve your desired output using the tidyverse approach:

library(dplyr)

df %>%
  group_by(ID) %>%
  filter(all(c("NoPlan", "Plan") %in% Q40) | n() == 1) %>%
  ungroup()

This code groups the data by the "ID" column, then filters the groups to keep rows where either all unique values in the "Q40" column are "NoPlan" and "Plan" or there is only one row in the group. Finally, it ungroups the data to get the desired output.

英文:

In the data below, I want to filter the duplicated participants (Ana, Paul) based on their answer to Q40. Participants should have answered only "NoPlan" or "Plan", but sometimes they've answered both. I want to keep only the rows in each the duplicated participants have answered "Plan"

  • The data is:
  ID     Q40   
  <chr>  <chr> 
1 Maria  Plan  
2 Joseph Plan  
3 Cyntia NoPlan
4 Ana    NoPlan   ## first noPlan
5 Paul   Plan  ## first Plan
6 Ana    Plan  ## then, Plan 
7 Paul   NoPlan ## then, noPlan 
  • Desired Output:
  ID     Q40   
  <chr>  <chr> 
1 Maria  Plan  
2 Joseph Plan  
3 Cyntia NoPlan
5 Paul   Plan  
6 Ana    Plan  
  • I was doing it manually, like this:
df %>% 
  filter(Q40 %in% c('noPlan', 'Plan')) %>% 
  ungroup() %>% 
  distinct()

then I was removing manually the duplicated ones with (slice()) and their corresponding position.

QUESTION

1: How can I dinamically remove the rows only of the duplicated participants in ID (who answered "noPlan" and "plan" at the same time), keeping only their "Plan" row? I'm sure this is easy to do with tidyverse , but I'm struggling with it. I really need a tidyverse approach.

Important note:

1 I don't want to filter all "NoPlan" rows, I only want to filter the "Plan" rows of the participants who have answered both NoPlan AND Plan. Only the duplicated ones, not the whole sample.

2 I don't want to keep only the first row such as with distinc() , I want to be able to set the condition myself (in this case, to filter "NoPlan" from the duplicated participants

3 I cannot use distinct(starwars, pick(contains("color"))) because this would remove "NoPlan" from all participants and I need to filter only the duplicated ones

  • data:
structure(list(ID = c("Joseph", "Cyntia", "Paul", "Ana", "Paul", 
"Maria", "Ana"), Q40 = c("Plan", "NoPlan", "Plan", "Plan", "NoPlan", 
"Plan", "NoPlan")), row.names = c(NA, -7L), class = c("tbl_df", 
"tbl", "data.frame"))

答案1

得分: 2

?duplicated应该在这里有帮助:

dat %>% 
    filter(!((duplicated(ID) | duplicated(ID, fromLast=TRUE)) & Q40 == "NoPlan"))

或者用基本的R语言:

dat[with(dat, !((duplicated(ID) | duplicated(ID, fromLast=TRUE)) & Q40 == "NoPlan")),]
英文:

?duplicated should aid you here:

dat %>%
    filter(!((duplicated(ID) | duplicated(ID, fromLast=TRUE)) & Q40 == "NoPlan"))

Or in base R speak:

dat[with(dat, !((duplicated(ID) | duplicated(ID, fromLast=TRUE)) & Q40 == "NoPlan")),]

答案2

得分: 1

这是你想要的翻译:

听起来你更喜欢使用 `tidyverse` 的方法,但这里提供了一个使用基本的 R 语言实现的分割-操作-合并的解决方案。你通过 `ID` 进行分割;在具有多行的子集中,选择包含 `Q40` 中“Plan”的行;然后使用 `do.call("rbind", ...)` 将其余的行重新组合。

dat <- structure(list(ID = c("Joseph", "Cyntia", "Paul", "Ana", "Paul", "Maria", "Ana"), Q40 = c("Plan", "NoPlan", "Plan", "Plan", "NoPlan", "Plan", "NoPlan")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))

do.call("rbind", lapply(split(dat, dat$ID), function(x) {

  if(nrow(x) > 1) {

    x[x$Q40 == "Plan",]

  } else {

    x

  }

}))

以下是其产生的结果:

  ID     Q40   
* <chr>  <chr> 
1 Ana    Plan  
2 Cyntia NoPlan
3 Joseph Plan  
4 Maria  Plan  
5 Paul   Plan
英文:

It sounds like you'd prefer a tidyverse answer, but here's a base R solution using the split-operate-recombine approach. You split by ID; in subsets with multiple rows, select the rows with "Plan" in Q40; then use do.call(&quot;rbind&quot;, ...) to put the remaining rows back together.

dat &lt;- structure(list(ID = c(&quot;Joseph&quot;, &quot;Cyntia&quot;, &quot;Paul&quot;, &quot;Ana&quot;, &quot;Paul&quot;, 
&quot;Maria&quot;, &quot;Ana&quot;), Q40 = c(&quot;Plan&quot;, &quot;NoPlan&quot;, &quot;Plan&quot;, &quot;Plan&quot;, &quot;NoPlan&quot;, 
&quot;Plan&quot;, &quot;NoPlan&quot;)), row.names = c(NA, -7L), class = c(&quot;tbl_df&quot;, 
&quot;tbl&quot;, &quot;data.frame&quot;))

do.call(&quot;rbind&quot;, lapply(split(dat, dat$ID), function(x) {

  if(nrow(x) &gt; 1) {

    x[x$Q40 == &quot;Plan&quot;,]

  } else {

    x

  }

}))

And here's what that produces:

  ID     Q40   
* &lt;chr&gt;  &lt;chr&gt; 
1 Ana    Plan  
2 Cyntia NoPlan
3 Joseph Plan  
4 Maria  Plan  
5 Paul   Plan

huangapple
  • 本文由 发表于 2023年5月22日 05:41:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76302039.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定