R + dplyr: Partial Deduplication of Rows in a Tibble

Question

A very common question is how to remove all duplicated rows from a data frame in R, which can be done with a variety of tools (I like dplyr's distinct()).

However, what if your dataset contains several duplicated rows and you want to remove only the duplicates that match a particular combination of variable values, rather than all of them?

I do not know how to achieve that, so any suggestion is welcome.

Please have a look at the reprex at the end of the post.

Thanks!

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union


df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
             z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
             )
df
#> # A tibble: 15 × 3
#>        x y     z    
#>    <int> <chr> <chr>
#>  1     1 A     h    
#>  2     2 B     j    
#>  3     3 C     k    
#>  4     4 D     t    
#>  5     5 E     u    
#>  6     1 A     h    
#>  7     2 B     j    
#>  8     3 C     k    
#>  9     4 D     t    
#> 10     5 E     u    
#> 11     1 A     A    
#> 12     2 B     B    
#> 13     3 C     C    
#> 14     4 D     D    
#> 15     5 E     E

df_ded <- df |>
    distinct()

df_ded
#> # A tibble: 10 × 3
#>        x y     z    
#>    <int> <chr> <chr>
#>  1     1 A     h    
#>  2     2 B     j    
#>  3     3 C     k    
#>  4     4 D     t    
#>  5     5 E     u    
#>  6     1 A     A    
#>  7     2 B     B    
#>  8     3 C     C    
#>  9     4 D     D    
#> 10     5 E     E

## I want to deduplicate only the rows with x==3 and z=="k"

df_ded_partial <- df |>
    distinct(x==3, z=="k") ## but this is not what I mean.

## How to achieve it?

df_ded_partial
#> # A tibble: 3 × 2
#>   `x == 3` `z == "k"`
#>   <lgl>    <lgl>     
#> 1 FALSE    FALSE     
#> 2 TRUE     TRUE      
#> 3 TRUE     FALSE

Created on 2023-02-14 with reprex v2.0.2

Answer 1

Score: 5

We can use group_modify() and check the condition through its .y argument, a one-row tibble holding the grouping keys of the current group, while .x holds that group's rows. So we can say: if the condition is met, return distinct(.x); otherwise return the whole group .x unchanged.

library(dplyr)

df |>
  group_by(x, z) |>
  group_modify(~ if(.y$x == 3 && .y$z == "k") distinct(.x) else .x)

#> # A tibble: 14 x 3
#> # Groups:   x, z [10]
#>        x z     y    
#>    <int> <chr> <chr>
#>  1     1 A     A    
#>  2     1 h     A    
#>  3     1 h     A    
#>  4     2 B     B    
#>  5     2 j     B    
#>  6     2 j     B    
#>  7     3 C     C    
#>  8     3 k     C    
#>  9     4 D     D    
#> 10     4 t     D    
#> 11     4 t     D    
#> 12     5 E     E    
#> 13     5 u     E    
#> 14     5 u     E
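
The result above is still grouped by x and z, and the grouping columns have moved to the front (x, z, y). If a plain tibble with the original column order is wanted, one possible follow-up is sketched below. It assumes the same df as in the question; the name result is only illustrative.

```r
library(dplyr)

# Same data as in the question.
df <- tibble(x = rep(seq(5), 3), y = rep(LETTERS[1:5], 3),
             z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5]))

result <- df |>
  group_by(x, z) |>
  # Deduplicate only inside the group whose keys are x == 3 and z == "k".
  group_modify(~ if (.y$x == 3 && .y$z == "k") distinct(.x) else .x) |>
  # Drop the grouping so later verbs operate on the whole tibble.
  ungroup() |>
  # Restore the column order of the input (x, y, z).
  select(all_of(names(df)))

result
```

Rows still come back ordered by group rather than in their original positions, so this is only suitable when row order does not matter.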

Data from OP

df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
             z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)

Created on 2023-02-14 by the reprex package (v2.0.1)
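
As an alternative that avoids grouping altogether, the same partial deduplication can probably be written as a single filter(): drop a row only when it both matches the condition and repeats an earlier row. This is a minimal sketch rather than part of the answer above. It relies on base R's duplicated() applied to the whole data frame, and is_dup is a helper name introduced here for illustration.

```r
library(dplyr)

# Same data as in the question.
df <- tibble(x = rep(seq(5), 3), y = rep(LETTERS[1:5], 3),
             z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5]))

# TRUE for every row that is an exact repeat of an earlier row.
is_dup <- duplicated(df)

df_ded_partial <- df |>
  # Remove a row only if it matches the condition AND is a repeat;
  # all other rows, including other duplicates, are kept.
  filter(!(x == 3 & z == "k" & is_dup))

df_ded_partial
```

Unlike the grouped version, this keeps the original row and column order and returns an ungrouped tibble.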

huangapple
  • Posted on 2023-02-14 19:07:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75446942.html