R + dplyr: Partial Deduplication of Rows in a Tibble

Question

A very common question is how to remove all duplicated rows from a data frame in R, which can be done with a variety of tools (I like dplyr's distinct()).

However, what if your dataset contains several duplicated rows and you want to remove only those that match a particular combination of variable values, rather than all of them?

I do not know how to achieve that, so any suggestion is welcome.

Please have a look at the reprex at the end of the post.

Thanks!

    library(dplyr)
    #>
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #>
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #>
    #>     intersect, setdiff, setequal, union

    df <- tibble(x = rep(seq(5), 3), y = rep(LETTERS[1:5], 3),
                 z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5])
    )
    df
    #> # A tibble: 15 × 3
    #>        x y     z
    #>    <int> <chr> <chr>
    #>  1     1 A     h
    #>  2     2 B     j
    #>  3     3 C     k
    #>  4     4 D     t
    #>  5     5 E     u
    #>  6     1 A     h
    #>  7     2 B     j
    #>  8     3 C     k
    #>  9     4 D     t
    #> 10     5 E     u
    #> 11     1 A     A
    #> 12     2 B     B
    #> 13     3 C     C
    #> 14     4 D     D
    #> 15     5 E     E

    df_ded <- df |>
      distinct()
    df_ded
    #> # A tibble: 10 × 3
    #>        x y     z
    #>    <int> <chr> <chr>
    #>  1     1 A     h
    #>  2     2 B     j
    #>  3     3 C     k
    #>  4     4 D     t
    #>  5     5 E     u
    #>  6     1 A     A
    #>  7     2 B     B
    #>  8     3 C     C
    #>  9     4 D     D
    #> 10     5 E     E

    ## I want to deduplicate only the rows with x==3 and z=="k"
    df_ded_partial <- df |>
      distinct(x == 3, z == "k") ## but this is not what I mean.
    ## How to achieve it?
    df_ded_partial
    #> # A tibble: 3 × 2
    #>   `x == 3` `z == "k"`
    #>   <lgl>    <lgl>
    #> 1 FALSE    FALSE
    #> 2 TRUE     TRUE
    #> 3 TRUE     FALSE

<sup>Created on 2023-02-14 with reprex v2.0.2</sup>
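
A possible reading of why the attempt returns a 3 × 2 tibble: when distinct() is given expressions rather than bare column names, it evaluates them as new columns and keeps only the unique combinations of those columns. The sketch below spells that out with an explicit mutate(); the column names is_x3 and is_zk are made up for illustration.

```r
library(dplyr)

df <- tibble(
  x = rep(seq(5), 3),
  y = rep(LETTERS[1:5], 3),
  z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5])
)

# is_x3 and is_zk are hypothetical names, used only for illustration.
flags <- df |>
  mutate(is_x3 = x == 3, is_zk = z == "k") |>  # compute the two logical flags
  distinct(is_x3, is_zk)                       # keep unique flag combinations only

flags
```

Only the flag columns survive, so the original x, y, and z values are gone, which is why this call cannot express "deduplicate these rows and leave the rest alone".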

Answer 1

Score: 5

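We can use group_modify() and check the condition with the .y argument, which is a one-row tibble holding the key values (here x and z) of the current group. The .x argument holds that group's rows without the grouping columns. So we can say: if the condition is met, return distinct(.x); otherwise return the whole group .x unchanged.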

    library(dplyr)

    df |>
      group_by(x, z) |>
      group_modify(~ if (.y$x == 3 && .y$z == "k") distinct(.x) else .x)
    #> # A tibble: 14 x 3
    #> # Groups:   x, z [10]
    #>        x z     y
    #>    <int> <chr> <chr>
    #>  1     1 A     A
    #>  2     1 h     A
    #>  3     1 h     A
    #>  4     2 B     B
    #>  5     2 j     B
    #>  6     2 j     B
    #>  7     3 C     C
    #>  8     3 k     C
    #>  9     4 D     D
    #> 10     4 t     D
    #> 11     4 t     D
    #> 12     5 E     E
    #> 13     5 u     E
    #> 14     5 u     E
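
Note that this result is still grouped by x and z, and the grouping columns have moved to the front. A minimal sketch, assuming you would rather get back an ungrouped tibble with the original column order (rows stay sorted by group, not in their original order):

```r
library(dplyr)

df <- tibble(
  x = rep(seq(5), 3),
  y = rep(LETTERS[1:5], 3),
  z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5])
)

df_ded_partial <- df |>
  group_by(x, z) |>
  # .y is the one-row key tibble; .x holds the group's remaining columns
  group_modify(~ if (.y$x == 3 && .y$z == "k") distinct(.x) else .x) |>
  ungroup() |>       # drop the grouping left over from group_by()
  select(x, y, z)    # restore the original column order

df_ded_partial
```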

Data from OP

    df <- tibble(x = rep(seq(5), 3), y = rep(LETTERS[1:5], 3),
                 z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5])
    )

<sup>Created on 2023-02-14 by the reprex package (v2.0.1)</sup>
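
If the original row order matters, one alternative (a different technique from the group_modify() approach, offered only as a sketch) is to flag exact repeats with base R's duplicated() and drop a row only when it is both a repeat and matches the condition. The helper vector is_repeat is a name introduced here for illustration.

```r
library(dplyr)

df <- tibble(
  x = rep(seq(5), 3),
  y = rep(LETTERS[1:5], 3),
  z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5])
)

# TRUE for every row that exactly repeats an earlier row (base R).
# is_repeat is a hypothetical helper name, not part of the original answer.
is_repeat <- duplicated(df)

# Drop a row only if it is a repeat AND it has x == 3 and z == "k".
df_ded_partial <- df |>
  filter(!(x == 3 & z == "k" & is_repeat))

df_ded_partial
```

This keeps the first occurrence of the x == 3, z == "k" row, leaves every other duplicate in place, and returns an ungrouped tibble in the original row and column order.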

huangapple
  • Posted on 2023-02-14 19:07:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75446942.html