R + dplyr: Partial Deduplication of Rows in a Tibble
Question
A very common question is how to remove all duplicated rows from a data frame in R, which can be done with a variety of tools (I like dplyr with distinct()).
However, what if your dataset contains several duplicated rows and you do not want to remove all of them, only those matching some combination of the variables?
I do not know how to achieve that, so any suggestion is welcome.
Please have a look at the reprex at the end of the post.
Thanks!
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
             z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
df
#> # A tibble: 15 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A h
#> 7 2 B j
#> 8 3 C k
#> 9 4 D t
#> 10 5 E u
#> 11 1 A A
#> 12 2 B B
#> 13 3 C C
#> 14 4 D D
#> 15 5 E E
df_ded <- df |>
  distinct()
df_ded
#> # A tibble: 10 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A A
#> 7 2 B B
#> 8 3 C C
#> 9 4 D D
#> 10 5 E E
## I want to deduplicate only the rows with x==3 and z=="k"
df_ded_partial <- df |>
  distinct(x==3, z=="k") ## but this is not what I mean.
## How to achieve it?
df_ded_partial
#> # A tibble: 3 × 2
#> `x == 3` `z == "k"`
#> <lgl> <lgl>
#> 1 FALSE FALSE
#> 2 TRUE TRUE
#> 3 TRUE FALSE
<sup>Created on 2023-02-14 with reprex v2.0.2</sup>
Answer 1
Score: 5
We can use group_modify() and check the condition using the .y argument, which is a one-row tibble holding the key values (x and z) of the current group. If the condition is met, we return distinct(.x), the deduplicated rows of that group; otherwise we return the whole group .x unchanged.
library(dplyr)
df |>
  group_by(x, z) |>
  group_modify(~ if (.y$x == 3 && .y$z == "k") distinct(.x) else .x)
#> # A tibble: 14 x 3
#> # Groups: x, z [10]
#> x z y
#> <int> <chr> <chr>
#> 1 1 A A
#> 2 1 h A
#> 3 1 h A
#> 4 2 B B
#> 5 2 j B
#> 6 2 j B
#> 7 3 C C
#> 8 3 k C
#> 9 4 D D
#> 10 4 t D
#> 11 4 t D
#> 12 5 E E
#> 13 5 u E
#> 14 5 u E
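Note that the result above is still grouped by x and z, its rows are ordered by group, and the grouping columns come first. A minimal sketch, assuming a plain ungrouped tibble in the original column order is wanted, adds ungroup() and select() to the same pipeline (df_partial is a name introduced here for illustration only):

```r
library(dplyr)

# Data from the question
df <- tibble(x = rep(seq(5), 3), y = rep(LETTERS[1:5], 3),
             z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5]))

df_partial <- df |>
  group_by(x, z) |>
  # Deduplicate only the group whose keys are x == 3 and z == "k"
  group_modify(~ if (.y$x == 3 && .y$z == "k") distinct(.x) else .x) |>
  ungroup() |>        # drop the grouping so later verbs act on the whole table
  select(x, y, z)     # restore the original column order

df_partial
```

The rows remain sorted by group key rather than in their original order.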
Data from OP
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
             z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
<sup>Created on 2023-02-14 by the reprex package (v2.0.1)</sup>
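An alternative sketch, not part of the answer above, that keeps the original row and column order: flag rows that fully repeat an earlier row with base R's duplicated(), then drop a row only if it is both a duplicate and matches the condition. The names is_dup and df_partial are introduced here for illustration, and the sketch assumes x and z contain no missing values.

```r
library(dplyr)

# Data from the question
df <- tibble(x = rep(seq(5), 3), y = rep(LETTERS[1:5], 3),
             z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5]))

# TRUE for every row that is identical to an earlier row in all columns
is_dup <- duplicated(df)

# Remove a row only when it is a duplicate AND has x == 3 and z == "k"
df_partial <- df |>
  filter(!(is_dup & x == 3 & z == "k"))

df_partial
```

Because no grouping is involved, the output is an ordinary tibble with the first occurrence of the matching row kept in place.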