Saving only the last duplicate by group in R
Question
I have an example dataset:
example <- data.frame(
date = c("6/1/22", "6/2/22", "6/3/22",
"6/1/22", "6/2/22", "6/2/22", "6/3/22",
"6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
sub = c(1101, 1101, 1101,
1102, 1102, 1102, 1102,
1103, 1103, 1103, 1103, 1103),
text = c("a", "b", "c",
"d","e", "f", "g",
"h", "i", "j", "k", "l"))
There are some subs that have repeated entries for some dates (e.g. 1102, 1103). I want to keep ONLY the bottom-most row for each sub if there are duplicated entries in the date column. These are the two example outputs I want from this dataframe.
Output 1: A dataframe where there are unique dates for each sub
output1 <- data.frame(
date = c("6/1/22", "6/2/22", "6/3/22",
"6/1/22", "6/2/22", "6/3/22",
"6/2/22", "6/3/22", "6/4/22"),
sub = c(1101, 1101, 1101,
1102, 1102, 1102,
1103, 1103, 1103),
text = c("a", "b", "c",
"d","f", "g",
"j", "k", "l")
)
Output 2: A dataframe with ALL entries of the subs and dates where there are multiple copies.
output2 <- data.frame(
date = c("6/2/22", "6/2/22",
"6/2/22", "6/2/22", "6/2/22"),
sub = c(1102, 1102,
1103, 1103, 1103),
text = c("e", "f",
"h", "i", "j")
)
I have seen solutions for this using distinct(), but that usually only keeps the first row with the duplicated value. I would like the latest value (e.g. the bottom-most row of the duplicated entry). Does anyone know how to do this? Thank you so much!
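For reference, the behaviour being asked for (keep the last row per (sub, date) pair) can be sketched in a single base-R line, since duplicated() with fromLast = TRUE flags every occurrence except the last:

```r
# Data from the question, in compact form
example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub  = rep(c(1101, 1102, 1103), times = c(3, 4, 5)),
  text = letters[1:12])

# duplicated() with fromLast = TRUE flags every occurrence of a
# (sub, date) pair except its last, so negating it keeps the
# bottom-most row of each pair, in the original row order:
keep_last <- example[!duplicated(example[c("sub", "date")], fromLast = TRUE), ]
keep_last$text
# "a" "b" "c" "d" "f" "g" "j" "k" "l"
```

This matches output1 row for row (the retained row names differ from a freshly built data frame, so compare the columns rather than using identical() on the whole object).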
Answer 1
Score: 2
Task 1
With dplyr, you can use slice_tail() by sub and date:
library(dplyr)
example %>%
slice_tail(by = c(sub, date))
# date sub text
# 1 6/1/22 1101 a
# 2 6/2/22 1101 b
# 3 6/3/22 1101 c
# 4 6/1/22 1102 d
# 5 6/2/22 1102 f
# 6 6/3/22 1102 g
# 7 6/2/22 1103 j
# 8 6/3/22 1103 k
# 9 6/4/22 1103 l
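Note that the per-operation `by` argument used above was added in dplyr 1.1.0. On older dplyr versions, a sketch of the same result with an explicit group_by() (assuming the example data from the question; a grouped slice returns rows ordered by the grouping keys, which happens to coincide with the original order here):

```r
library(dplyr)

# Data from the question, in compact form
example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub  = rep(c(1101, 1102, 1103), times = c(3, 4, 5)),
  text = letters[1:12])

# Same idea as slice_tail(by = ...), but with explicit grouping
task1_old <- example %>%
  group_by(sub, date) %>%
  slice_tail(n = 1) %>%   # keep the last row of each (sub, date) group
  ungroup()
```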
Task 2
Use filter(n() > 1) by sub and date:
example %>%
filter(n() > 1, .by = c(sub, date))
# date sub text
# 1 6/2/22 1102 e
# 2 6/2/22 1102 f
# 3 6/2/22 1103 h
# 4 6/2/22 1103 i
# 5 6/2/22 1103 j
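The `.by` argument in filter() is likewise a dplyr 1.1.0 addition; on older versions the same filter can be written with group_by() (a sketch, again assuming the question's example data; a grouped filter() keeps rows in their original order):

```r
library(dplyr)

# Data from the question, in compact form
example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub  = rep(c(1101, 1102, 1103), times = c(3, 4, 5)),
  text = letters[1:12])

# Keep all rows belonging to a (sub, date) group of size > 1
task2_old <- example %>%
  group_by(sub, date) %>%
  filter(n() > 1) %>%
  ungroup()
```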
Answer 2
Score: 1
The key is to use fromLast = TRUE in the base duplicated() function.
dplyr approach
As you've tagged [tag:dplyr], here is an approach using that package. First, create a column indicating whether each date is duplicated, starting from the last date by group:
library(dplyr)
example <- example |>
group_by(sub) |>
mutate(duplicate_date = duplicated(date, fromLast = TRUE))
Then for the first output, simply keep the cases where the date is not flagged as a duplicate, by sub:
output_1 <- example |>
filter(!duplicate_date) |>
select(-duplicate_date)
output_1
# date sub text
# <chr> <dbl> <chr>
# 1 6/1/22 1101 a
# 2 6/2/22 1101 b
# 3 6/3/22 1101 c
# 4 6/1/22 1102 d
# 5 6/2/22 1102 f
# 6 6/3/22 1102 g
# 7 6/2/22 1103 j
# 8 6/3/22 1103 k
# 9 6/4/22 1103 l
For the second, group by date and sub and filter the cases where there are any duplicate dates, by group:
output_2 <- example |>
group_by(sub, date) |>
filter(any(duplicate_date)) |>
select(-duplicate_date)
output_2
# date sub text
# 1 6/2/22 1102 e
# 2 6/2/22 1102 f
# 3 6/2/22 1103 h
# 4 6/2/22 1103 i
# 5 6/2/22 1103 j
Just to confirm, these are the same as the output1 and output2 you posted (you need to convert to data.frame first or they will not be identical, as dplyr produces a tbl_df):
identical(data.frame(output_1), output1) # TRUE
identical(data.frame(output_2), output2) # TRUE
base R approach
If you want to use base R, here is a method that does essentially the same thing, using tapply(). Output 1:
example <- within(example, duplicate_date <- unlist(
tapply(date, sub, \(x) duplicated(x, fromLast = TRUE))
))
output_1 <- with(example, example[!duplicate_date,])
Output 2:
duplicate_sub_date <- with(example,
tapply(duplicate_date, list(date, sub), any)
)
output_2 <- example[diag(duplicate_sub_date[example$date, as.character(example$sub)]), ]
This is basically the same logic as the dplyr approach and produces the same output, though you get a data.frame rather than a tbl_df.