How to subset duplicates on the earliest date among a group?
Question
I have a `data.frame` with multiple events per individual (`id`). Duplicated rows have been previously removed.
df <- data.frame(
  id = as.integer(c(123, 123, 123, 124, 124, 124, 125, 125, 125, 126, 126, 126)),
  date = as.Date(c("2014-03-12", "2014-03-12", "2015-09-16",
                   "2015-10-24", "2016-12-11", "2016-12-11",
                   "2017-08-06", "2017-11-26", "2018-01-29",
                   "2015-09-16", "2015-09-16", "2015-09-16")),
  fruit = c("Apple", "Orange", "Passion fruit", "Banana",
            "Lemon", "Strawberry", "Banana", "Apple",
            "Passion fruit", "Orange", "Blueberry", "Pineapple"),
  row = rep(c(1, 2, 3), times = 4)  # event number within each id
)
    id       date         fruit row
1  123 2014-03-12         Apple   1
2  123 2014-03-12        Orange   2
3  123 2015-09-16 Passion fruit   3
4  124 2015-10-24        Banana   1
5  124 2016-12-11         Lemon   2
6  124 2016-12-11    Strawberry   3
7  125 2017-08-06        Banana   1
8  125 2017-11-26         Apple   2
9  125 2018-01-29 Passion fruit   3
10 126 2015-09-16        Orange   1
11 126 2015-09-16     Blueberry   2
12 126 2015-09-16     Pineapple   3
For each individual, I need to keep the rows for the earliest date, but only when that earliest date occurs more than once; in that case, all of its occurrences must be kept. Individuals whose earliest date occurs only once are dropped.
Desired Output:
df
   id       date     fruit row
1 123 2014-03-12     Apple   1
2 123 2014-03-12    Orange   2
3 126 2015-09-16    Orange   1
4 126 2015-09-16 Blueberry   2
5 126 2015-09-16 Pineapple   3
Answer 1
Score: 2
We could group by `id`, build a condition with `min()` to find the earliest date, and use `duplicated()` in both directions to keep only dates that occur more than once:
library(dplyr)  # .by requires dplyr >= 1.1.0
df %>%
  filter(date == min(date) &
           (duplicated(date) | duplicated(date, fromLast = TRUE)),
         .by = id)
Output:
   id       date     fruit row
1 123 2014-03-12     Apple   1
2 123 2014-03-12    Orange   2
3 126 2015-09-16    Orange   1
4 126 2015-09-16 Blueberry   2
5 126 2015-09-16 Pineapple   3
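To see why the answer combines two `duplicated()` calls, here is a minimal base R sketch on a toy vector. The names `x`, `later`, `earlier`, and `all_dups` are illustrative only and do not appear in the original answer.

```R
# Toy vector: "a" occurs twice, "b" once, "c" three times
x <- c("a", "a", "b", "c", "c", "c")

# Flags only the second and later occurrences of each value
later <- duplicated(x)

# Scanning from the end flags every occurrence except the last one
earlier <- duplicated(x, fromLast = TRUE)

# The union flags every element whose value occurs more than once
all_dups <- later | earlier

print(all_dups)
```

Neither call alone marks the full set of repeated values, which is why the filter condition needs both.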
Answer 2
Score: 2
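Here's another attempt using `dplyr`: `slice_min()` keeps every row tied for the earliest date within each `id` (ties are kept by default), and `filter(n() >= 2)` then drops the groups in which that date occurs only once.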
library(dplyr) # version >= 1.1.0
df %>%
  slice_min(date, by = id) %>%          # earliest date per id, ties kept
  filter(n() >= 2, .by = c(id, date))   # keep it only if it occurs at least twice
#>    id       date     fruit row
#> 1 123 2014-03-12     Apple   1
#> 2 123 2014-03-12    Orange   2
#> 3 126 2015-09-16    Orange   1
#> 4 126 2015-09-16 Blueberry   2
#> 5 126 2015-09-16 Pineapple   3
<sup>Created on 2023-05-25 with reprex v2.0.2</sup>
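For readers without `dplyr`, the same two steps could be sketched in base R with `ave()`. This is not part of either answer; it is an assumed equivalent, and the helper names `first_date`, `earliest`, `n_rows`, and `result` are made up for illustration. The data frame is rebuilt here so the block runs on its own.

```R
# Sample data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  id = as.integer(rep(c(123, 124, 125, 126), each = 3)),
  date = as.Date(c("2014-03-12", "2014-03-12", "2015-09-16",
                   "2015-10-24", "2016-12-11", "2016-12-11",
                   "2017-08-06", "2017-11-26", "2018-01-29",
                   "2015-09-16", "2015-09-16", "2015-09-16")),
  fruit = c("Apple", "Orange", "Passion fruit", "Banana",
            "Lemon", "Strawberry", "Banana", "Apple",
            "Passion fruit", "Orange", "Blueberry", "Pineapple"),
  row = rep(c(1, 2, 3), times = 4)
)

# Step 1: earliest date within each id, repeated for every row of that id
first_date <- ave(as.numeric(df$date), df$id, FUN = min)

# Keep only the rows that fall on their id's earliest date
earliest <- df[as.numeric(df$date) == first_date, ]

# Step 2: number of rows each id has on that earliest date
n_rows <- ave(seq_along(earliest$id), earliest$id, FUN = length)

# Keep only ids whose earliest date occurs at least twice
result <- earliest[n_rows >= 2, ]
rownames(result) <- NULL

print(result)
```

The dates are converted with `as.numeric()` before grouping only to keep the comparison independent of how `ave()` handles the `Date` class.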