How to subset duplicates on the earliest date among a group?
Question
I have a `data.frame` with multiple events per individual (`id`). Duplicated rows have already been removed.
df <- data.frame(id=as.integer(c(123,123,123,124,124,124,125,125,125,126,126,126)),
                 date=as.Date(c("2014-03-12", "2014-03-12", "2015-09-16", 
                                "2015-10-24", "2016-12-11", "2016-12-11", 
                                "2017-08-06", "2017-11-26", "2018-01-29", 
                                "2015-09-16", "2015-09-16", "2015-09-16")),
                 fruit=as.character(c("Apple", "Orange", "Passion fruit", "Banana", 
                                      "Lemon", "Strawberry",  "Banana", "Apple",
                                      "Passion fruit", "Orange", "Blueberry", "Pineapple")),
                 row=rep(c(1, 2, 3)))        
        id       date         fruit row
    1  123 2014-03-12         Apple   1
    2  123 2014-03-12        Orange   2
    3  123 2015-09-16 Passion fruit   3
    4  124 2015-10-24        Banana   1
    5  124 2016-12-11         Lemon   2
    6  124 2016-12-11    Strawberry   3
    7  125 2017-08-06        Banana   1
    8  125 2017-11-26         Apple   2
    9  125 2018-01-29 Passion fruit   3
    10 126 2015-09-16        Orange   1
    11 126 2015-09-16     Blueberry   2
    12 126 2015-09-16     Pineapple   3
For each individual, I need to keep the rows for the earliest date only when that date is duplicated. That is, if the earliest date occurs more than once, I need to keep all of its occurrences; individuals whose earliest date occurs only once should be dropped.
Desired Output:
df
    id       date         fruit row
1  123 2014-03-12         Apple   1
2  123 2014-03-12        Orange   2
3  126 2015-09-16        Orange   1
4  126 2015-09-16     Blueberry   2
5  126 2015-09-16     Pineapple   3
Answer 1
Score: 2
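We can group by `id`, keep the rows where `date` equals the group's `min(date)`, and use `duplicated()` in both directions (forward and with `fromLast = TRUE`) so that every copy of a repeated date is flagged, not just the second and later ones.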
library(dplyr) # the .by argument requires dplyr >= 1.1.0
df %>%
  filter(date == min(date) &
           (duplicated(date) | duplicated(date, fromLast = TRUE)),
         .by = id)
Output:
   id       date     fruit row
1 123 2014-03-12     Apple   1
2 123 2014-03-12    Orange   2
3 126 2015-09-16    Orange   1
4 126 2015-09-16 Blueberry   2
5 126 2015-09-16 Pineapple   3
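To see why `duplicated()` is called twice in the filter above, here is a minimal sketch on a single group's dates (the vector `x` and the name `all_copies` are illustrative, not part of the original answer). On its own, `duplicated()` marks only the second and later copies, so the first occurrence would be lost without the `fromLast = TRUE` pass.

```r
# Dates for one individual: the earliest date appears twice
x <- as.Date(c("2014-03-12", "2014-03-12", "2015-09-16"))

duplicated(x)                   # FALSE  TRUE FALSE  (first copy is not flagged)
duplicated(x, fromLast = TRUE)  #  TRUE FALSE FALSE  (last copy is not flagged)

# Combining both directions flags every copy of a repeated value
all_copies <- duplicated(x) | duplicated(x, fromLast = TRUE)
all_copies                      #  TRUE  TRUE FALSE
```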
Answer 2
Score: 2
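Here's another approach using dplyr: `slice_min()` keeps the rows with the earliest `date` for each `id` (ties are kept by default), and `filter(n() >= 2, ...)` then drops the groups where that date occurs only once.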
library(dplyr) # version >= 1.1.0
df %>% 
  slice_min(date, by = id) %>% 
  filter(n() >= 2, .by = c(id, date))
#>    id       date     fruit row
#> 1 123 2014-03-12     Apple   1
#> 2 123 2014-03-12    Orange   2
#> 3 126 2015-09-16    Orange   1
#> 4 126 2015-09-16 Blueberry   2
#> 5 126 2015-09-16 Pineapple   3
Created on 2023-05-25 with reprex v2.0.2
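If dplyr is not available, the same two-step idea (find the earliest date per `id`, then keep it only when it occurs at least twice) could be sketched in base R with `ave()`. This is a swapped-in technique rather than part of the answer above, and the helper names `first_date`, `is_first`, `n_first`, and `res` are illustrative.

```r
# Example data from the question (typo in "Blueberry" corrected)
df <- data.frame(
  id    = rep(c(123L, 124L, 125L, 126L), each = 3),
  date  = as.Date(c("2014-03-12", "2014-03-12", "2015-09-16",
                    "2015-10-24", "2016-12-11", "2016-12-11",
                    "2017-08-06", "2017-11-26", "2018-01-29",
                    "2015-09-16", "2015-09-16", "2015-09-16")),
  fruit = c("Apple", "Orange", "Passion fruit", "Banana",
            "Lemon", "Strawberry", "Banana", "Apple",
            "Passion fruit", "Orange", "Blueberry", "Pineapple"),
  row   = rep(1:3, times = 4)
)

# Earliest date within each id, repeated for every row of that id
# (dates are converted to numbers so ave() returns a plain numeric vector)
first_date <- ave(as.numeric(df$date), df$id, FUN = min)
is_first   <- as.numeric(df$date) == first_date

# How many rows of each id fall on that earliest date
n_first <- ave(as.numeric(is_first), df$id, FUN = sum)

# Keep earliest-date rows only when that date occurs at least twice
res <- df[is_first & n_first >= 2, ]
res  # original rows 1, 2, 10, 11, 12 (ids 123 and 126)
```

Unlike the dplyr pipelines, base subsetting keeps the original row names (1, 2, 10, 11, 12); `rownames(res) <- NULL` resets them if a 1 to 5 index is preferred.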