How to subset duplicates on the earliest date among a group?

Question

I have a data.frame with multiple events by individual (id). Duplicated rows have been previously removed.

df <- data.frame(id=as.integer(c(123,123,123,124,124,124,125,125,125,126,126,126)),
                 date=as.Date(c("2014-03-12", "2014-03-12", "2015-09-16",
                                "2015-10-24", "2016-12-11", "2016-12-11",
                                "2017-08-06", "2017-11-26", "2018-01-29",
                                "2015-09-16", "2015-09-16", "2015-09-16")),
                 fruit=as.character(c("Apple", "Orange", "Passion fruit", "Banana",
                                      "Lemon", "Strawberry", "Banana", "Apple",
                                      "Passion fruit", "Orange", "Blueberry", "Pineapple")),
                 row=rep(c(1, 2, 3), times=4))

        id       date         fruit row
    1  123 2014-03-12         Apple   1
    2  123 2014-03-12        Orange   2
    3  123 2015-09-16 Passion fruit   3
    4  124 2015-10-24        Banana   1
    5  124 2016-12-11         Lemon   2
    6  124 2016-12-11    Strawberry   3
    7  125 2017-08-06        Banana   1
    8  125 2017-11-26         Apple   2
    9  125 2018-01-29 Passion fruit   3
    10 126 2015-09-16        Orange   1
    11 126 2015-09-16     Blueberry   2
    12 126 2015-09-16     Pineapple   3

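For each individual, I need to keep the rows on the earliest date, but only when that earliest date occurs more than once; in that case all of its occurrences must be kept. Individuals whose earliest date appears only once (124 and 125 here) should be dropped entirely.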

Desired Output:

df
    id       date         fruit row
1  123 2014-03-12         Apple   1
2  123 2014-03-12        Orange   2
3  126 2015-09-16        Orange   1
4  126 2015-09-16     Blueberry   2
5  126 2015-09-16     Pineapple   3

Answer 1

Score: 2

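We can group by id, keep the rows where date equals the group minimum, and use duplicated() in both directions (forward and with fromLast = TRUE) so that every copy of a repeated date is flagged, not only the later ones.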

library(dplyr) # the .by argument requires dplyr >= 1.1.0
df %>%
  filter(date == min(date) & (duplicated(date) |
    duplicated(date, fromLast = TRUE)), .by = id)

Output:

   id       date     fruit row
1 123 2014-03-12     Apple   1
2 123 2014-03-12    Orange   2
3 126 2015-09-16    Orange   1
4 126 2015-09-16 Blueberry   2
5 126 2015-09-16 Pineapple   3
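
To see why duplicated() is called twice, here is a minimal base R sketch on the three dates of id 123 from the question. The vector name x and the helper variable either are only illustrative and do not appear in the answer.

```R
# Dates of id 123 from the question's example data
x <- as.Date(c("2014-03-12", "2014-03-12", "2015-09-16"))

# Scanning forward, only the second copy is flagged
forward <- duplicated(x)                    # FALSE  TRUE FALSE

# Scanning from the end, only the first copy is flagged
backward <- duplicated(x, fromLast = TRUE)  #  TRUE FALSE FALSE

# Combining both flags every element that has a twin
either <- forward | backward                #  TRUE  TRUE FALSE
```

Combined with date == min(date), this keeps both rows of 2014-03-12 and drops the single 2015-09-16 row.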

Answer 2

Score: 2

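Here's another attempt using dplyr: first keep the rows on the earliest date per id, then keep only the (id, date) groups that have at least two rows.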

library(dplyr) # version >= 1.1.0

df %>%
  slice_min(date, by = id) %>%          # earliest date per id; ties are kept by default
  filter(n() >= 2, .by = c(id, date))   # keep only dates that occur at least twice

#>    id       date     fruit row
#> 1 123 2014-03-12     Apple   1
#> 2 123 2014-03-12    Orange   2
#> 3 126 2015-09-16    Orange   1
#> 4 126 2015-09-16 Blueberry   2
#> 5 126 2015-09-16 Pineapple   3

Created on 2023-05-25 with reprex v2.0.2
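
If dplyr is not available, the same two steps (earliest date per id, then at least two rows on that date) can be sketched in base R with ave(). This is not part of either answer, only an assumed equivalent; the names d, is_earliest, n_same and result are illustrative.

```R
# Rebuild the example data from the question
df <- data.frame(
  id = as.integer(c(123, 123, 123, 124, 124, 124, 125, 125, 125, 126, 126, 126)),
  date = as.Date(c("2014-03-12", "2014-03-12", "2015-09-16",
                   "2015-10-24", "2016-12-11", "2016-12-11",
                   "2017-08-06", "2017-11-26", "2018-01-29",
                   "2015-09-16", "2015-09-16", "2015-09-16")),
  fruit = c("Apple", "Orange", "Passion fruit", "Banana",
            "Lemon", "Strawberry", "Banana", "Apple",
            "Passion fruit", "Orange", "Blueberry", "Pineapple"),
  row = rep(c(1, 2, 3), times = 4)
)

# Work on the numeric day count so ave() handles plain numbers
d <- as.numeric(df$date)

# TRUE for rows that fall on the earliest date of their id
is_earliest <- d == ave(d, df$id, FUN = min)

# Number of rows sharing the same (id, date) pair
n_same <- ave(d, df$id, d, FUN = length)

# Keep earliest-date rows only when that date occurs at least twice
result <- df[is_earliest & n_same >= 2, ]
result
```

Unlike the dplyr versions, this keeps the original row names (1, 2, 10, 11, 12); call rownames(result) <- NULL if they should be renumbered.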

huangapple
  • Published on 2023-05-25 17:07:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76330594.html