May 25, 2023, 17:07:50

How to subset duplicates on the earliest date among a group?

Question

I have a data.frame with multiple events per individual (id). Fully duplicated rows have already been removed.

df <- data.frame(id=as.integer(c(123,123,123,124,124,124,125,125,125,126,126,126)),
                 date=as.Date(c("2014-03-12", "2014-03-12", "2015-09-16", 
                                "2015-10-24", "2016-12-11", "2016-12-11", 
                                "2017-08-06", "2017-11-26", "2018-01-29", 
                                "2015-09-16", "2015-09-16", "2015-09-16")),
                 fruit=as.character(c("Apple", "Orange", "Passion fruit", "Banana", 
                                      "Lemon", "Strawberry",  "Banana", "Apple",
                                      "Passion fruit", "Orange", "Blueberry", "Pineapple")),
                 row=rep(c(1, 2, 3)))

        id       date         fruit row
    1  123 2014-03-12         Apple   1
    2  123 2014-03-12        Orange   2
    3  123 2015-09-16 Passion fruit   3
    4  124 2015-10-24        Banana   1
    5  124 2016-12-11         Lemon   2
    6  124 2016-12-11    Strawberry   3
    7  125 2017-08-06        Banana   1
    8  125 2017-11-26         Apple   2
    9  125 2018-01-29 Passion fruit   3
    10 126 2015-09-16        Orange   1
    11 126 2015-09-16     Blueberry   2
    12 126 2015-09-16     Pineapple   3

For each individual, I need to keep the rows at the earliest date, but only when that earliest date occurs more than once. In that case, all of its occurrences should be kept.

Desired Output:

df
    id       date         fruit row
1  123 2014-03-12         Apple   1
2  123 2014-03-12        Orange   2
3  126 2015-09-16        Orange   1
4  126 2015-09-16     Blueberry   2
5  126 2015-09-16     Pineapple   3

Answer 1

Score: 2

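We could group by 'id', build a condition with min(), and use duplicated() to check for duplicates. Calling duplicated() in both directions (the second call with fromLast = TRUE) flags every occurrence of a repeated date, not only the second and later ones.

library(dplyr) # .by requires dplyr >= 1.1.0
df %>% 
  filter(date == min(date) & (duplicated(date) |
     duplicated(date, fromLast = TRUE)), .by = id)

Output:

   id       date     fruit row
1 123 2014-03-12     Apple   1
2 123 2014-03-12    Orange   2
3 126 2015-09-16    Orange   1
4 126 2015-09-16 Blueberry   2
5 126 2015-09-16 Pineapple   3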
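
To make the grouped condition easier to inspect, here is a minimal base R sketch of the same logic. It uses ave() in place of dplyr's .by grouping, which is a swapped-in technique and not part of the answer above. It assumes the question's sample data with the fruit spelled "Blueberry"; the names flag and result are only illustrative.

```r
# Sample data from the question (assumption: "Blueberry" spelling normalized).
df <- data.frame(
  id = as.integer(rep(123:126, each = 3)),
  date = as.Date(c("2014-03-12", "2014-03-12", "2015-09-16",
                   "2015-10-24", "2016-12-11", "2016-12-11",
                   "2017-08-06", "2017-11-26", "2018-01-29",
                   "2015-09-16", "2015-09-16", "2015-09-16")),
  fruit = c("Apple", "Orange", "Passion fruit", "Banana",
            "Lemon", "Strawberry", "Banana", "Apple",
            "Passion fruit", "Orange", "Blueberry", "Pineapple"),
  row = rep(c(1, 2, 3), times = 4),
  stringsAsFactors = FALSE
)

# Evaluate the condition separately within each id.
# ave() stores the logical result as 0/1 because its input is numeric.
flag <- ave(as.numeric(df$date), df$id, FUN = function(d) {
  is_earliest <- d == min(d)
  # duplicated() alone skips the first occurrence, so check both directions.
  is_repeated <- duplicated(d) | duplicated(d, fromLast = TRUE)
  is_earliest & is_repeated
})

result <- df[flag == 1, ]
print(result)
```

Unlike the dplyr version, the base R subset keeps the original row names (1, 2, 10, 11, 12). Running rownames(result) <- NULL renumbers them if needed.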

Answer 2

Score: 2

Here's another attempt using dplyr. slice_min() keeps every row tied for the earliest date within each id (ties are kept by default), and filter() then drops the groups that have only one such row:

library(dplyr) # version >= 1.1.0

df %>% 
  slice_min(date, by = id) %>% 
  filter(n() >= 2, .by = c(id, date))

#>    id       date     fruit row
#> 1 123 2014-03-12     Apple   1
#> 2 123 2014-03-12    Orange   2
#> 3 126 2015-09-16    Orange   1
#> 4 126 2015-09-16 Blueberry   2
#> 5 126 2015-09-16 Pineapple   3

Created on 2023-05-25 with reprex v2.0.2
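
The same two-step idea (first keep the earliest rows with ties, then keep only groups with at least two rows) can be sketched in base R for readers without dplyr 1.1.0. This is an illustrative equivalent, not the answer's own code. It assumes the question's sample data with the fruit spelled "Blueberry"; earliest, n_per_id and result are hypothetical names.

```r
# Sample data from the question (assumption: "Blueberry" spelling normalized).
df <- data.frame(
  id = as.integer(rep(123:126, each = 3)),
  date = as.Date(c("2014-03-12", "2014-03-12", "2015-09-16",
                   "2015-10-24", "2016-12-11", "2016-12-11",
                   "2017-08-06", "2017-11-26", "2018-01-29",
                   "2015-09-16", "2015-09-16", "2015-09-16")),
  fruit = c("Apple", "Orange", "Passion fruit", "Banana",
            "Lemon", "Strawberry", "Banana", "Apple",
            "Passion fruit", "Orange", "Blueberry", "Pineapple"),
  row = rep(c(1, 2, 3), times = 4),
  stringsAsFactors = FALSE
)

# Step 1: keep all rows whose date equals the minimum date of their id
# (ties are kept, mirroring the first step of the answer).
d <- as.numeric(df$date)
earliest <- df[d == ave(d, df$id, FUN = min), ]

# Step 2: after step 1 every id has a single date left, so counting rows
# per id is enough to detect a repeated earliest date.
n_per_id <- ave(seq_len(nrow(earliest)), earliest$id, FUN = length)
result <- earliest[n_per_id >= 2, ]
print(result)
```

Splitting the work into two steps makes it easy to inspect earliest on its own before the single-row groups are dropped.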
