Saving only the last duplicate by group in R

Question
I have an example dataset:

    example <- data.frame(
      date = c("6/1/22", "6/2/22", "6/3/22",
               "6/1/22", "6/2/22", "6/2/22", "6/3/22",
               "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
      sub = c(1101, 1101, 1101,
              1102, 1102, 1102, 1102,
              1103, 1103, 1103, 1103, 1103),
      text = c("a", "b", "c",
               "d", "e", "f", "g",
               "h", "i", "j", "k", "l"))

Some subs have repeated entries for some dates (e.g. 1102, 1103). If there are duplicated entries in the date column, I want to keep ONLY the bottom-most row for each sub. These are the two example outputs I want from this dataframe.

Output 1: A dataframe where there are unique dates for each sub

    output1 <- data.frame(
      date = c("6/1/22", "6/2/22", "6/3/22",
               "6/1/22", "6/2/22", "6/3/22",
               "6/2/22", "6/3/22", "6/4/22"),
      sub = c(1101, 1101, 1101,
              1102, 1102, 1102,
              1103, 1103, 1103),
      text = c("a", "b", "c",
               "d", "f", "g",
               "j", "k", "l")
    )

Output 2: A dataframe with ALL entries of the subs and dates where there are multiple copies.

    output2 <- data.frame(
      date = c("6/2/22", "6/2/22",
               "6/2/22", "6/2/22", "6/2/22"),
      sub = c(1102, 1102,
              1103, 1103, 1103),
      text = c("e", "f",
               "h", "i", "j")
    )

I have seen solutions for this using distinct(), but that usually keeps only the first row with the duplicated value. I would like the latest value (i.e. the bottom-most row of the duplicated entries). Does anyone know how to do this? Thank you so much!
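To illustrate the goal described above (a minimal sketch, not part of the original question): base R's duplicated() accepts fromLast = TRUE and works directly on a data frame of key columns, which keeps exactly the bottom-most row per sub/date pair:

```r
example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d", "e", "f", "g",
           "h", "i", "j", "k", "l"))

# With fromLast = TRUE, duplicated() flags rows whose (sub, date) pair
# occurs again LATER in the data; dropping them keeps the last occurrence.
keep_last <- example[!duplicated(example[c("sub", "date")], fromLast = TRUE), ]
keep_last$text  # "a" "b" "c" "d" "f" "g" "j" "k" "l"
```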


Answer 1

Score: 2

Task 1

With dplyr, you can use slice_tail() by sub and date:

    example %>%
      slice_tail(by = c(sub, date))
    #     date  sub text
    # 1 6/1/22 1101    a
    # 2 6/2/22 1101    b
    # 3 6/3/22 1101    c
    # 4 6/1/22 1102    d
    # 5 6/2/22 1102    f
    # 6 6/3/22 1102    g
    # 7 6/2/22 1103    j
    # 8 6/3/22 1103    k
    # 9 6/4/22 1103    l
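Note that the per-operation `by` argument was added in dplyr 1.1.0. On older dplyr versions, explicit grouping gives the same result (a sketch, assuming the `example` data frame from the question):

```r
library(dplyr)

example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d", "e", "f", "g",
           "h", "i", "j", "k", "l"))

# group_by() + slice_tail(n = 1) keeps the last row of each (sub, date)
# group; ungroup() drops the grouping afterwards
result <- example %>%
  group_by(sub, date) %>%
  slice_tail(n = 1) %>%
  ungroup()
```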

Task 2

Use filter(n() > 1) by sub and date:

    example %>%
      filter(n() > 1, .by = c(sub, date))
    #     date  sub text
    # 1 6/2/22 1102    e
    # 2 6/2/22 1102    f
    # 3 6/2/22 1103    h
    # 4 6/2/22 1103    i
    # 5 6/2/22 1103    j
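Likewise, the `.by` argument to filter() requires dplyr 1.1.0; with older versions an explicit group_by() does the same job (a sketch, assuming the question's `example` data frame):

```r
library(dplyr)

example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d", "e", "f", "g",
           "h", "i", "j", "k", "l"))

# Keep all rows belonging to a (sub, date) group with more than one row;
# filter() preserves the original row order
result <- example %>%
  group_by(sub, date) %>%
  filter(n() > 1) %>%
  ungroup()
```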

Answer 2

Score: 1

The key is to use fromLast = TRUE in the base duplicated() function.

dplyr approach

As you've tagged [tag:dplyr], here is an approach using that package. First, create a column indicating whether each date is duplicated, starting from the last date by group:

    library(dplyr)
    example <- example |>
      group_by(sub) |>
      mutate(duplicate_date = duplicated(date, fromLast = TRUE))

Then for the first output, simply filter out the cases where there is a duplicated date, by sub:

    output_1 <- example |>
      filter(!duplicate_date) |>
      select(-duplicate_date)
    output_1
    #   date     sub text
    #   <chr>  <dbl> <chr>
    # 1 6/1/22  1101 a
    # 2 6/2/22  1101 b
    # 3 6/3/22  1101 c
    # 4 6/1/22  1102 d
    # 5 6/2/22  1102 f
    # 6 6/3/22  1102 g
    # 7 6/2/22  1103 j
    # 8 6/3/22  1103 k
    # 9 6/4/22  1103 l

For the second, group by date and sub and filter the cases where there are any duplicate dates, by group:

    output_2 <- example |>
      group_by(sub, date) |>
      filter(any(duplicate_date)) |>
      select(-duplicate_date)
    output_2
    #     date  sub text
    # 1 6/2/22 1102    e
    # 2 6/2/22 1102    f
    # 3 6/2/22 1103    h
    # 4 6/2/22 1103    i
    # 5 6/2/22 1103    j

Just to confirm, these are the same as the output1 and output2 you posted (you need to convert to data.frame first, or they will not be identical, as dplyr produces a tbl_df):

    identical(data.frame(output_1), output1) # TRUE
    identical(data.frame(output_2), output2) # TRUE

base R approach

If you want to use base R, here is a method that does essentially the same thing, using tapply(). Output 1:

    example <- within(example, duplicate_date <- unlist(
      tapply(date, sub, \(x) duplicated(x, fromLast = TRUE))
    ))
    output_1 <- with(example, example[!duplicate_date, ])

Output 2:

    duplicate_sub_date <- with(example,
      tapply(duplicate_date, list(date, sub), any)
    )
    output_2 <- example[diag(duplicate_sub_date[example$date, as.character(example$sub)]), ]

This is basically the same logic as the dplyr approach and produces the same output, though you get a data.frame rather than a tbl_df.
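As an alternative sketch (not from the answer above), the diag() indexing step for output 2 can be avoided with base R's ave(), which computes group sizes row-wise directly:

```r
example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d", "e", "f", "g",
           "h", "i", "j", "k", "l"))

# ave() returns, for each row, the size of its (sub, date) group;
# rows in groups of size > 1 are the duplicated entries
key <- paste(example$sub, example$date)
group_size <- ave(seq_along(key), key, FUN = length)
output_2 <- example[group_size > 1, ]
output_2$text  # "e" "f" "h" "i" "j"
```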


huangapple

  • Posted on June 5, 2023, 15:04:04
  • Please keep this link when reposting: https://go.coder-hub.com/76404153.html