只保留R中每个组中的最后一个重复项。

huangapple go评论70阅读模式
英文:

Saving only the last duplicate by group in R

问题

我有一个示例数据集:

example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d","e", "f", "g",
           "h", "i", "j", "k", "l"))

有一些subdate列中有重复的条目(例如1102,1103)。如果在date列中有重复的条目,我只想保留每个sub的最底部行。这是我想要从这个数据框中得到的两个示例输出。

输出1: 一个数据框,其中每个sub都有唯一的日期

output1 <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/3/22",
           "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102,
          1103, 1103, 1103),
  text = c("a", "b", "c",
           "d", "f", "g",
           "j", "k", "l")
)

输出2: 一个数据框,其中包含所有subdate的条目,其中存在多个副本。

output2 <- data.frame(
  date = c("6/2/22", "6/2/22",
           "6/2/22", "6/2/22", "6/2/22"),
  sub = c(1102, 1102,
          1103, 1103, 1103),
  text = c("e", "f",
           "h", "i", "j")
)

我已经看到使用distinct()来解决这个问题的解决方案,但通常只保留具有重复值的第一行。我想要的是最新的值(例如,重复条目的底部行)。有谁知道如何做到这一点吗?非常感谢!

英文:

I have an example dataset:

example &lt;- data.frame(
  date = c(&quot;6/1/22&quot;, &quot;6/2/22&quot;, &quot;6/3/22&quot;,
           &quot;6/1/22&quot;, &quot;6/2/22&quot;, &quot;6/2/22&quot;, &quot;6/3/22&quot;,
           &quot;6/2/22&quot;, &quot;6/2/22&quot;, &quot;6/2/22&quot;, &quot;6/3/22&quot;, &quot;6/4/22&quot;),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;,
           &quot;d&quot;,&quot;e&quot;, &quot;f&quot;, &quot;g&quot;,
           &quot;h&quot;, &quot;i&quot;, &quot;j&quot;, &quot;k&quot;, &quot;l&quot;))

There are some subs that have repeated entries for some dates (e.g. 1102, 1103). I want to keep ONLY the bottom most row for each sub if there are duplicated entries in the date column. These are the two example outputs I want from this dataframe.

Output 1: A dataframe where there are unique dates for each sub

output1 &lt;- data.frame(
  date = c(&quot;6/1/22&quot;, &quot;6/2/22&quot;, &quot;6/3/22&quot;,
           &quot;6/1/22&quot;, &quot;6/2/22&quot;, &quot;6/3/22&quot;,
           &quot;6/2/22&quot;, &quot;6/3/22&quot;, &quot;6/4/22&quot;),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102,
          1103, 1103, 1103),
  text = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;,
           &quot;d&quot;,&quot;f&quot;, &quot;g&quot;,
           &quot;j&quot;, &quot;k&quot;, &quot;l&quot;)
)

Output 2: A dataframe with ALL entries of the subs and dates where there are multiple copies.

output2 &lt;- data.frame(
  date = c(&quot;6/2/22&quot;, &quot;6/2/22&quot;,
           &quot;6/2/22&quot;, &quot;6/2/22&quot;, &quot;6/2/22&quot;),
  sub = c(1102, 1102,
          1103, 1103, 1103),
  text = c(&quot;e&quot;, &quot;f&quot;,
           &quot;h&quot;, &quot;i&quot;, &quot;j&quot;)
)

I have seen solutions for this using distinct(), but that usually only keeps the first row with the duplicated value. I would like the latest value (e.g. the bottom most row of the duplicated entry). Does anyone know how to do this? Thank you so much!

答案1

得分: 2

任务 1

使用 dplyr,你可以通过 subdate 来使用 slice_tail()

example %>%
  slice_tail(by = c(sub, date))

#     date  sub text
# 1 6/1/22 1101    a
# 2 6/2/22 1101    b
# 3 6/3/22 1101    c
# 4 6/1/22 1102    d
# 5 6/2/22 1102    f
# 6 6/3/22 1102    g
# 7 6/2/22 1103    j
# 8 6/3/22 1103    k
# 9 6/4/22 1103    l

任务 2

使用 filter(n() > 1) 来按照 subdate 进行筛选:

example %>%
  filter(n() > 1, .by = c(sub, date))

#     date  sub text
# 1 6/2/22 1102    e
# 2 6/2/22 1102    f
# 3 6/2/22 1103    h
# 4 6/2/22 1103    i
# 5 6/2/22 1103    j
英文:
Task 1

With dplyr, you can use slice_tail() by sub and date:

example %&gt;%
  slice_tail(by = c(sub, date))

#     date  sub text
# 1 6/1/22 1101    a
# 2 6/2/22 1101    b
# 3 6/3/22 1101    c
# 4 6/1/22 1102    d
# 5 6/2/22 1102    f
# 6 6/3/22 1102    g
# 7 6/2/22 1103    j
# 8 6/3/22 1103    k
# 9 6/4/22 1103    l

Task 2

Use filter(n() &gt; 1) by sub and date:

example %&gt;%
  filter(n() &gt; 1, .by = c(sub, date))

#     date  sub text
# 1 6/2/22 1102    e
# 2 6/2/22 1102    f
# 3 6/2/22 1103    h
# 4 6/2/22 1103    i
# 5 6/2/22 1103    j

答案2

得分: 1

以下是您要翻译的内容:

They key is to use fromLast = TRUE in the base duplicated() function.

dplyr approach

As you've tagged [tag:dplyr], here is an approach using that package. First, create a column indicating whether each date is duplicated, starting from the last date by group:

library(dplyr)
example  &lt;- example |&gt;
        group_by(sub) |&gt;
        mutate(duplicate_date = duplicated(date, fromLast = TRUE))

Then for the first output simply filter those case where there is a duplicated date, by sub:

output_1  &lt;- example |&gt;
        filter(!duplicate_date ) |&gt;
        select(-duplicate_date)

output_1
#   date     sub text
#   &lt;chr&gt;  &lt;dbl&gt; &lt;chr&gt;
# 1 6/1/22  1101 a
# 2 6/2/22  1101 b
# 3 6/3/22  1101 c
# 4 6/1/22  1102 d
# 5 6/2/22  1102 f
# 6 6/3/22  1102 g
# 7 6/2/22  1103 j
# 8 6/3/22  1103 k
# 9 6/4/22  1103 l

For the second, group by date and sub and filter the cases where there are any duplicate dates, by group:

output_2  &lt;- example |&gt;
        group_by(sub, date) |&gt;
        filter(any(duplicate_date)) |&gt;
        select(-duplicate_date)

output2
#     date  sub text
# 1 6/2/22 1102    e
# 2 6/2/22 1102    f
# 3 6/2/22 1103    h
# 4 6/2/22 1103    i
# 5 6/2/22 1103    j

Just to confirm these are the same as the output1 and output2 you posted (you need to convert to data.frame first or they will not be identical as dplyr produces a tbl_df:

identical(data.frame(output_1), output1) # TRUE
identical(data.frame(output_2), output2) # TRUE

base R approach

If you want to use base R, here is a method that does essentially the same thing, using tapply(). Output 1:

example  &lt;- within(example, duplicate_date  &lt;- unlist(
    tapply(date, sub, \(x) duplicated(x, fromLast = TRUE))
))
output_1  &lt;- with(example, example[!duplicate_date,])

Output 2:

duplicate_sub_date &lt;- with(example, 
    tapply(duplicate_date, list(date, sub), any)
)
output_2  &lt;- example[diag(duplicate_sub_date[example$date, as.character(example$sub)]), ]

This is basically the same logic as the dplyr approach and produces the same output, though you get a data.frame rather than a tbl_df.

英文:

They key is to use fromLast = TRUE in the base duplicated() function.

dplyr approach

As you've tagged [tag:dplyr], here is an approach using that package. First, create a column indicating whether each date is duplicated, starting from the last date by group:

library(dplyr)
example  &lt;- example |&gt;
        group_by(sub) |&gt;
        mutate(duplicate_date = duplicated(date, fromLast = TRUE))

Then for the first output simply filter those case where there is a duplicated date, by sub:

output_1  &lt;- example |&gt;
        filter(!duplicate_date ) |&gt;
        select(-duplicate_date)

output_1
#   date     sub text
#   &lt;chr&gt;  &lt;dbl&gt; &lt;chr&gt;
# 1 6/1/22  1101 a
# 2 6/2/22  1101 b
# 3 6/3/22  1101 c
# 4 6/1/22  1102 d
# 5 6/2/22  1102 f
# 6 6/3/22  1102 g
# 7 6/2/22  1103 j
# 8 6/3/22  1103 k
# 9 6/4/22  1103 l

For the second, group by date and sub and filter the cases where there are any duplicate dates, by group:

output_2  &lt;- example |&gt;
        group_by(sub, date) |&gt;
        filter(any(duplicate_date)) |&gt;
        select(-duplicate_date)

output2
#     date  sub text
# 1 6/2/22 1102    e
# 2 6/2/22 1102    f
# 3 6/2/22 1103    h
# 4 6/2/22 1103    i
# 5 6/2/22 1103    j

Just to confirm these are the same as the output1 and output2 you posted (you need to convert to data.frame first or they will not be identical as dplyr produces a tbl_df:

identical(data.frame(output_1), output1) # TRUE
identical(data.frame(output_2), output2) # TRUE

base R approach

If you want to use base R, here is a method that does essentially the same thing, using tapply(). Output 1:

example  &lt;- within(example, duplicate_date  &lt;- unlist(
    tapply(date, sub, \(x) duplicated(x, fromLast = TRUE))
))
output_1  &lt;- with(example, example[!duplicate_date,])

Output 2:

duplicate_sub_date &lt;- with(example, 
    tapply(duplicate_date, list(date, sub), any)
)
output_2  &lt;- example[diag(duplicate_sub_date[example$date, as.character(example$sub)]), ]

This is basically the same logic as the dplyr approach and produces the same output, though you get a data.frame rather than a tbl_df.

huangapple
  • 本文由 发表于 2023年6月5日 15:04:04
  • 转载请务必保留本文链接:https://go.coder-hub.com/76404153.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定