Saving only the last duplicate by group in R

Question
I have an example dataset:

    example <- data.frame(
      date = c("6/1/22", "6/2/22", "6/3/22",
               "6/1/22", "6/2/22", "6/2/22", "6/3/22",
               "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
      sub = c(1101, 1101, 1101,
              1102, 1102, 1102, 1102,
              1103, 1103, 1103, 1103, 1103),
      text = c("a", "b", "c",
               "d", "e", "f", "g",
               "h", "i", "j", "k", "l"))

Some subs have repeated entries for some dates (e.g. 1102, 1103). If there are duplicated entries in the date column, I want to keep ONLY the bottom-most row for each sub. These are the two example outputs I want from this dataframe.

Output 1: A dataframe where there are unique dates for each sub

    output1 <- data.frame(
      date = c("6/1/22", "6/2/22", "6/3/22",
               "6/1/22", "6/2/22", "6/3/22",
               "6/2/22", "6/3/22", "6/4/22"),
      sub = c(1101, 1101, 1101,
              1102, 1102, 1102,
              1103, 1103, 1103),
      text = c("a", "b", "c",
               "d", "f", "g",
               "j", "k", "l")
    )

Output 2: A dataframe with ALL entries of the subs and dates where there are multiple copies.

    output2 <- data.frame(
      date = c("6/2/22", "6/2/22",
               "6/2/22", "6/2/22", "6/2/22"),
      sub = c(1102, 1102,
              1103, 1103, 1103),
      text = c("e", "f",
               "h", "i", "j")
    )

I have seen solutions for this using distinct(), but that usually keeps only the first row with the duplicated value. I would like the latest value (i.e. the bottom-most row of the duplicated entries). Does anyone know how to do this? Thank you so much!
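To illustrate the goal described above (a minimal sketch, not part of the original question): base R's duplicated() accepts fromLast = TRUE and works directly on a data frame of key columns, which keeps exactly the bottom-most row per sub/date pair:

```r
example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d", "e", "f", "g",
           "h", "i", "j", "k", "l"))

# With fromLast = TRUE, duplicated() flags rows whose (sub, date) pair
# occurs again LATER in the data; dropping them keeps the last occurrence.
keep_last <- example[!duplicated(example[c("sub", "date")], fromLast = TRUE), ]
keep_last$text  # "a" "b" "c" "d" "f" "g" "j" "k" "l"
```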


Answer 1

Score: 2

Task 1

With dplyr, you can use slice_tail() by sub and date:

    example %>%
      slice_tail(by = c(sub, date))
    #     date  sub text
    # 1 6/1/22 1101    a
    # 2 6/2/22 1101    b
    # 3 6/3/22 1101    c
    # 4 6/1/22 1102    d
    # 5 6/2/22 1102    f
    # 6 6/3/22 1102    g
    # 7 6/2/22 1103    j
    # 8 6/3/22 1103    k
    # 9 6/4/22 1103    l
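Note that the per-operation `by` argument was added in dplyr 1.1.0. On older dplyr versions, explicit grouping gives the same result (a sketch, assuming the `example` data frame from the question):

```r
library(dplyr)

example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d", "e", "f", "g",
           "h", "i", "j", "k", "l"))

# group_by() + slice_tail(n = 1) keeps the last row of each (sub, date)
# group; ungroup() drops the grouping afterwards
result <- example %>%
  group_by(sub, date) %>%
  slice_tail(n = 1) %>%
  ungroup()
```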

Task 2

Use filter(n() > 1) by sub and date:

    example %>%
      filter(n() > 1, .by = c(sub, date))
    #     date  sub text
    # 1 6/2/22 1102    e
    # 2 6/2/22 1102    f
    # 3 6/2/22 1103    h
    # 4 6/2/22 1103    i
    # 5 6/2/22 1103    j
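Likewise, the `.by` argument to filter() requires dplyr 1.1.0; with older versions an explicit group_by() does the same job (a sketch, assuming the question's `example` data frame):

```r
library(dplyr)

example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d", "e", "f", "g",
           "h", "i", "j", "k", "l"))

# Keep all rows belonging to a (sub, date) group with more than one row;
# filter() preserves the original row order
result <- example %>%
  group_by(sub, date) %>%
  filter(n() > 1) %>%
  ungroup()
```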

Answer 2

Score: 1

The key is to use fromLast = TRUE in the base duplicated() function.

dplyr approach

As you've tagged [tag:dplyr], here is an approach using that package. First, create a column indicating whether each date is duplicated, starting from the last date by group:

    library(dplyr)
    example <- example |>
      group_by(sub) |>
      mutate(duplicate_date = duplicated(date, fromLast = TRUE))

Then for the first output, simply filter out the cases where there is a duplicated date, by sub:

    output_1 <- example |>
      filter(!duplicate_date) |>
      select(-duplicate_date)
    output_1
    #   date     sub text
    #   <chr>  <dbl> <chr>
    # 1 6/1/22  1101 a
    # 2 6/2/22  1101 b
    # 3 6/3/22  1101 c
    # 4 6/1/22  1102 d
    # 5 6/2/22  1102 f
    # 6 6/3/22  1102 g
    # 7 6/2/22  1103 j
    # 8 6/3/22  1103 k
    # 9 6/4/22  1103 l

For the second, group by date and sub and filter the cases where there are any duplicate dates, by group:

    output_2 <- example |>
      group_by(sub, date) |>
      filter(any(duplicate_date)) |>
      select(-duplicate_date)
    output_2
    #     date  sub text
    # 1 6/2/22 1102    e
    # 2 6/2/22 1102    f
    # 3 6/2/22 1103    h
    # 4 6/2/22 1103    i
    # 5 6/2/22 1103    j

Just to confirm, these are the same as the output1 and output2 you posted (you need to convert to data.frame first, or they will not be identical, as dplyr produces a tbl_df):

    identical(data.frame(output_1), output1) # TRUE
    identical(data.frame(output_2), output2) # TRUE

base R approach

If you want to use base R, here is a method that does essentially the same thing, using tapply(). Output 1:

    example <- within(example, duplicate_date <- unlist(
      tapply(date, sub, \(x) duplicated(x, fromLast = TRUE))
    ))
    output_1 <- with(example, example[!duplicate_date, ])

Output 2:

    duplicate_sub_date <- with(example,
      tapply(duplicate_date, list(date, sub), any)
    )
    output_2 <- example[diag(duplicate_sub_date[example$date, as.character(example$sub)]), ]

This is basically the same logic as the dplyr approach and produces the same output, though you get a data.frame rather than a tbl_df.
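As an alternative sketch (not from the answer above), the diag() indexing step for output 2 can be avoided with base R's ave(), which computes group sizes row-wise directly:

```r
example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d", "e", "f", "g",
           "h", "i", "j", "k", "l"))

# ave() returns, for each row, the size of its (sub, date) group;
# rows in groups of size > 1 are the duplicated entries
key <- paste(example$sub, example$date)
group_size <- ave(seq_along(key), key, FUN = length)
output_2 <- example[group_size > 1, ]
output_2$text  # "e" "f" "h" "i" "j"
```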


huangapple

  • Posted on June 5, 2023, 15:04:04
  • Please keep this link when reposting: https://go.coder-hub.com/76404153.html