How to keep only the first set of duplicates if there are multiple duplicates in a column
Question

clin.info$Sample.ID has duplicates. If a Sample.ID has more than one pair of duplicates, I want to keep only the first pair.

    n_occur <- data.frame(table(clin.info$Sample.ID))
    multiple.duplicates <- n_occur[n_occur$Freq > 2,]
    if(multiple.duplicates$Var1 %in% clin.info$Sample.ID){
      clin.info <- clin.info %>%
        group_by(Sample.ID) %>%
        distinct
    }

Traceback:

    Error in if (multiple.duplicates$Var1 %in% clin.info$Sample.ID) { :
      argument is of length zero

Data:

    > dput(clin.info)
    structure(list(Sample.ID = c("TCGA.B2.3924.01", "TCGA.B2.3924.01",
    "TCGA.B2.3924.01", "TCGA.B2.3924.01", "TCGA.B2.5635.01", "TCGA.B2.5635.01",
    "TCGA.B2.5635.01", "TCGA.B2.5635.01", "TCGA.B2.5635.01", "TCGA.B2.5635.01",
    "TCGA.A3.3357.01", "TCGA.A3.3357.01", "TCGA.A3.3367.01", "TCGA.A3.3367.01",
    "TCGA.A3.3387.01", "TCGA.A3.3387.01", "TCGA.B0.4698.01", "TCGA.B0.4698.01",
    "TCGA.B0.4710.01", "TCGA.B0.4710.01"), age = c("73", "73", "73",
    "73", "74", "74", "74", "74", "74", "74", "62", "62", "72", "72",
    "49", "49", "75", "75", "75", "75")), row.names = c(67L, 68L,
    69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 1L, 2L, 3L, 4L, 5L, 6L,
    7L, 8L, 9L, 10L), class = "data.frame")
    > dput(multiple.duplicates)
    structure(list(Var1 = structure(6:7, levels = c("TCGA.A3.3357.01",
    "TCGA.A3.3367.01", "TCGA.A3.3387.01", "TCGA.B0.4698.01", "TCGA.B0.4710.01",
    "TCGA.B2.3924.01", "TCGA.B2.5635.01"), class = "factor"), Freq = c(4L,
    6L)), row.names = 6:7, class = "data.frame")

Expected output:

Based on multiple.duplicates, two Sample.ID values have more than one pair of duplicates.

Hence, for these two Sample.ID values, keep only the first pair of duplicate rows in clin.info.

Answer 1

Score: 2

    dplyr::slice_head(clin.info, n = 2, by = Sample.ID)
    #>          Sample.ID age
    #> 1  TCGA.B2.3924.01  73
    #> 2  TCGA.B2.3924.01  73
    #> 3  TCGA.B2.5635.01  74
    #> 4  TCGA.B2.5635.01  74
    #> 5  TCGA.A3.3357.01  62
    #> 6  TCGA.A3.3357.01  62
    #> 7  TCGA.A3.3367.01  72
    #> 8  TCGA.A3.3367.01  72
    #> 9  TCGA.A3.3387.01  49
    #> 10 TCGA.A3.3387.01  49
    #> 11 TCGA.B0.4698.01  75
    #> 12 TCGA.B0.4698.01  75
    #> 13 TCGA.B0.4710.01  75
    #> 14 TCGA.B0.4710.01  75

Created on 2023-05-28 with reprex v2.0.2

Input data:

    clin.info <-
      structure(list(Sample.ID = c("TCGA.B2.3924.01", "TCGA.B2.3924.01",
      "TCGA.B2.3924.01", "TCGA.B2.3924.01", "TCGA.B2.5635.01", "TCGA.B2.5635.01",
      "TCGA.B2.5635.01", "TCGA.B2.5635.01", "TCGA.B2.5635.01", "TCGA.B2.5635.01",
      "TCGA.A3.3357.01", "TCGA.A3.3357.01", "TCGA.A3.3367.01", "TCGA.A3.3367.01",
      "TCGA.A3.3387.01", "TCGA.A3.3387.01", "TCGA.B0.4698.01", "TCGA.B0.4698.01",
      "TCGA.B0.4710.01", "TCGA.B0.4710.01"), age = c("73", "73", "73",
      "73", "74", "74", "74", "74", "74", "74", "62", "62", "72", "72",
      "49", "49", "75", "75", "75", "75")), row.names = c(67L, 68L,
      69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 1L, 2L, 3L, 4L, 5L, 6L,
      7L, 8L, 9L, 10L), class = "data.frame")
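
For readers without a recent dplyr (the `by` argument of `slice_head()` needs dplyr 1.1.0 or later), the same "first two rows per Sample.ID" idea can be sketched in base R. This is only an illustration, not the answerer's code: the toy data and the names `pos` and `first.pair` are made up for the example.

```r
# Toy data in the same shape as clin.info (values are illustrative)
clin.info <- data.frame(
  Sample.ID = c("A", "A", "A", "A", "B", "B", "C", "C", "C"),
  age       = c("73", "73", "73", "73", "62", "62", "74", "74", "74"),
  stringsAsFactors = FALSE
)

# Position of each row within its Sample.ID group (1, 2, 3, ...)
pos <- ave(seq_len(nrow(clin.info)), clin.info$Sample.ID, FUN = seq_along)

# Keep the first two rows of every group; the original row order is preserved
first.pair <- clin.info[pos <= 2, ]
print(first.pair)
```

Groups that already have two or fewer rows pass through unchanged, so no separate check for "which IDs occur more than twice" is needed.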

Answer 2

Score: 0

I think you can use the code below:

    library(dplyr)

    # Groups with more than two rows, collapsed to their distinct rows.
    # Note: distinct() leaves one row per group when the rows are identical.
    dedup <- clin.info %>%
      group_by(Sample.ID) %>%
      filter(n() > 2) %>%
      distinct() %>%
      ungroup()

    if (nrow(dedup) > 0) {
      # Swap the over-represented groups for their collapsed versions
      result <- clin.info %>%
        filter(!(Sample.ID %in% dedup$Sample.ID)) %>%
        bind_rows(dedup)
    }
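
On the error in the question: `if` needs a condition of length one. One way the message "argument is of length zero" can arise is when no Sample.ID occurs more than twice, so the subset has zero rows and `%in%` returns an empty logical vector. With the two-row `multiple.duplicates` shown in the question, the condition has length two instead, which R 4.2.0 and later also rejects. The sketch below uses made-up toy data and the hypothetical names `cond` and `has.multiples` to show the zero-row case and a length-one guard.

```r
# Toy data; values are illustrative
clin.info <- data.frame(
  Sample.ID = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

n_occur <- data.frame(table(clin.info$Sample.ID))
multiple.duplicates <- n_occur[n_occur$Freq > 2, ]

# No ID occurs more than twice, so the subset has zero rows
cond <- multiple.duplicates$Var1 %in% clin.info$Sample.ID
length(cond)  # 0, so if (cond) would fail with "argument is of length zero"

# A length-one guard that works for zero, one, or many matching IDs
has.multiples <- nrow(multiple.duplicates) > 0
if (has.multiples) {
  message("some Sample.ID values occur more than twice")
} else {
  message("nothing to trim")
}
```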

huangapple
  • Posted on May 28, 2023, 22:17:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76351934.html