How to keep exactly two duplicated records in a data frame, after grouping the data according to one of the columns?

Question

I have a data frame with IDs, species names and DNA sequences.
Some species in the df have repeated sequences, and for each species I want to keep exactly two of those duplicated sequences (so if Species X has 100 identical sequences, I want to keep just two of them). It doesn't matter which IDs the two kept sequences come from: they can be picked at random or be the first instances found.

    ID  | species   | sequence
    ----|-----------|-------------
    001 | Species A | ATGTAGCTCAGC
    002 | Species A | ATGTAGCTCAGC
    003 | Species A | ATGTAGCTCAGC
    004 | Species B | CGCGCGATATTA
    005 | Species B | AAACGGCCAATC
    006 | Species C | TGTCGGCTCGTC
    007 | Species D | ATGTAGCTCAGC
    008 | Species E | GCGCGGAGATTT
    009 | Species E | GCGCGGAGATTT
    010 | Species E | AACTCTATATAT
    011 | Species F | ATCGTAGCCTTG
    012 | Species F | GGGCGCGCGGCG
    013 | Species F | ATCGTAGCCTTG
    014 | Species F | ATCGTAGCCTTG

I have used this code to keep only one of the repeated sequences for each species and filter out all the other repeats.
What is the best way to alter it so that it keeps two of the repeated sequences (random ones are fine) instead of just one?

    library(dplyr)

    df_filtered <- df %>%
      group_by(species, sequence) %>%
      slice(1) %>%
      ungroup()

My output would be this (although the repeated sequences that are kept could be others):

    ID  | species   | sequence
    ----|-----------|-------------
    001 | Species A | ATGTAGCTCAGC
    003 | Species A | ATGTAGCTCAGC
    004 | Species B | CGCGCGATATTA
    005 | Species B | AAACGGCCAATC
    006 | Species C | TGTCGGCTCGTC
    007 | Species D | ATGTAGCTCAGC
    008 | Species E | GCGCGGAGATTT
    009 | Species E | GCGCGGAGATTT
    010 | Species E | AACTCTATATAT
    011 | Species F | ATCGTAGCCTTG
    012 | Species F | GGGCGCGCGGCG
    014 | Species F | ATCGTAGCCTTG

Answer 1

Score: 2

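Use slice_head(n = 2), which keeps the first two rows of each species/sequence group (or the only row, when a group has just one):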

    library(dplyr)

    df_filtered <- df %>%
      group_by(species, sequence) %>%
      slice_head(n = 2) %>%
      ungroup()

    df_filtered
    # A tibble: 12 × 3
          ID species   sequence
       <dbl> <chr>     <chr>
     1     1 Species A ATGTAGCTCAGC
     2     2 Species A ATGTAGCTCAGC
     3     5 Species B AAACGGCCAATC
     4     4 Species B CGCGCGATATTA
     5     6 Species C TGTCGGCTCGTC
     6     7 Species D ATGTAGCTCAGC
     7    10 Species E AACTCTATATAT
     8     8 Species E GCGCGGAGATTT
     9     9 Species E GCGCGGAGATTT
    10    11 Species F ATCGTAGCCTTG
    11    13 Species F ATCGTAGCCTTG
    12    12 Species F GGGCGCGCGGCG
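
Since the question says the two kept rows may just as well be random, slice_sample() could be swapped in for slice_head(). The following is only a sketch, assuming dplyr 1.0.0 or later. The df built here is reconstructed from the table in the question (with IDs stored as integers), and the name df_random and the seed value are illustrative choices, not part of the original answer. As far as I know, slice_sample(n = 2) silently returns the whole group when it has fewer than two rows, whereas the older sample_n(size = 2) would raise an error for such groups.

```r
library(dplyr)

# Example data rebuilt from the table in the question (assumption: IDs as integers)
df <- tibble(
  ID = 1:14,
  species = paste("Species", rep(c("A", "B", "C", "D", "E", "F"),
                                 times = c(3, 2, 1, 1, 3, 4))),
  sequence = c(
    "ATGTAGCTCAGC", "ATGTAGCTCAGC", "ATGTAGCTCAGC",                  # Species A
    "CGCGCGATATTA", "AAACGGCCAATC",                                  # Species B
    "TGTCGGCTCGTC",                                                  # Species C
    "ATGTAGCTCAGC",                                                  # Species D
    "GCGCGGAGATTT", "GCGCGGAGATTT", "AACTCTATATAT",                  # Species E
    "ATCGTAGCCTTG", "GGGCGCGCGGCG", "ATCGTAGCCTTG", "ATCGTAGCCTTG"   # Species F
  )
)

set.seed(123)  # arbitrary seed, only so the random pick is repeatable

df_random <- df %>%
  group_by(species, sequence) %>%
  slice_sample(n = 2) %>%  # at most two random rows per species/sequence pair
  ungroup()

# Sanity check: list any species/sequence pair that still has more than two rows
df_random %>%
  count(species, sequence) %>%
  filter(n > 2)
```

The final count() call should return zero rows if the filtering worked. Which of the three Species A rows (or which of the three matching Species F rows) survive depends on the seed.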

huangapple
  • Posted on June 1, 2023, 00:30:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76375606.html