How to keep exactly two duplicated records in a data frame, after grouping the data according to one of the columns?

Question

I have a data frame with IDs, species names and DNA sequences.
Some species in the data frame have repeated sequences, and for each species I want to keep exactly two copies of each duplicated sequence (so if Species X has 100 identical sequences, I want to keep just two of them). It doesn't matter which IDs the two kept sequences come from; they can be random or simply the first instances found.

ID  |species  |sequence
----------------------------
001 |Species A|ATGTAGCTCAGC
002 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
013 |Species F|ATCGTAGCCTTG
014 |Species F|ATCGTAGCCTTG
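
To make the example reproducible, the table above can be rebuilt roughly as follows. This is only a sketch: the name `df` matches the code in the question, but storing `ID` as character (to keep the leading zeros) and the helper column name `n_copies` are assumptions, not something the original post specifies.

```r
library(dplyr)

# Rebuild the example table. IDs are stored as character (an assumption)
# so that the leading zeros shown in the table are preserved.
df <- data.frame(
  ID = sprintf("%03d", 1:14),
  species = paste("Species", rep(c("A", "B", "C", "D", "E", "F"),
                                 times = c(3, 2, 1, 1, 3, 4))),
  sequence = c(
    "ATGTAGCTCAGC", "ATGTAGCTCAGC", "ATGTAGCTCAGC",                  # Species A
    "CGCGCGATATTA", "AAACGGCCAATC",                                  # Species B
    "TGTCGGCTCGTC",                                                  # Species C
    "ATGTAGCTCAGC",                                                  # Species D
    "GCGCGGAGATTT", "GCGCGGAGATTT", "AACTCTATATAT",                  # Species E
    "ATCGTAGCCTTG", "GGGCGCGCGGCG", "ATCGTAGCCTTG", "ATCGTAGCCTTG"   # Species F
  ),
  stringsAsFactors = FALSE
)

# Count how many times each sequence occurs within each species
seq_counts <- df %>% count(species, sequence, name = "n_copies")
seq_counts
```

Groups where `n_copies` is greater than 2 are the ones that need trimming; all other rows should pass through unchanged.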

I have used this code to keep only one of the repeated sequences for each species and filter out all the other repeats.
What is the best way to alter it so that it keeps two random repeated sequences instead of just one?

library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice(1) %>%
  ungroup()

My desired output would look like this (although the repeated sequences that are kept could be different ones):

ID  |species  |sequence
----------------------------
001 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
014 |Species F|ATCGTAGCCTTG

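Answer 1

Score: 2

Use slice_head(n = 2). It keeps the first two rows of each (species, sequence) group, and groups with fewer than two rows are kept whole: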

library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice_head(n=2) %>%
  ungroup()

df_filtered
# A tibble: 12 × 3
      ID species   sequence    
   <dbl> <chr>     <chr>       
 1     1 Species A ATGTAGCTCAGC
 2     2 Species A ATGTAGCTCAGC
 3     5 Species B AAACGGCCAATC
 4     4 Species B CGCGCGATATTA
 5     6 Species C TGTCGGCTCGTC
 6     7 Species D ATGTAGCTCAGC
 7    10 Species E AACTCTATATAT
 8     8 Species E GCGCGGAGATTT
 9     9 Species E GCGCGGAGATTT
10    11 Species F ATCGTAGCCTTG
11    13 Species F ATCGTAGCCTTG
12    12 Species F GGGCGCGCGGCG
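
If the two kept rows should be chosen at random rather than taken from the top of each group, one option is `slice_sample()` from the same dplyr `slice_*` family. The sketch below is not part of the original answer; the small stand-in data frame, the seed value, and the name `df_random` are assumptions made for illustration.

```r
library(dplyr)

# Small stand-in for the question's data (Species A and B only)
df <- data.frame(
  ID = 1:5,
  species = c("Species A", "Species A", "Species A", "Species B", "Species B"),
  sequence = c("ATGTAGCTCAGC", "ATGTAGCTCAGC", "ATGTAGCTCAGC",
               "CGCGCGATATTA", "AAACGGCCAATC"),
  stringsAsFactors = FALSE
)

set.seed(123)  # arbitrary seed, only to make the random draw repeatable

df_random <- df %>%
  group_by(species, sequence) %>%
  slice_sample(n = 2) %>%  # without replacement, n is capped at the group size
  ungroup() %>%
  arrange(ID)              # restore the original row order

df_random
```

Because sampling is without replacement by default, single-row groups come through untouched, so the result has the same shape as the `slice_head()` version; only which of the duplicated rows survive may differ.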

huangapple
  • Posted on June 1, 2023, 00:30:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76375606.html