How to keep exactly two duplicated records in a data frame, after grouping the data according to one of the columns?
Question
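The following modified code keeps up to two randomly chosen copies of each duplicated sequence per species:
library(dplyr)
set.seed(123) # set the random seed so the result is reproducible
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice_sample(n = 2) %>% # randomly pick up to two rows from each (species, sequence) group
  ungroup()
This code randomly selects at most two rows from each group of identical sequences within a species and keeps them in the resulting data frame. slice_sample() returns every row of a group that has fewer than two rows, so unique sequences are kept as they are.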
I have a data frame with IDs, species names and DNA sequences.
Some species in the data frame have repeated sequences, and for each species I want to keep exactly two of those duplicated sequences (so if Species X has 100 identical sequences, I want to keep just two of them). It doesn't matter which IDs the two duplicated sequences come from; they can be random, or they can be the first instances found.
ID | species |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
002 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
013 |Species F|ATCGTAGCCTTG
014 |Species F|ATCGTAGCCTTG
I have used this code to keep only one of the repeated sequences for each species and filter out all other repeated sequences.
What is the best way to alter it so that it keeps two random repeated sequences instead of just one?
library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice(1) %>%
  ungroup()
My output would be this (although the repeated sequences that are kept could be others):
ID | species |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
014 |Species F|ATCGTAGCCTTG
Answer 1
Score: 2
Use slice_head(n = 2):
library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice_head(n = 2) %>%
  ungroup()
df_filtered
# A tibble: 12 × 3
      ID species   sequence
   <dbl> <chr>     <chr>
 1     1 Species A ATGTAGCTCAGC
 2     2 Species A ATGTAGCTCAGC
 3     5 Species B AAACGGCCAATC
 4     4 Species B CGCGCGATATTA
 5     6 Species C TGTCGGCTCGTC
 6     7 Species D ATGTAGCTCAGC
 7    10 Species E AACTCTATATAT
 8     8 Species E GCGCGGAGATTT
 9     9 Species E GCGCGGAGATTT
10    11 Species F ATCGTAGCCTTG
11    13 Species F ATCGTAGCCTTG
12    12 Species F GGGCGCGCGGCG
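To check the result, the example can be rebuilt and the group sizes counted. This is only a sketch: the data frame below is retyped from the table in the question, with IDs stored as character strings (an assumption, made so the leading zeros are kept), and group_sizes is a helper name introduced here for illustration.

```r
library(dplyr)

# Example data retyped from the question's table (assumed layout).
df <- data.frame(
  ID = sprintf("%03d", 1:14),
  species = paste("Species", c("A", "A", "A", "B", "B", "C", "D",
                               "E", "E", "E", "F", "F", "F", "F")),
  sequence = c("ATGTAGCTCAGC", "ATGTAGCTCAGC", "ATGTAGCTCAGC",
               "CGCGCGATATTA", "AAACGGCCAATC", "TGTCGGCTCGTC",
               "ATGTAGCTCAGC", "GCGCGGAGATTT", "GCGCGGAGATTT",
               "AACTCTATATAT", "ATCGTAGCCTTG", "GGGCGCGCGGCG",
               "ATCGTAGCCTTG", "ATCGTAGCCTTG"),
  stringsAsFactors = FALSE
)

# Keep the first two rows of every (species, sequence) group.
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice_head(n = 2) %>%
  ungroup()

# Hypothetical helper: rows per (species, sequence) pair after filtering.
group_sizes <- count(df_filtered, species, sequence)

nrow(df_filtered)   # 12 of the original 14 rows remain
max(group_sizes$n)  # 2, so no pair is kept more than twice
```

Because slice_head() takes rows in their original order, IDs 003 and 014 are the ones dropped here; a random choice would need a sampling verb instead.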