June 1, 2023 00:30:30

How to keep exactly two duplicated records in a data frame, after grouping the data according to one of the columns?

Question

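The following modified code keeps up to two random duplicate sequences for each species:

library(dplyr)
set.seed(123) # set the random seed so the result is reproducible
df_filtered <- df %>%
  group_by(species, sequence) %>%
  # randomly pick at most two rows per group; unlike sample_n(size = 2),
  # slice_sample() does not error on groups that contain a single row
  slice_sample(n = 2) %>%
  ungroup()

This code randomly selects at most two rows from each species/sequence group and keeps them in the resulting data frame.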

I have a data frame with IDs, species names and DNA sequences.
Some species in the df have repeated sequences, and for each species I want to keep exactly two of those duplicated sequences (so if Species X has 100 identical sequences, I want to keep just two of them). It doesn't matter which IDs the two kept sequences come from; they can be random or the first instances found.

ID  | species  |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
002 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
013 |Species F|ATCGTAGCCTTG
014 |Species F|ATCGTAGCCTTG

I have used this code to keep only one of the repeated sequences for each species and filter out all the other repeats.
What is the best way to alter it so that it keeps two random repeated sequences instead of just one?

library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice(1) %>%
  ungroup()

My desired output would look like this (the particular repeated sequences that are kept could be different ones):

ID  | species  |sequence
---------------------------
001 |Species A|ATGTAGCTCAGC
003 |Species A|ATGTAGCTCAGC
004 |Species B|CGCGCGATATTA
005 |Species B|AAACGGCCAATC
006 |Species C|TGTCGGCTCGTC
007 |Species D|ATGTAGCTCAGC
008 |Species E|GCGCGGAGATTT
009 |Species E|GCGCGGAGATTT
010 |Species E|AACTCTATATAT
011 |Species F|ATCGTAGCCTTG
012 |Species F|GGGCGCGCGGCG
014 |Species F|ATCGTAGCCTTG

Answer 1

Score: 2

Use slice_head(n=2):

library(dplyr)
df_filtered <- df %>%
  group_by(species, sequence) %>%
  slice_head(n=2) %>%
  ungroup()
df_filtered
# A tibble: 12 × 3
      ID species   sequence    
   <dbl> <chr>     <chr>       
 1     1 Species A ATGTAGCTCAGC
 2     2 Species A ATGTAGCTCAGC
 3     5 Species B AAACGGCCAATC
 4     4 Species B CGCGCGATATTA
 5     6 Species C TGTCGGCTCGTC
 6     7 Species D ATGTAGCTCAGC
 7    10 Species E AACTCTATATAT
 8     8 Species E GCGCGGAGATTT
 9     9 Species E GCGCGGAGATTT
10    11 Species F ATCGTAGCCTTG
11    13 Species F ATCGTAGCCTTG
12    12 Species F GGGCGCGCGGCG
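
For anyone who wants to try this without the original data, here is a minimal reproducible sketch. The data frame is retyped from the table in the question; storing ID as character (to keep the leading zeros) and the object names first_two, random_two and max_per_group are my own assumptions, not part of the original post. It shows both the "first two" variant from the answer and a random variant with slice_sample(), which quietly returns fewer rows when a group has fewer than two.

```r
library(dplyr)

# Toy data retyped from the question's table. ID is kept as character here
# (an assumption) so the leading zeros survive.
df <- data.frame(
  ID = sprintf("%03d", 1:14),
  species = paste("Species", c("A", "A", "A", "B", "B", "C", "D",
                               "E", "E", "E", "F", "F", "F", "F")),
  sequence = c("ATGTAGCTCAGC", "ATGTAGCTCAGC", "ATGTAGCTCAGC",
               "CGCGCGATATTA", "AAACGGCCAATC", "TGTCGGCTCGTC",
               "ATGTAGCTCAGC", "GCGCGGAGATTT", "GCGCGGAGATTT",
               "AACTCTATATAT", "ATCGTAGCCTTG", "GGGCGCGCGGCG",
               "ATCGTAGCCTTG", "ATCGTAGCCTTG")
)

# Deterministic variant: the first two rows of each species/sequence group.
first_two <- df %>%
  group_by(species, sequence) %>%
  slice_head(n = 2) %>%
  ungroup()

# Random variant: at most two rows drawn at random from each group.
set.seed(123)  # only needed to make the random draw repeatable
random_two <- df %>%
  group_by(species, sequence) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Sanity check: the largest number of rows left in any species/sequence group.
max_per_group <- first_two %>%
  count(species, sequence) %>%
  pull(n) %>%
  max()
```

Both results should contain 12 of the 14 input rows, since only Species A and Species F have a sequence that occurs more than twice. The two variants differ only in which of the surplus rows are dropped.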
