February 14, 2023, 00:40:10

How to remove rows of a data frame that include a sub-string that is already contained within another string?

Question

I have a data frame containing names of species and different DNA sequences for each species. Sometimes the sequences within a species are different; other times they are similar but of different sizes.

I want to remove the rows whose sequences are already contained in the sequences of other rows, provided they have a minimum size.

In the example below, if a sequence has a size of at least 5 and is contained in another sequence, I want to delete that row and keep the row with the longest sequence:

Species            | sequence      | size
-----------------------------------------
Tilapia guineensis | AAATGGA       | 7
Tilapia guineensis | AAATGGAATA    | 10
Tilapia guineensis | AAATGGAATAGAT | 13
Tilapia guineensis | TTATGGAGTAGA  | 12
Sprattus sprattus  | GTGCA         | 5
Sprattus sprattus  | GTGCAATGC     | 9
Sprattus sprattus  | GTGCAATGCA    | 10
Eutrigla gurnardus | ACTGACTGATCG  | 12
Eutrigla gurnardus | ACTGACT       | 7
Eutrigla gurnardus | ACGAGTTTGCGAG | 13

The output would be this data frame:

Species            | sequence      | size
-----------------------------------------
Tilapia guineensis | AAATGGAATAGAT | 13
Tilapia guineensis | TTATGGAGTAGA  | 12
Sprattus sprattus  | GTGCAATGCA    | 10
Eutrigla gurnardus | ACTGACTGATCG  | 12
Eutrigla gurnardus | ACGAGTTTGCGAG | 13

I have tried using dplyr to group the rows by Species and then grep to find and remove the sequences that are contained in the sequences of other rows. Unfortunately, I am not able to subset the sequences:

library(dplyr)
df2 <- df %>%
  group_by(Species) %>%
    df[-(grep(sequence[1:5], sequence)), ]

I am getting this error:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'pattern' in selecting a method for function 'grep': object of type 'closure' is not subsettable

Answer 1

Score: 2

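library(dplyr)

df %>%
  group_by(Species) %>%
  # inother: TRUE when this row's sequence matches another sequence in its group
  mutate(inother = mapply(function(ptn, rownum) any(grepl(ptn, sequence[-rownum])),
                          sequence, row_number())) %>%
  # keep sequences of at least 5 characters that are not contained elsewhere
  filter(size >= 5 & !inother) %>%
  ungroup() %>%
  select(-inother)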
# # A tibble: 5 × 3
#   Species            sequence       size
#   <chr>              <chr>         <int>
# 1 Tilapia guineensis AAATGGAATAGAT    13
# 2 Tilapia guineensis TTATGGAGTAGA     12
# 3 Sprattus sprattus  GTGCAATGCA       10
# 4 Eutrigla gurnardus ACTGACTGATCG     12
# 5 Eutrigla gurnardus ACGAGTTTGCGAG    13

The mapply call iterates over each sequence (as ptn) and checks whether it is matched in another string of the same group. Because a sequence always matches itself, I exclude its own value from the vector that grepl searches by using sequence[-rownum], where rownum is provided by row_number() within each group. If needed, this can be extended so that it does not match equal (but different-row) sequence values, by changing the pattern to something like sprintf("(.%s|%s.)", ptn, ptn), which requires at least one extra leading or trailing character.
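For anyone who wants to try the idea without the original data, here is a minimal self-contained sketch in base R. The data frame is rebuilt by hand from the question's table, and the helper name contained_in_other is hypothetical. It assumes the sequences should be matched as literal strings, so it passes fixed = TRUE to grepl instead of treating each sequence as a regular expression.

```r
# Example data rebuilt by hand from the question's table.
df <- data.frame(
  Species = rep(c("Tilapia guineensis", "Sprattus sprattus", "Eutrigla gurnardus"),
                times = c(4, 3, 3)),
  sequence = c("AAATGGA", "AAATGGAATA", "AAATGGAATAGAT", "TTATGGAGTAGA",
               "GTGCA", "GTGCAATGC", "GTGCAATGCA",
               "ACTGACTGATCG", "ACTGACT", "ACGAGTTTGCGAG"),
  stringsAsFactors = FALSE
)
df$size <- nchar(df$sequence)

# Hypothetical helper: for each element of x, is it contained in any
# *other* element of x? fixed = TRUE matches literally, not as a regex.
contained_in_other <- function(x) {
  vapply(seq_along(x), function(i) {
    any(grepl(x[i], x[-i], fixed = TRUE))
  }, logical(1))
}

# Apply the helper within each species, then put the flags back in row order.
flags_by_species <- lapply(split(df$sequence, df$Species), contained_in_other)
inother <- unsplit(flags_by_species, df$Species)

# Drop rows that are at least 5 characters long and contained in another row.
drop_row <- df$size >= 5 & inother
result <- df[!drop_row, ]
print(result)
```

This sketch follows the question's wording, so rows shorter than 5 characters are kept, whereas the filter in the answer above also removes them; on the example data the two give the same five rows. Identical sequences on different rows would flag each other here, so deduplicating first (for example with duplicated()) may be needed.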

Answer 2

Score: 1

In base R (with igraph for the clustering step):

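# a: (row, col) index pairs where sequence[row] occurs inside sequence[col]
a <- which(!adist(df$sequence, df$sequence, partial = TRUE), TRUE)
# b: connected components of the graph built from those pairs
b <- igraph::clusters(igraph::graph_from_data_frame(a, FALSE))$membership
# keep the longest sequence in each component
subset(df, ave(size, b, FUN = max) == size)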
#>               Species      sequence size
#> 3  Tilapia guineensis AAATGGAATAGAT   13
#> 4  Tilapia guineensis  TTATGGAGTAGA   12
#> 7   Sprattus sprattus    GTGCAATGCA   10
#> 8  Eutrigla gurnardus  ACTGACTGATCG   12
#> 10 Eutrigla gurnardus ACGAGTTTGCGAG   13
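
The first line of this answer relies on adist with partial = TRUE returning 0 when one string occurs inside another. A small sketch of just that step, using a hand-picked subset of the example sequences (the variable names are illustrative, not from the answer):

```r
seqs <- c("GTGCA", "GTGCAATGC", "GTGCAATGCA", "ACTGACT")

# With partial = TRUE, entry [i, j] is the edit distance between seqs[i]
# and its best-matching substring of seqs[j]; 0 means seqs[i] occurs in seqs[j].
d <- utils::adist(seqs, seqs, partial = TRUE)
dimnames(d) <- list(pattern = seqs, target = seqs)

contained <- d == 0
diag(contained) <- FALSE  # ignore each sequence matching itself
print(contained)

# Sequences that occur inside at least one other sequence
print(seqs[rowSums(contained) > 0])
```

Note that the answer's code compares every row with every other row rather than working within each species, so a sequence contained in a sequence of a different species would also be grouped with it.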
