How to remove rows of a data frame that include a sub-string that is already contained within another string?

Question

I have a data frame containing species names and several DNA sequences for each species. Within a species, some sequences are different, while others are similar but differ in size.

I want to remove the rows whose sequence is already contained in the sequence of another row, provided the contained sequence has a minimum size.

In the example below, if a sequence has a size of at least 5 and is contained in another sequence, I want to delete that row and keep the row with the longest sequence:

Species            | sequence      | size
-----------------------------------------
Tilapia guineensis | AAATGGA       | 7
Tilapia guineensis | AAATGGAATA    | 10
Tilapia guineensis | AAATGGAATAGAT | 13
Tilapia guineensis | TTATGGAGTAGA  | 12
Sprattus sprattus  | GTGCA         | 5
Sprattus sprattus  | GTGCAATGC     | 9
Sprattus sprattus  | GTGCAATGCA    | 10
Eutrigla gurnardus | ACTGACTGATCG  | 12
Eutrigla gurnardus | ACTGACT       | 7
Eutrigla gurnardus | ACGAGTTTGCGAG | 13
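
For reproducibility, the table could be built as a data frame along these lines. This is a sketch: the original post does not show how df was created, so the construction (and computing size with nchar()) is an assumption.

```r
# Example data transcribed from the table above
df <- data.frame(
  Species = c(rep("Tilapia guineensis", 4),
              rep("Sprattus sprattus", 3),
              rep("Eutrigla gurnardus", 3)),
  sequence = c("AAATGGA", "AAATGGAATA", "AAATGGAATAGAT", "TTATGGAGTAGA",
               "GTGCA", "GTGCAATGC", "GTGCAATGCA",
               "ACTGACTGATCG", "ACTGACT", "ACGAGTTTGCGAG"),
  stringsAsFactors = FALSE
)

# Assumption: size is simply the number of characters in each sequence
df$size <- nchar(df$sequence)
```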

The output would be this data frame:

Species            | sequence      | size
-----------------------------------------
Tilapia guineensis | AAATGGAATAGAT | 13
Tilapia guineensis | TTATGGAGTAGA  | 12
Sprattus sprattus  | GTGCAATGCA    | 10
Eutrigla gurnardus | ACTGACTGATCG  | 12
Eutrigla gurnardus | ACGAGTTTGCGAG | 13

I tried using dplyr to group the rows by Species and then grep to find and remove the sequences that are contained in the sequences of other rows. Unfortunately, I am not able to subset the sequences:

library(dplyr)
df2<-df%>% 
  group_by(Species) %>% 
    df[-(grep(sequence[1:5],sequence)),]

I am getting this error:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'pattern' in selecting a method for function 'grep': object of type 'closure' is not subsettable
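
A likely cause of this message, as far as I can tell: outside a dplyr verb such as mutate() or filter(), the bare name sequence does not refer to the data frame column but to the base R function sequence(), and subsetting a function raises this error. A minimal sketch that reproduces the message without any data:

```r
# `sequence` here resolves to the base R function, not a data frame column
is.function(sequence)

# Subsetting a function fails; capture the message instead of stopping
msg <- tryCatch(sequence[1:5], error = function(e) conditionMessage(e))
msg
```

The pipe also passes the grouped data as the first argument to whatever follows it, so writing df[...] after %>% does not operate on the grouped rows.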

Answer 1

Score: 2

df %>%
  group_by(Species) %>%
  mutate(inother = mapply(function(ptn, rownum) any(grepl(ptn, sequence[-rownum])),
                          sequence, row_number())) %>%
  filter(size >= 5 & !inother) %>%
  ungroup() %>%
  select(-inother)
# # A tibble: 5 × 3
#   Species            sequence       size
#   <chr>              <chr>         <int>
# 1 Tilapia guineensis AAATGGAATAGAT    13
# 2 Tilapia guineensis TTATGGAGTAGA     12
# 3 Sprattus sprattus  GTGCAATGCA       10
# 4 Eutrigla gurnardus ACTGACTGATCG     12
# 5 Eutrigla gurnardus ACGAGTTTGCGAG    13

The mapply code iterates over each sequence (as ptn) and checks whether it matches within another string. Because a sequence always matches itself, I exclude its own value from the vector searched by grepl using sequence[-rownum], where rownum is provided by row_number() within each group. If needed, this can be extended so that equal sequences in different rows do not match each other, by changing the pattern to something like sprintf("(.%s|%s.)", ptn, ptn), which requires at least one extra leading or trailing character.
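
The stricter pattern can be sketched on a plain character vector. The helper name in_longer and the sample sequences are hypothetical, chosen only to show how duplicated values behave:

```r
# Hypothetical sequences for one species; the second and third are identical
seqs <- c("GTGCA", "GTGCAATGC", "GTGCAATGC")

# TRUE if seqs[i] occurs inside a different, strictly longer element of seqs
in_longer <- function(seqs) {
  mapply(function(ptn, i) {
    # "." requires at least one extra character before or after ptn
    strict <- sprintf("(.%s|%s.)", ptn, ptn)
    any(grepl(strict, seqs[-i]))
  }, seqs, seq_along(seqs), USE.NAMES = FALSE)
}

in_longer(seqs)
# [1]  TRUE FALSE FALSE
```

With the plain pattern ptn, both copies of "GTGCAATGC" would match each other and both rows would be dropped; the stricter pattern keeps them, so exact duplicates may still need something like distinct().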

Answer 2

Score: 1

In base R (plus igraph for the clustering step):

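# Pairs (i, j) where sequence i occurs inside sequence j (partial edit distance 0)
a <- which(!adist(df$sequence, df$sequence, partial = TRUE), arr.ind = TRUE)
# Connected components of the containment graph group related sequences
b <- igraph::clusters(igraph::graph_from_data_frame(a, directed = FALSE))$membership
# Keep the rows whose size equals the maximum size within their component
subset(df, ave(size, b, FUN = max) == size)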
#>               Species      sequence size
#> 3  Tilapia guineensis AAATGGAATAGAT   13
#> 4  Tilapia guineensis  TTATGGAGTAGA   12
#> 7   Sprattus sprattus    GTGCAATGCA   10
#> 8  Eutrigla gurnardus  ACTGACTGATCG   12
#> 10 Eutrigla gurnardus ACGAGTTTGCGAG   13
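
The containment test in this answer rests on adist() with partial = TRUE, which measures the edit distance between one string and the closest substring of another. A small sketch of that step alone, using three sequences from the example (the variable names are illustrative):

```r
seqs <- c("GTGCA", "GTGCAATGC", "ACTGACT")

# d[i, j] is the edit distance between seqs[i] and the closest substring of seqs[j]
d <- adist(seqs, seqs, partial = TRUE)

# Zero distance means seqs[i] occurs verbatim inside seqs[j];
# the diagonal is always zero because each string contains itself
contained <- d == 0
contained[1, 2]  # "GTGCA" is inside "GTGCAATGC"
contained[2, 1]  # the longer string is not inside the shorter one
```

Note that, as written, the answer compares sequences across all species and does not apply the minimum size of 5. It gives the expected result on this data, but other data may need grouping by Species first.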

Published by huangapple on 2023-02-14 00:40:10
When reposting, please keep the link to this article: https://go.coder-hub.com/75438774.html