How to remove rows of a data frame whose string is a substring of the string in another row?
Question
I have a data frame containing species names and several DNA sequences for each species. Within a species, the sequences are sometimes different and sometimes similar but of different sizes.
I want to remove the rows whose sequence is already contained in the sequence of another row, provided the contained sequence has a minimum size.
In the example below, if a sequence has a size of at least 5 and is contained in another sequence, I want to delete that row and keep the row with the longest sequence:
Species | sequence | size
-----------------------------------------
Tilapia guineensis | AAATGGA | 7
Tilapia guineensis | AAATGGAATA |10
Tilapia guineensis | AAATGGAATAGAT|13
Tilapia guineensis | TTATGGAGTAGA |12
Sprattus sprattus | GTGCA |5
Sprattus sprattus | GTGCAATGC |9
Sprattus sprattus | GTGCAATGCA |10
Eutrigla gurnardus | ACTGACTGATCG |12
Eutrigla gurnardus | ACTGACT |7
Eutrigla gurnardus | ACGAGTTTGCGAG|13
The output would be this data frame:
Species | sequence | size
--------------------------------------------
Tilapia guineensis | AAATGGAATAGAT|13
Tilapia guineensis | TTATGGAGTAGA |12
Sprattus sprattus | GTGCAATGCA |10
Eutrigla gurnardus | ACTGACTGATCG |12
Eutrigla gurnardus | ACGAGTTTGCGAG|13
I tried using dplyr to group the rows by Species and then grep to find and remove the sequences that are contained in the sequences of other rows. Unfortunately, I am unable to subset the sequences:
library(dplyr)
df2<-df%>%
group_by(Species) %>%
df[-(grep(sequence[1:5],sequence)),]
I am getting this error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'pattern' in selecting a method for function 'grep': object of type 'closure' is not subsettable
Answer 1
Score: 2
library(dplyr)

df %>%
  group_by(Species) %>%
  # inother: TRUE when this row's sequence is found in another row of the same species
  mutate(inother = mapply(function(ptn, rownum) any(grepl(ptn, sequence[-rownum])),
                          sequence, row_number())) %>%
  filter(size >= 5 & !inother) %>%
  ungroup() %>%
  select(-inother)
# # A tibble: 5 × 3
# Species sequence size
# <chr> <chr> <int>
# 1 Tilapia guineensis AAATGGAATAGAT 13
# 2 Tilapia guineensis TTATGGAGTAGA 12
# 3 Sprattus sprattus GTGCAATGCA 10
# 4 Eutrigla gurnardus ACTGACTGATCG 12
# 5 Eutrigla gurnardus ACGAGTTTGCGAG 13
The mapply code iterates over each sequence (as ptn) and checks whether it is matched in another string of the same group. Because a sequence always matches itself, I exclude its own value from the right-hand side of the grepl call using sequence[-rownum], where rownum is provided by row_number() within each group. If needed, this can be extended so that it does not match equal (but different-row) sequence values by changing the pattern to something like sprintf("(.%s|%s.)", ptn, ptn), which requires at least one extra leading or trailing character.
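The point about equal sequences in different rows can also be handled without a regular expression. The following is only a sketch, not the code from the answer above: it swaps the sprintf pattern for a length comparison plus literal matching (fixed = TRUE), so a sequence counts as contained only when it occurs inside a strictly longer one. The helper name in_longer and the toy data are made up for illustration. It also assumes one particular reading of the minimum-size rule, namely that sequences shorter than 5 are kept rather than dropped.

```r
library(dplyr)

# Toy data (made up for illustration): one nested sequence, one exact duplicate
df <- data.frame(
  Species  = c("A", "A", "A", "A"),
  sequence = c("GTGCA", "GTGCAATGC", "GTGCAATGC", "TTTTT"),
  stringsAsFactors = FALSE
)
df$size <- nchar(df$sequence)

# Hypothetical helper: TRUE when seqs[i] occurs inside a strictly longer
# element of the same vector, so exact duplicates do not remove each other.
in_longer <- function(seqs) {
  vapply(seq_along(seqs), function(i) {
    longer <- seqs[nchar(seqs) > nchar(seqs[i])]
    any(grepl(seqs[i], longer, fixed = TRUE))  # literal match, no regex
  }, logical(1))
}

res <- df %>%
  group_by(Species) %>%
  # Assumption: drop a row only if it meets the minimum size AND is contained
  filter(!(size >= 5 & in_longer(sequence))) %>%
  ungroup()

res$sequence
```

Here GTGCA is dropped because it occurs inside GTGCAATGC, while both copies of GTGCAATGC survive. If exact duplicates should collapse to a single row, adding distinct() to the end of the pipeline is one option.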
Answer 2
Score: 1
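In base R, with igraph for the clustering step:
# Pairs (row, col) where sequence[row] is a substring of sequence[col]
a <- which(!adist(df$sequence, df$sequence, partial = TRUE), TRUE)
# Connected components of the resulting containment graph
b <- igraph::clusters(igraph::graph_from_data_frame(a, FALSE))$membership
# Keep the longest sequence in each component
subset(df, ave(size, b, FUN = max) == size)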
#> Species sequence size
#> 3 Tilapia guineensis AAATGGAATAGAT 13
#> 4 Tilapia guineensis TTATGGAGTAGA 12
#> 7 Sprattus sprattus GTGCAATGCA 10
#> 8 Eutrigla gurnardus ACTGACTGATCG 12
#> 10 Eutrigla gurnardus ACGAGTTTGCGAG 13
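For readers who would rather avoid the igraph dependency, the same result on the example data can be reached with base R alone. This is a sketch under stated assumptions rather than a rewrite of the answer above: unlike the adist approach, it compares sequences only within the same Species and applies the size threshold from the question. The variable names min_size and contained are chosen here for illustration.

```r
# Example data from the question
df <- data.frame(
  Species = rep(c("Tilapia guineensis", "Sprattus sprattus", "Eutrigla gurnardus"),
                times = c(4, 3, 3)),
  sequence = c("AAATGGA", "AAATGGAATA", "AAATGGAATAGAT", "TTATGGAGTAGA",
               "GTGCA", "GTGCAATGC", "GTGCAATGCA",
               "ACTGACTGATCG", "ACTGACT", "ACGAGTTTGCGAG"),
  stringsAsFactors = FALSE
)
df$size <- nchar(df$sequence)

min_size <- 5  # threshold taken from the question's example

# contained[i] is TRUE when row i's sequence occurs in another row of its species
contained <- logical(nrow(df))
for (idx in split(seq_len(nrow(df)), df$Species)) {
  for (i in idx) {
    others <- df$sequence[setdiff(idx, i)]
    contained[i] <- any(grepl(df$sequence[i], others, fixed = TRUE))
  }
}

# Drop rows that meet the minimum size and are contained elsewhere
result <- df[!(df$size >= min_size & contained), ]
result
```

The nested loop is quadratic within each species, which should be fine for small groups but may become slow when a species has many thousands of sequences.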