Get the character indices of matches in one string and apply them to another string
Question
I have the dataframe below, where each row represents a change in text. I then use the adist() function to extract whether each change is a match (M), insertion (I), substitution (S) or deletion (D). I need to find all of the indices of the Is in the change column (illustrated here in the insertion_idx column). Using those indices, I need to extract the corresponding characters in current_text (illustrated here in insertion_chars).
library(tibble)
df <- tibble(current_text = c("A","AB","ABCD","ABZ"),
             previous_text = c("","A","AB","ABCD"),
             change = c("I","MI","MMII","MMSD"),
             # list column: c() would flatten the nested vectors into 5 elements
             insertion_idx = list(1, 2, c(3, 4), integer(0)),
             insertion_chars = c("A","B","CD",""))
I have tried splitting up strings and comparing string differences, but this gets very messy very fast with real-world data. How do I accomplish the above task?
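For context, the change column can be rebuilt from the two text columns. The following is a minimal sketch, assuming the edit strings come from base R's adist() called with counts = TRUE, which stores the edit sequence in the "trafos" attribute. The helper name get_change is hypothetical and not part of the question.

```r
# Hypothetical helper: edit sequence that turns `previous` into `current`,
# computed pairwise (adist() itself compares every x with every y).
get_change <- function(previous, current) {
  mapply(function(p, cur) {
    d <- adist(p, cur, counts = TRUE)
    attr(d, "trafos")[1, 1]
  }, previous, current, USE.NAMES = FALSE)
}

change <- get_change(c("A", "AB"), c("AB", "ABCD"))
change
# "MI" "MMII"
```

Calling adist() once per row avoids building the full cross-comparison matrix when only the row-wise pairs are needed.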
Answer 1
Score: 3
Turning my comment about using gregexpr and regmatches into an answer.
Much of this procedure is similar to the approach in this question, if you are looking for alternative methods: https://stackoverflow.com/questions/2192316/extract-a-regular-expression-match/23901600
df <- data.frame(current_text = c("A","AB","ABCD","ABZ"),
previous_text = c("","A","AB","ABCD"),
change = c("I","MI","MMII","MMSD"))
df$insertion_idx <- gregexpr("I", df$change)
df$insertion_chars <- sapply(regmatches(df$current_text, df$insertion_idx),
paste, collapse="")
df
##   current_text previous_text change insertion_chars insertion_idx
## 1            A                    I               A             1
## 2           AB             A     MI               B             2
## 3         ABCD            AB   MMII              CD          3, 4
## 4          ABZ          ABCD   MMSD                            -1
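The approach works because regmatches() needs only start positions and match lengths, so match data computed on one string can be applied to another. Here is a minimal sketch of that mechanism on two of the rows above; the variable names are illustrative only.

```r
change <- c("MMII", "MMSD")
current_text <- c("ABCD", "ABZ")

# gregexpr() returns, per input string, the start position of every match
# plus a "match.length" attribute; -1 signals "no match".
m <- gregexpr("I", change)
idx_with_match <- as.integer(m[[1]])   # 3 4
idx_no_match   <- as.integer(m[[2]])   # -1

# Apply the positions found in `change` to `current_text`.
chars <- regmatches(current_text, m)
chars[[1]]   # "C" "D"
chars[[2]]   # character(0)
```

The -1 in the last row of insertion_idx is therefore gregexpr()'s "no match" marker rather than a real index, which is why insertion_chars is empty for that row.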
Answer 2
Score: 2
Try this alternative to thelatemail's (excellent) recommendation (which also works):
quux <- structure(list(current_text = c("A", "AB", "ABCD", "ABZ"),
                       previous_text = c("", "A", "AB", "ABCD"),
                       change = c("I", "MI", "MMII", "MMSD")),
                  row.names = c(NA, -4L),
                  class = c("tbl_df", "tbl", "data.frame"))
quux$insertion_idx <- lapply(strsplit(quux$change, ""), function(z) which(z == "I"))
quux$insertion_chars <- mapply(function(ctxt, idx) {
if (length(idx)) paste(substring(ctxt, idx, idx), collapse = "") else ""
}, quux$current_text, quux$insertion_idx)
quux
# # A tibble: 4 × 5
# current_text previous_text change insertion_idx insertion_chars
# <chr> <chr> <chr> <list> <chr>
# 1 A "" I <int [1]> "A"
# 2 AB "A" MI <int [1]> "B"
# 3 ABCD "AB" MMII <int [2]> "CD"
# 4 ABZ "ABCD" MMSD <int [0]> ""
Note that insertion_idx is a list-column with the indices you were looking for:
str(quux)
# tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
# $ current_text : chr [1:4] "A" "AB" "ABCD" "ABZ"
# $ previous_text : chr [1:4] "" "A" "AB" "ABCD"
# $ change : chr [1:4] "I" "MI" "MMII" "MMSD"
# $ insertion_idx :List of 4
# ..$ : int 1
# ..$ : int 2
# ..$ : int [1:2] 3 4
# ..$ : int(0)
# $ insertion_chars: Named chr [1:4] "A" "B" "CD" ""
# ..- attr(*, "names")= chr [1:4] "A" "AB" "ABCD" "ABZ"
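One caveat applies to both answers. A position in change equals a position in current_text only while no D precedes it, because a deletion occupies a slot in the edit string but has no character in the current text. The sketch below is a minimal illustration, assuming the adist() convention that M, S and I each consume one character of current_text while D consumes none. The helper names insertion_positions and extract_chars are hypothetical.

```r
# Hypothetical helper: positions of inserted characters in current_text.
insertion_positions <- function(change) {
  lapply(strsplit(change, ""), function(ops) {
    pos <- cumsum(ops != "D")   # position in current_text after each op
    pos[ops == "I"]
  })
}

# Hypothetical helper: paste together the characters at those positions.
extract_chars <- function(text, idx) {
  mapply(function(s, i) paste(substring(s, i, i), collapse = ""),
         text, idx, USE.NAMES = FALSE)
}

# "DMI" could describe "XA" -> "AB": the I is the 3rd op but the 2nd character.
idx <- insertion_positions(c("MMII", "DMI", "MMSD"))
res <- extract_chars(c("ABCD", "AB", "ABZ"), idx)
res
# "CD" "B" ""
```

For the data in the question no deletion comes before an insertion, so these adjusted positions match the indices produced by both answers.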