Get the indices of matching characters in one string and apply them to another string

Question

I have the data frame below, where each row represents a change in text. I use the adist() function to determine whether each character-level change is a match (M), insertion (I), substitution (S) or deletion (D).

I need to find the indices of every I in the change column (illustrated here in the insertion_idx column). Using those indices, I need to extract the corresponding characters from current_text (illustrated here in insertion_chars).

library(tibble)

df <- tibble(current_text = c("A","AB","ABCD","ABZ"),
             previous_text = c("","A","AB","ABCD"),
             change = c("I","MI","MMII","MMSD"),
             # list-column: one integer vector of indices per row
             insertion_idx = list(1L, 2L, 3:4, integer(0)),
             insertion_chars = c("A","B","CD",""))

I have tried splitting up the strings and comparing string differences, but this gets messy very quickly with real-world data. How do I accomplish the above task?
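For context, the change column described above can be produced with adist() itself. The following is only a minimal sketch of one way to do that, not necessarily how the asker built the column: it assumes the operation string is read from the "trafos" attribute that adist() attaches when counts = TRUE, and the pairwise mapply() wrapper is an illustrative choice.

```r
previous_text <- c("", "A", "AB", "ABCD")
current_text  <- c("A", "AB", "ABCD", "ABZ")

# adist() compares every x with every y, so call it once per row pair
# and read the edit-operation string from the "trafos" attribute.
change <- mapply(function(prev, cur) {
  d <- adist(prev, cur, counts = TRUE)
  as.vector(attr(d, "trafos"))
}, previous_text, current_text, USE.NAMES = FALSE)

change
```

Each element is a string of M, I, S and D codes describing how to turn the previous text into the current text.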


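Answer 1

Score: 3

Turning my comment about using gregexpr and regmatches into an answer.
Much of this procedure is very similar to the content of this question, https://stackoverflow.com/questions/2192316/extract-a-regular-expression-match/23901600, if you are looking for alternative methods.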

df <- data.frame(current_text = c("A","AB","ABCD","ABZ"),
                 previous_text = c("","A","AB","ABCD"),
                 change = c("I","MI","MMII","MMSD"))

# positions of every "I" in change (-1 means no match)
df$insertion_idx <- gregexpr("I", df$change)
# pull the characters at those positions out of current_text
df$insertion_chars <- sapply(regmatches(df$current_text, df$insertion_idx), 
                             paste, collapse="")
df
##  current_text previous_text change insertion_idx insertion_chars
##1            A                    I             1               A
##2           AB             A     MI             2               B
##3         ABCD            AB   MMII          3, 4              CD
##4          ABZ          ABCD   MMSD            -1
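To make the mechanics of this answer easier to follow, here is a small sketch on plain vectors. It shows that regmatches() takes the match positions found in one string and applies them to a different string, and one possible way to replace the -1 "no match" marker with an empty index vector. The variable names (m, idx, chars) are illustrative and not part of the original answer.

```r
change  <- c("I", "MI", "MMII", "MMSD")
current <- c("A", "AB", "ABCD", "ABZ")

# Match data: start positions plus match lengths, one list element per string
m <- gregexpr("I", change)

# Drop the -1 "no match" marker so each element holds only real positions
idx <- lapply(m, function(p) as.integer(p[p > 0]))

# regmatches() only needs positions and lengths, so the match data
# computed on `change` can be applied to `current`
chars <- regmatches(current, m)

idx
chars
```

Here idx holds plain integer vectors (empty when nothing matched), and chars holds the individual characters per row before they are pasted together.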

Answer 2

Score: 2

Try this alternative to thelatemail's (excellent) recommendation (which also works):

quux <- structure(list(current_text = c("A", "AB", "ABCD", "ABZ"), previous_text = c("", "A", "AB", "ABCD"), change = c("I", "MI", "MMII", "MMSD")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

quux$insertion_idx <- lapply(strsplit(quux$change, ""), function(z) which(z == "I"))
quux$insertion_chars <- mapply(function(ctxt, idx) {
  if (length(idx)) paste(substring(ctxt, idx, idx), collapse = "") else ""
}, quux$current_text, quux$insertion_idx)
quux
# # A tibble: 4 × 5
#   current_text previous_text change insertion_idx insertion_chars
#   <chr>        <chr>         <chr>  <list>        <chr>          
# 1 A            ""            I      <int [1]>     "A"            
# 2 AB           "A"           MI     <int [1]>     "B"            
# 3 ABCD         "AB"          MMII   <int [2]>     "CD"           
# 4 ABZ          "ABCD"        MMSD   <int [0]>     ""

Note that insertion_idx is a list-column with the indices you were looking for:

str(quux)
# tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
#  $ current_text   : chr [1:4] "A" "AB" "ABCD" "ABZ"
#  $ previous_text  : chr [1:4] "" "A" "AB" "ABCD"
#  $ change         : chr [1:4] "I" "MI" "MMII" "MMSD"
#  $ insertion_idx  :List of 4
#   ..$ : int 1
#   ..$ : int 2
#   ..$ : int [1:2] 3 4
#   ..$ : int(0) 
#  $ insertion_chars: Named chr [1:4] "A" "B" "CD" ""
#   ..- attr(*, "names")= chr [1:4] "A" "AB" "ABCD" "ABZ"
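The str() output shows that insertion_chars ends up as a named vector, because mapply() uses its first character argument as names by default. If that is unwanted, a possible variant is sketched below on plain vectors. It assumes USE.NAMES = FALSE is acceptable, and it drops the length check because substring() returns character(0) for an empty index vector, which paste(collapse = "") turns into "".

```r
change       <- c("I", "MI", "MMII", "MMSD")
current_text <- c("A", "AB", "ABCD", "ABZ")

# Split each change string into single characters and locate the "I"s
insertion_idx <- lapply(strsplit(change, ""), function(z) which(z == "I"))

# Extract one character per index and glue them together;
# USE.NAMES = FALSE keeps the result an unnamed character vector
insertion_chars <- mapply(function(ctxt, idx) {
  paste(substring(ctxt, idx, idx), collapse = "")
}, current_text, insertion_idx, USE.NAMES = FALSE)

insertion_chars
```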

huangapple
  • Posted on May 25, 2023 at 06:50:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76327866.html