Get character indices match in one string and apply to another string

Question

I have the dataframe below, where each row represents changes in text. I then use the adist() function to extract whether the change is a match (M), insertion (I), substitution (S) or deletion (D).

I need to find the indices of every I in the change column (illustrated here in the insertion_idx column). Using those indices, I need to extract the corresponding characters from current_text (illustrated here in insertion_chars).

    library(tibble)

    df <- tibble(current_text = c("A", "AB", "ABCD", "ABZ"),
                 previous_text = c("", "A", "AB", "ABCD"),
                 change = c("I", "MI", "MMII", "MMSD"),
                 # desired output; a list-column so a row can hold several indices
                 insertion_idx = list(1L, 2L, 3:4, integer(0)),
                 insertion_chars = c("A", "B", "CD", ""))

I have tried splitting up strings and comparing string differences, but this gets very messy very fast with real-world data. How do I accomplish the above task?
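
For context, the change strings can be produced with base R's adist() by requesting counts and reading the "trafos" attribute. The sketch below assumes previous_text is passed as x and current_text as y, so that I marks characters present only in current_text. The question does not show this step, so the exact call is an assumption.

```r
previous_text <- c("", "A", "AB", "ABCD")
current_text  <- c("A", "AB", "ABCD", "ABZ")

# adist() compares every x with every y, so call it once per row pair
# and read the "trafos" attribute, which holds the M/I/D/S edit string.
change <- mapply(function(prev, curr) {
  attr(adist(prev, curr, counts = TRUE), "trafos")[1, 1]
}, previous_text, current_text, USE.NAMES = FALSE)

change
```

For this data the question reports the edit strings "I", "MI", "MMII" and "MMSD".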

Answer 1

Score: 3

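Turning my comment about using gregexpr and regmatches into an answer.
Much of this procedure is very similar to the content of this question, in case you are looking for alternative methods: https://stackoverflow.com/questions/2192316/extract-a-regular-expression-match/23901600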

    df <- data.frame(current_text = c("A", "AB", "ABCD", "ABZ"),
                     previous_text = c("", "A", "AB", "ABCD"),
                     change = c("I", "MI", "MMII", "MMSD"))
    # positions of every "I" in change (-1 where there is none)
    df$insertion_idx <- gregexpr("I", df$change)
    # reuse those positions to pull characters out of current_text
    df$insertion_chars <- sapply(regmatches(df$current_text, df$insertion_idx),
                                 paste, collapse = "")
    df
    ##   current_text previous_text change insertion_idx insertion_chars
    ## 1            A                    I             1               A
    ## 2           AB             A     MI             2               B
    ## 3         ABCD            AB   MMII          3, 4              CD
    ## 4          ABZ          ABCD   MMSD            -1
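
To see why match data from one string can be applied to another, here is a minimal sketch on a single pair of values taken from the data above. The variable names m, idx and chars are only illustrative.

```r
change       <- "MMII"
current_text <- "ABCD"

# gregexpr() returns the start position of every "I", together with a
# "match.length" attribute (1 for each single-character match).
m <- gregexpr("I", change)
idx <- as.integer(m[[1]])
idx    # 3 4

# regmatches() only uses those positions and lengths, so the same match
# object can be applied to a different string.
chars <- regmatches(current_text, m)[[1]]
chars  # "C" "D"
```

When there is no I at all, gregexpr() reports -1 and regmatches() returns an empty character vector, which paste(collapse = "") turns into "".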

Answer 2

Score: 2

Try this alternative to thelatemail's (excellent) recommendation (which also works):

    quux <- structure(
      list(current_text = c("A", "AB", "ABCD", "ABZ"),
           previous_text = c("", "A", "AB", "ABCD"),
           change = c("I", "MI", "MMII", "MMSD")),
      row.names = c(NA, -4L),
      class = c("tbl_df", "tbl", "data.frame"))
    # positions of "I" within each change string (a list-column)
    quux$insertion_idx <- lapply(strsplit(quux$change, ""), function(z) which(z == "I"))
    # pull the characters at those positions out of current_text
    quux$insertion_chars <- mapply(function(ctxt, idx) {
      if (length(idx)) paste(substring(ctxt, idx, idx), collapse = "") else ""
    }, quux$current_text, quux$insertion_idx)
    quux
    # # A tibble: 4 × 5
    #   current_text previous_text change insertion_idx insertion_chars
    #   <chr>        <chr>         <chr>  <list>        <chr>
    # 1 A            ""            I      <int [1]>     "A"
    # 2 AB           "A"           MI     <int [1]>     "B"
    # 3 ABCD         "AB"          MMII   <int [2]>     "CD"
    # 4 ABZ          "ABCD"        MMSD   <int [0]>     ""

Note that insertion_idx is a list-column with the indices you were looking for:

    str(quux)
    # tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
    #  $ current_text   : chr [1:4] "A" "AB" "ABCD" "ABZ"
    #  $ previous_text  : chr [1:4] "" "A" "AB" "ABCD"
    #  $ change         : chr [1:4] "I" "MI" "MMII" "MMSD"
    #  $ insertion_idx  :List of 4
    #   ..$ : int 1
    #   ..$ : int 2
    #   ..$ : int [1:2] 3 4
    #   ..$ : int(0)
    #  $ insertion_chars: Named chr [1:4] "A" "B" "CD" ""
    #   ..- attr(*, "names")= chr [1:4] "A" "AB" "ABCD" "ABZ"
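
One caveat that may matter with real-world data: a position in change equals a position in current_text only while no D comes before it, because a deletion refers to a character that exists only in previous_text. The sketch below is not from either answer. It assumes the edit strings follow the M/I/S/D convention described in the question, and the helper name insertion_chars_skip_d is hypothetical.

```r
# Hypothetical helper: map each "I" to its position in current_text by
# counting only the operations that consume a character of current_text.
insertion_chars_skip_d <- function(current_text, change) {
  ops <- strsplit(change, "")
  mapply(function(txt, op) {
    # M, S and I each advance one character in current_text; D does not
    pos <- cumsum(op != "D")[op == "I"]
    paste(substring(txt, pos, pos), collapse = "")
  }, current_text, ops, USE.NAMES = FALSE)
}

# Same result as above when no deletion precedes an insertion
res_basic <- insertion_chars_skip_d(c("A", "AB", "ABCD", "ABZ"),
                                    c("I", "MI", "MMII", "MMSD"))
res_basic    # "A" "B" "CD" ""

# Constructed case: "XAB" -> "ABC" written as D, M, M, I
res_shifted <- insertion_chars_skip_d("ABC", "DMMI")
res_shifted  # "C"
```

In the constructed case the I sits at position 4 of change, but the inserted character is at position 3 of current_text, so indexing current_text with 4 directly would return an empty string.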

huangapple
  • Published on May 25, 2023, 06:50:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76327866.html