May 25, 2023 06:50:23

Get character indices match in one string and apply to another string

Question

I have the dataframe below, where each row represents changes in text. I then use the adist() function to extract whether the change is a match (M), insertion (I), substitution (S) or deletion (D).

I need to find the indices of every I in the change column (illustrated here in the insertion_idx column). Using those indices, I need to extract the corresponding characters from current_text (illustrated here in insertion_chars).

library(tibble)

df <- tibble(current_text = c("A","AB","ABCD","ABZ"),
             previous_text = c("","A","AB","ABCD"),
             change = c("I","MI","MMII","MMSD"),
             insertion_idx = list(1, 2, c(3, 4), integer(0)),  # list-column: one index vector per row
             insertion_chars = c("A","B","CD",""))

I have tried splitting up strings and comparing string differences, but this gets very messy very fast with real-world data. How do I accomplish the above task?
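
For context, a change string like the ones above can be obtained from adist() directly. This is only a minimal sketch: it assumes the comparison runs from previous_text to current_text and that the change column holds the "trafos" attribute that adist() attaches when counts = TRUE. The helper name edit_ops is made up for illustration.

```r
# Hypothetical helper: edit-operation string that turns `prev` into `cur`.
edit_ops <- function(prev, cur) {
  # counts = TRUE makes adist() attach a "trafos" attribute: a matrix of
  # strings built from M (match), I (insertion), D (deletion), S (substitution).
  d <- utils::adist(prev, cur, counts = TRUE)
  attr(d, "trafos")[1, 1]
}

previous_text <- c("A", "AB")
current_text  <- c("AB", "ABCD")

# Compare each row's own pair of strings rather than all pairs.
change <- mapply(edit_ops, previous_text, current_text, USE.NAMES = FALSE)
change
```

Calling adist() on the two full vectors would instead compute every previous/current combination, which is why the sketch goes pair by pair.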

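Answer 1

Score: 3

Turning my comment about using gregexpr and regmatches into an answer.
A lot of this procedure is very similar to the content in this question - https://stackoverflow.com/questions/2192316/extract-a-regular-expression-match/23901600 - if you are looking for alternative methods.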

df <- data.frame(current_text = c("A","AB","ABCD","ABZ"),
                 previous_text = c("","A","AB","ABCD"),
                 change = c("I","MI","MMII","MMSD"))
df$insertion_idx <- gregexpr("I", df$change)
df$insertion_chars <- sapply(regmatches(df$current_text, df$insertion_idx), 
                             paste, collapse="")
df
##  current_text previous_text change insertion_chars insertion_idx
##1            A                    I               A             1
##2           AB             A     MI               B             2
##3         ABCD            AB   MMII              CD          3, 4
##4          ABZ          ABCD   MMSD                            -1
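
Why match data from one string can be applied to another may not be obvious, so here is a minimal single-row sketch of what the code above appears to rely on: regmatches() only uses the start positions and match lengths stored in the gregexpr() result. The variable names are illustrative.

```r
change       <- "MMII"
current_text <- "ABCD"

# gregexpr() returns the match start positions plus a "match.length" attribute.
m <- gregexpr("I", change)
m[[1]]  # start positions 3 and 4, each match one character long

# regmatches() cuts substrings at those positions, so the same match data
# can be pointed at a different string.
chars <- regmatches(current_text, m)[[1]]
paste(chars, collapse = "")

# Without any "I", gregexpr() reports -1 and regmatches() yields character(0),
# which paste(collapse = "") turns into an empty string.
no_match <- regmatches("ABZ", gregexpr("I", "MMSD"))[[1]]
paste(no_match, collapse = "")
```

This is also why row 4 shows -1 in insertion_idx but an empty string in insertion_chars.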

Answer 2

Score: 2

Try this alternative to thelatemail's (excellent) recommendation (which also works):

quux <- structure(list(current_text = c("A", "AB", "ABCD", "ABZ"), previous_text = c("", "A", "AB", "ABCD"), change = c("I", "MI", "MMII", "MMSD")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
quux$insertion_idx <- lapply(strsplit(quux$change, ""), function(z) which(z == "I"))
quux$insertion_chars <- mapply(function(ctxt, idx) {
  if (length(idx)) paste(substring(ctxt, idx, idx), collapse = "") else ""
}, quux$current_text, quux$insertion_idx)
quux
# # A tibble: 4 × 5
#   current_text previous_text change insertion_idx insertion_chars
#   <chr>        <chr>         <chr>  <list>        <chr>          
# 1 A            ""            I      <int [1]>     "A"            
# 2 AB           "A"           MI     <int [1]>     "B"            
# 3 ABCD         "AB"          MMII   <int [2]>     "CD"           
# 4 ABZ          "ABCD"        MMSD   <int [0]>     ""

Note that insertion_idx is a list-column with the indices you were looking for:

str(quux)
# tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
#  $ current_text   : chr [1:4] "A" "AB" "ABCD" "ABZ"
#  $ previous_text  : chr [1:4] "" "A" "AB" "ABCD"
#  $ change         : chr [1:4] "I" "MI" "MMII" "MMSD"
#  $ insertion_idx  :List of 4
#   ..$ : int 1
#   ..$ : int 2
#   ..$ : int [1:2] 3 4
#   ..$ : int(0) 
#  $ insertion_chars: Named chr [1:4] "A" "B" "CD" ""
#   ..- attr(*, "names")= chr [1:4] "A" "AB" "ABCD" "ABZ"
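
One caveat that may matter with real-world data: positions in change line up with current_text only as long as no D comes before an I, because a deletion consumes a character of previous_text but none of current_text. The following is a sketch of one possible workaround, assuming change describes the edit from previous_text to current_text. The helper name insertion_chars and the fifth row (DMMI, for example going from XAB to ABC) are made up for illustration.

```r
# Hypothetical helper: characters of `current_text` flagged "I" in `change`.
insertion_chars <- function(current_text, change) {
  # Drop "D" so the remaining operations line up one-to-one with the
  # characters of current_text.
  ops   <- strsplit(gsub("D", "", change, fixed = TRUE), "")
  chars <- strsplit(current_text, "")
  # USE.NAMES = FALSE avoids the "Named chr" result seen in str(quux).
  mapply(function(ch, op) paste(ch[op == "I"], collapse = ""),
         chars, ops, USE.NAMES = FALSE)
}

res <- insertion_chars(c("A", "AB", "ABCD", "ABZ", "ABC"),
                       c("I", "MI", "MMII", "MMSD", "DMMI"))
res
```

For the four original rows this gives the same characters as above, and for the made-up DMMI row it returns "C" rather than reading past the end of the string.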
