February 14, 2023, 04:06:53

Extract matching words from strings in order

Question

If I have two strings that look like this:

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

Is there an easy way to check the words from left to right and build a new string from the matching words, stopping as soon as the words no longer match? The output would look like:

> "Here is a"

I don't want to find all matching words between the two strings, only the words that match in order. So "words and stuff." is in both strings, but I don't want it to be selected.

Answer 1

Score: 3

Split the strings, compute the minimum of the lengths of the two splits, take that number of words from the head of each, and compare them element-wise. Append a FALSE to the comparison result so that a non-match is guaranteed to exist even when every compared word matches. Then use which.min to find the first non-match, take that position minus 1 words, and paste them back together.

# Split both strings on runs of spaces
L <- strsplit(c(x, y), " +")
# Position of the first non-matching word (the trailing FALSE is a sentinel)
wx <- which.min(c(do.call(`==`, lapply(L, head, min(lengths(L)))), FALSE))
paste(head(L[[1]], wx - 1), collapse = " ")
## [1] "Here is a"
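
To see why the FALSE is appended, it can help to look at the intermediate values. The following is a minimal walkthrough of the same steps; the names n, heads, eq, and the all_match vector are introduced here only for illustration and are not part of the answer.

```r
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

L <- strsplit(c(x, y), " +")   # list of two word vectors
n <- min(lengths(L))           # 8: the shorter string has 8 words
heads <- lapply(L, head, n)    # first 8 words of each string
eq <- do.call(`==`, heads)     # element-wise comparison of the two heads
eq
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

wx <- which.min(c(eq, FALSE))  # index of the first FALSE
result <- paste(head(L[[1]], wx - 1), collapse = " ")

# Without the sentinel, an all-TRUE comparison reports position 1,
# which would wrongly select zero words.
all_match <- c(TRUE, TRUE, TRUE)
no_sentinel <- which.min(all_match)              # 1
with_sentinel <- which.min(c(all_match, FALSE))  # 4
```

With the sentinel in place, two strings whose compared words all match yield wx - 1 equal to the number of compared words, so the whole shared head is returned.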

Answer 2

Score: 1

This shows you the first n words that match:

xvec <- strsplit(x, " +")[[1]]
yvec <- strsplit(y, " +")[[1]]
(len <- min(c(length(xvec), length(yvec))))
# [1] 8
# First attempt: which.max() on the running mismatch count returns the
# position of the largest count (the last word), not the first mismatch
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)))
list(xvec[1:i], yvec[1:i])
# [[1]]
# [1] "Here"   "is"     "a"      "test"   "of"     "words"  "and"    "stuff."
# [[2]]
# [1] "Here"   "is"     "a"      "better" "test"   "of"     "words"  "and"   
cumsum(head(xvec, len) != head(yvec, len))
# [1] 0 0 0 1 2 3 4 5
# Comparing the running count with 0 makes which.max() return the first mismatch
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)) > 0)
list(xvec[1:(i-1)], yvec[1:(i-1)])
# [[1]]
# [1] "Here" "is"   "a"   
# [[2]]
# [1] "Here" "is"   "a"

From here, we can easily derive the leading string:

paste(xvec[1:(i-1)], collapse = " ")
# [1] "Here is a"

and the remaining string with:

paste(xvec[-(1:(i-1))], collapse = " ")
# [1] "test of words and stuff."
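
One caveat with indexing by 1:(i-1): when the very first word differs, or when no word differs at all, which.max returns 1, and 1:0 counts down to c(1, 0) rather than producing an empty index. The sketch below is one possible way to guard against that, assuming a hypothetical helper named split_common_lead (not part of the answer) that uses match(TRUE, ...) to locate the first mismatch and a logical mask instead of 1:(i-1).

```r
# Hypothetical helper: returns the shared leading words and the rest of x.
split_common_lead <- function(x, y, sep = " +") {
  xvec <- strsplit(x, sep)[[1]]
  yvec <- strsplit(y, sep)[[1]]
  len <- min(length(xvec), length(yvec))
  mismatch <- head(xvec, len) != head(yvec, len)
  first_bad <- match(TRUE, mismatch)            # NA when nothing differs
  k <- if (is.na(first_bad)) len else first_bad - 1
  keep <- seq_along(xvec) <= k                  # logical mask, safe when k is 0
  list(lead = paste(xvec[keep], collapse = " "),
       rest = paste(xvec[!keep], collapse = " "))
}

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
z <- "This string doesn't match"

res_xy <- split_common_lead(x, y)  # lead is "Here is a"
res_xz <- split_common_lead(x, z)  # first word differs: lead is ""
res_xx <- split_common_lead(x, x)  # nothing differs: lead is all of x
```

The logical mask is used because a negative index built from an empty vector, such as xvec[-seq_len(0)], selects nothing instead of everything.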

Answer 3

Score: 1

I wrote a function which will check the strings and return the desired output:

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
z <- "This string doesn't match"
library(purrr)
# Note: delimiter is used both as the split pattern and as the join string,
# so pass a literal such as " " rather than relying on the regex default.
check_str <- function(inp, pat, delimiter = "\\s") {
  inp <- unlist(strsplit(inp, delimiter))
  pat <- unlist(strsplit(pat, delimiter))
  ln_diff <- length(inp) - length(pat)
  
  # Pad the shorter vector with empty strings so both have the same length
  if (ln_diff < 0) {
    inp <- append(inp, rep("", abs(ln_diff)))
  }
  if (ln_diff > 0) {
    pat <- append(pat, rep("", abs(ln_diff)))
  }
  
  # Word-by-word comparison, then run-length encode the result
  idx <- map2_lgl(inp, pat, ~ identical(.x, .y))
  rle_idx <- rle(idx)
  
  # Keep the first run only if it is a run of matches
  if (rle_idx$values[1]) {
    idx2 <- seq_len(rle_idx$lengths[1])
  } else {
    idx2 <- 0
  }
  
  paste0(inp[idx2], collapse = delimiter)
}
check_str(x, y, " ")
#> [1] "Here is a"
check_str(x, z, " ")
#> [1] ""

<sup>Created on 2023-02-13 with reprex v2.0.2</sup>
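
The function hinges on what rle returns for the comparison vector. As a rough illustration in base R only (using == in place of purrr's map2_lgl, and with the padded word vectors written out by hand as an assumption about what the function builds internally), the run-length encoding for the example strings looks like this:

```r
# Word vectors after padding the shorter one with "" (written out by hand)
inp <- c("Here", "is", "a", "test", "of", "words", "and", "stuff.", "")
pat <- c("Here", "is", "a", "better", "test", "of", "words", "and", "stuff.")

idx <- inp == pat   # TRUE for the first three words, FALSE afterwards
r <- rle(idx)
r$lengths
# [1] 3 6
r$values
# [1]  TRUE FALSE

# The first run is a run of matches, so its length is the number of
# leading words to keep; otherwise nothing is kept.
lead_len <- if (r$values[1]) r$lengths[1] else 0
lead <- paste(inp[seq_len(lead_len)], collapse = " ")
```

Only the first run matters here: later runs of TRUE would correspond to words that happen to line up again after the first mismatch, which the question explicitly excludes.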

Answer 4

Score: 1

You could write a helper function to do the check for you:

common_start <- function(x, y) {
  i <- 1
  last <- NA   # position of the last matching space or punctuation character
  # Stop at the end of whichever string is shorter
  while (i <= nchar(x) && i <= nchar(y)) {
    if (substr(x, i, i) == substr(y, i, i)) {
      if (grepl("[[:space:][:punct:]]", substr(x, i, i), perl = TRUE)) {
        last <- i
      }
    } else {
      break
    }
    i <- i + 1
  }
  if (!is.na(last)) {
    substr(x, 1, last - 1)
  } else {
    NA
  }
}

and use that with your sample strings:

common_start(x, y)
# [1] "Here is a"

The idea is to check every character, keeping track of the last non-word character that still matches. Using a while loop may not be fancy, but it does mean you get to break early, as soon as a mismatch is found, without processing the whole string.
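
For comparison, the same character-level idea can be sketched without an explicit loop. This is only an alternative illustration, not the answer's method: the name common_start_vec is hypothetical, and it gives up the early exit in exchange for vectorized operations, using cumprod to find the length of the shared character prefix.

```r
# Hypothetical vectorized variant of the character-level check.
common_start_vec <- function(x, y) {
  n <- min(nchar(x), nchar(y))
  xs <- strsplit(substr(x, 1, n), "")[[1]]   # characters of x, up to n
  ys <- strsplit(substr(y, 1, n), "")[[1]]   # characters of y, up to n
  k <- sum(cumprod(xs == ys))                # length of the shared character prefix
  shared <- head(xs, k)
  # Positions of spaces or punctuation inside the shared prefix
  boundary <- which(grepl("[[:space:][:punct:]]", shared))
  if (length(boundary) == 0) return(NA_character_)
  substr(x, 1, max(boundary) - 1)
}

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

out_xy <- common_start_vec(x, y)         # "Here is a"
out_none <- common_start_vec("abc", "xyz")  # NA: nothing in common
```

Like the loop version, this cuts the result back to the last matching space or punctuation character, so a partially matching word at the end of the shared prefix is dropped.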
