

Extract matching words from strings in order



If I have two strings that look like this:

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

Is there an easy way to compare the words from left to right, build a new string from the matching words, and stop at the first word that no longer matches, so that the output looks like this:

> "Here is a"

I don't want to find all matching words between the two strings, only the words that match in order from the start. So "words and stuff." is in both strings, but I don't want it to be selected.


Score: 3



Split the strings on whitespace, find the length of the shorter split, and take that many words from the head of each. Compare the corresponding words and append a FALSE so that a non-match always exists, even when every compared word matches. Then use which.min to find the position of the first non-match, take one fewer than that many words, and paste them back together.

L <- strsplit(c(x, y), " +")
wx <- which.min(c(do.call(`==`, lapply(L, head, min(lengths(L)))), FALSE))
paste(head(L[[1]], wx - 1), collapse = " ")
## [1] "Here is a"
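To see why the trailing FALSE matters, the same steps can be wrapped in a small function and run on a pair whose compared words all match. This is only a sketch: the name common_prefix_words and the extra test strings are assumptions made for illustration, not part of the answer.

```r
# Hypothetical wrapper around the steps above (name chosen for illustration)
common_prefix_words <- function(x, y) {
  L <- strsplit(c(x, y), " +")
  n <- min(lengths(L))
  # Word-by-word comparison over the length of the shorter split
  eq <- do.call(`==`, lapply(L, head, n))
  # The appended FALSE guarantees that which.min() finds a non-match
  wx <- which.min(c(eq, FALSE))
  paste(head(L[[1]], wx - 1), collapse = " ")
}

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

common_prefix_words(x, y)
## [1] "Here is a"
common_prefix_words("Here is a", "Here is a better test")
## [1] "Here is a"
common_prefix_words("No overlap", "Here is a")
## [1] ""
```

Without the appended FALSE, which.min() on an all-TRUE vector returns 1, so the second call would produce an empty string instead of the full shared prefix.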


Score: 1

This shows you the first n words that match:

xvec <- strsplit(x, " +")[[1]]
yvec <- strsplit(y, " +")[[1]]
(len <- min(c(length(xvec), length(yvec))))
# [1] 8
# which.max() on the raw cumsum returns the position of the largest count (8 here), not the first mismatch
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)))
list(xvec[1:i], yvec[1:i])
# [[1]]
# [1] "Here"   "is"     "a"      "test"   "of"     "words"  "and"    "stuff."
# [[2]]
# [1] "Here"   "is"     "a"      "better" "test"   "of"     "words"  "and"   
# The running count of mismatches becomes non-zero at the first differing word
cumsum(head(xvec, len) != head(yvec, len))
# [1] 0 0 0 1 2 3 4 5
# Comparing with > 0 makes which.max() return the position of the first mismatch (4)
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)) > 0)
list(xvec[1:(i-1)], yvec[1:(i-1)])
# [[1]]
# [1] "Here" "is"   "a"   
# [[2]]
# [1] "Here" "is"   "a"   

From here, we can easily derive the leading string:

paste(xvec[1:(i-1)], collapse = " ")
# [1] "Here is a"

and the remainder of the string with:

paste(xvec[-(1:(i-1))], collapse = " ")
# [1] "test of words and stuff."
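One thing to watch with this indexing: when the very first words differ, i is 1 and 1:(i-1) evaluates to c(1, 0) rather than an empty index, and when no words differ at all, which.max() also returns 1. A possible variant, sketched here on the assumption that counting the leading matches is acceptable, uses head() and tail() instead. The function name split_at_mismatch is made up for this example.

```r
# Hypothetical helper (name chosen for illustration): count the leading
# matching words, then split x into the matched part and the remainder.
split_at_mismatch <- function(x, y) {
  xvec <- strsplit(x, " +")[[1]]
  yvec <- strsplit(y, " +")[[1]]
  len <- min(length(xvec), length(yvec))
  # TRUE from the first differing word onwards
  mismatched <- cumsum(head(xvec, len) != head(yvec, len)) > 0
  n_match <- sum(!mismatched)
  list(
    leading = paste(head(xvec, n_match), collapse = " "),
    rest    = paste(tail(xvec, length(xvec) - n_match), collapse = " ")
  )
}

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

split_at_mismatch(x, y)
# $leading
# [1] "Here is a"
#
# $rest
# [1] "test of words and stuff."
```

Because head(xvec, 0) and tail(xvec, 0) both return empty vectors, the zero-match and all-match cases fall out without special handling.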


Score: 1



I wrote a function which will check the string and return the desired output:

library(purrr)  # provides map2_lgl()

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
z <- "This string doesn't match"


check_str <- function(inp, pat, delimiter = "\\s") {

  inp <- unlist(strsplit(inp, delimiter))
  pat <- unlist(strsplit(pat, delimiter))
  ln_diff <- length(inp) - length(pat)
  # Pad the shorter vector with empty strings so both have the same length
  if (ln_diff < 0) {
    inp <- append(inp, rep("", abs(ln_diff)))
  }
  if (ln_diff > 0) {
    pat <- append(pat, rep("", abs(ln_diff)))
  }
  idx <- map2_lgl(inp, pat, ~ identical(.x, .y))
  # The first run of TRUE values, if any, is the common start
  rle_idx <- rle(idx)
  if (rle_idx$values[1]) {
    idx2 <- seq_len(rle_idx$lengths[1])
  } else {
    idx2 <- 0
  }
  # The delimiter is reused literally when pasting, so pass a plain string such as " "
  paste0(inp[idx2], collapse = delimiter)
}

check_str(x, y, " ")
#> [1] "Here is a"
check_str(x, z, " ")
#> [1] ""

<sup>Created on 2023-02-13 with reprex v2.0.2</sup>


Score: 1



You could write a helper function to do the check for you:

common_start <- function(x, y) {
  i <- 1
  last <- NA
  # Walk both strings one character at a time
  while (i <= nchar(x) && i <= nchar(y)) {
    if (substr(x, i, i) == substr(y, i, i)) {
      # Remember the most recent matching space or punctuation character
      if (grepl("[[:space:][:punct:]]", substr(x, i, i), perl = TRUE)) {
        last <- i
      }
    } else {
      # Stop at the first mismatch
      break
    }
    i <- i + 1
  }
  if (!is.na(last)) {
    substr(x, 1, last - 1)
  } else {
    # No matching word boundary was found
    NA
  }
}

and use that with your sample strings:

common_start(x, y)
# [1] "Here is a"

The idea is to check every character, keeping track of the last non-word character that still matches. Using a while loop may not be fancy but it does mean you get to break early without processing the whole string as soon as a mismatch is found.
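A character-by-character scan that only records boundaries seen inside the overlap drops the last shared word when one string is a whole-word prefix of the other (for example "Here is a" against "Here is a better"). The sketch below is one possible variant, not the answer's code: it treats the end of the shorter string as a boundary too. The name common_start_words is hypothetical, and it assumes words are separated by single whitespace characters.

```r
# Hypothetical variant (name chosen for illustration); assumes single-space
# separators and treats only whitespace as a word boundary.
common_start_words <- function(x, y) {
  n <- min(nchar(x), nchar(y))
  last <- 0  # end position of the last fully matched word
  i <- 1
  while (i <= n && substr(x, i, i) == substr(y, i, i)) {
    if (grepl("[[:space:]]", substr(x, i, i))) last <- i - 1
    i <- i + 1
  }
  # If every compared character matched, the shorter string is a whole-word
  # prefix when the longer one ends there too or continues with whitespace.
  longer <- if (nchar(x) >= nchar(y)) x else y
  if (i > n && (nchar(longer) == n ||
                grepl("[[:space:]]", substr(longer, n + 1, n + 1)))) {
    last <- n
  }
  substr(x, 1, last)
}

common_start_words("Here is a test of words and stuff.",
                   "Here is a better test of words and stuff.")
# [1] "Here is a"
common_start_words("Here is a", "Here is a better")
# [1] "Here is a"
common_start_words("Here is", "Here isn't")
# [1] "Here"
```

This version returns an empty string rather than NA when nothing matches, which is a design choice of the sketch and may or may not suit your use case.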

  • Posted on February 14, 2023, 04:06:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75440694.html



