Extract matching words from strings in order

Question

If I have two strings that look like this:

    x <- "Here is a test of words and stuff."
    y <- "Here is a better test of words and stuff."

Is there an easy way to check the words from left to right, build a new string from the matching words, and stop as soon as the words no longer match? The output would look like:

    > "Here is a"

I don't want to find all matching words between the two strings, only the words that match in order. So "words and stuff." is in both strings, but I don't want it to be selected.
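To make the distinction concrete, here is a small sketch of what is not wanted. It assumes the strings are split on spaces; the names `xw` and `yw` are introduced here for illustration only. A plain set intersection picks up every shared word regardless of position, so it cannot stop at the first mismatch.

```r
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

# Illustrative names (not from the question): the words of each string
xw <- strsplit(x, " +")[[1]]
yw <- strsplit(y, " +")[[1]]

# intersect() ignores word positions, so it returns every word of x,
# because each of them also occurs somewhere in y
shared <- intersect(xw, yw)
paste(shared, collapse = " ")
```

Every word of `x` appears somewhere in `y`, so this returns the whole of `x` rather than only the leading run of matching words.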

Answer 1

Score: 3

Split the strings into words, find the smaller of the two word counts, and take that many words from the head of each split. Compare the words position by position and append a FALSE so that a non-match always exists, even when every compared word matches. Then use which.min to find the position of the first non-match, take that number minus 1 of the words, and paste them back together.

    # Split both strings into words
    L <- strsplit(c(x, y), " +")
    # Compare word by word; the trailing FALSE guarantees a first non-match
    wx <- which.min(c(do.call(`==`, lapply(L, head, min(lengths(L)))), FALSE))
    # Keep the words before the first non-match
    paste(head(L[[1]], wx - 1), collapse = " ")
    ## [1] "Here is a"
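The steps above can be wrapped in a small helper, which makes the role of the appended FALSE easier to see. This is only a sketch of the same approach: the function name `common_leading_words` and the variable names inside it are assumptions, not part of the answer.

```r
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

# Hypothetical wrapper around the which.min approach (name is an assumption)
common_leading_words <- function(a, b) {
  L <- strsplit(c(a, b), " +")
  n <- min(lengths(L))                          # words available in both
  same <- head(L[[1]], n) == head(L[[2]], n)    # position-wise comparison
  # For a logical vector, which.min() returns the index of the first FALSE.
  # The appended FALSE makes that index n + 1 when all n words match.
  first_diff <- which.min(c(same, FALSE))
  paste(head(L[[1]], first_diff - 1), collapse = " ")
}

common_leading_words(x, y)          # the leading run shared by x and y
common_leading_words(x, x)          # identical inputs: the whole string
common_leading_words("a b", "c d")  # no shared leading word: ""
```

Without the sentinel, two strings whose compared words all match would give `which.min` nothing but TRUE values; it would then return 1 and the result would wrongly be empty.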

Answer 2

Score: 1

This shows you the first n words that match:

    xvec <- strsplit(x, " +")[[1]]
    yvec <- strsplit(y, " +")[[1]]
    (len <- min(c(length(xvec), length(yvec))))
    # [1] 8
    # which.max() on the raw mismatch count finds where the count is largest
    # (the last word here), which is not the position we want
    i <- which.max(cumsum(head(xvec, len) != head(yvec, len)))
    list(xvec[1:i], yvec[1:i])
    # [[1]]
    # [1] "Here" "is" "a" "test" "of" "words" "and" "stuff."
    # [[2]]
    # [1] "Here" "is" "a" "better" "test" "of" "words" "and"
    # Running count of mismatches at each word position
    cumsum(head(xvec, len) != head(yvec, len))
    # [1] 0 0 0 1 2 3 4 5
    # Comparing the count with 0 makes which.max() return the first mismatch.
    # Note: if nothing differs, which.max() returns 1, so check any() first.
    i <- which.max(cumsum(head(xvec, len) != head(yvec, len)) > 0)
    # seq_len(i - 1) is empty when i is 1, unlike 1:(i-1), which gives c(1, 0)
    list(xvec[seq_len(i - 1)], yvec[seq_len(i - 1)])
    # [[1]]
    # [1] "Here" "is" "a"
    # [[2]]
    # [1] "Here" "is" "a"

From here, we can easily derive the leading string:

    paste(xvec[seq_len(i - 1)], collapse = " ")
    # [1] "Here is a"

and the remainder of the string with:

    # Keep the words from position i onward (also correct when i is 1)
    paste(xvec[seq_along(xvec) >= i], collapse = " ")
    # [1] "test of words and stuff."
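The pieces above can be combined into one function that returns both the leading match and the remainder. The following is a sketch under stated assumptions: the name `split_common_prefix` and the list element names `common` and `rest` are hypothetical. It also handles the two edge cases the interactive steps skip, namely no mismatch at all and a mismatch on the very first word.

```r
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

# Hypothetical helper (name and return shape are assumptions)
split_common_prefix <- function(a, b) {
  avec <- strsplit(a, " +")[[1]]
  bvec <- strsplit(b, " +")[[1]]
  len <- min(length(avec), length(bvec))
  differs <- head(avec, len) != head(bvec, len)
  # For a logical vector, which.max() gives the index of the first TRUE.
  # If nothing differs, all `len` compared words belong to the prefix.
  n_same <- if (any(differs)) which.max(differs) - 1 else len
  list(
    common = paste(head(avec, n_same), collapse = " "),
    rest   = paste(tail(avec, length(avec) - n_same), collapse = " ")
  )
}

res <- split_common_prefix(x, y)
res$common   # leading words shared with y
res$rest     # what is left of x after that
```

Using `head()` and `tail()` with a count avoids the `1:(i-1)` indexing pitfall, because a count of zero simply yields an empty vector.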

Answer 3

Score: 1

I wrote a function that checks the strings and returns the desired output:

    x <- "Here is a test of words and stuff."
    y <- "Here is a better test of words and stuff."
    z <- "This string doesn't match"

    library(purrr)

    check_str <- function(inp, pat, delimiter = "\\s") {
      inp <- unlist(strsplit(inp, delimiter))
      pat <- unlist(strsplit(pat, delimiter))
      # Pad the shorter vector with "" so both have the same length
      ln_diff <- length(inp) - length(pat)
      if (ln_diff < 0) {
        inp <- append(inp, rep("", abs(ln_diff)))
      }
      if (ln_diff > 0) {
        pat <- append(pat, rep("", abs(ln_diff)))
      }
      # TRUE where the words at the same position are identical
      idx <- map2_lgl(inp, pat, ~ identical(.x, .y))
      # The first run of the run-length encoding is the leading match, if any
      rle_idx <- rle(idx)
      if (rle_idx$values[1]) {
        idx2 <- seq_len(rle_idx$lengths[1])
      } else {
        idx2 <- 0
      }
      paste0(inp[idx2], collapse = delimiter)
    }

    check_str(x, y, " ")
    #> [1] "Here is a"
    check_str(x, z, " ")
    #> [1] ""

Created on 2023-02-13 with reprex v2.0.2
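For readers who would rather not depend on purrr, the same idea can be sketched in base R. This is a variant rather than the answer's code: the name `check_str_base` is an assumption, it swaps the `rle()` step for `sum(cumprod(...))`, and it truncates to the shorter word count instead of padding with empty strings. It also splits on a literal delimiter (`fixed = TRUE`), because a regex such as the answer's default `"\\s"` would be inserted literally if it were reused as the `collapse` value.

```r
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
z <- "This string doesn't match"

# Hypothetical base-R variant (name is an assumption)
check_str_base <- function(inp, pat, delimiter = " ") {
  inp <- strsplit(inp, delimiter, fixed = TRUE)[[1]]
  pat <- strsplit(pat, delimiter, fixed = TRUE)[[1]]
  n <- min(length(inp), length(pat))
  same <- inp[seq_len(n)] == pat[seq_len(n)]
  # cumprod() stays at 1 until the first FALSE and is 0 afterwards,
  # so its sum is the length of the leading run of matches
  n_same <- sum(cumprod(same))
  paste(inp[seq_len(n_same)], collapse = delimiter)
}

check_str_base(x, y)
check_str_base(x, z)
```

The two versions can differ when the input contains empty tokens, for example with doubled delimiters, since the original pads with `""` while this sketch only compares positions present in both strings.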

Answer 4

Score: 1

You could write a helper function to do the check for you:

    common_start <- function(x, y) {
      i <- 1
      last <- NA  # position of the last matching space or punctuation mark
      # Stop at the end of the shorter string (the second bound must be y)
      while (i <= nchar(x) && i <= nchar(y)) {
        if (substr(x, i, i) == substr(y, i, i)) {
          if (grepl("[[:space:][:punct:]]", substr(x, i, i), perl = TRUE)) {
            last <- i
          }
        } else {
          break
        }
        i <- i + 1
      }
      if (!is.na(last)) {
        # Drop the boundary character itself
        substr(x, 1, last - 1)
      } else {
        NA
      }
    }

and use that with your sample strings:

    common_start(x, y)
    # [1] "Here is a"

The idea is to check every character while keeping track of the last non-word character (whitespace or punctuation) that still matches. A while loop may not be fancy, but it lets you break as soon as a mismatch is found instead of processing the whole string.
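The same character-level idea can also be sketched without an explicit loop. This is an assumed variant, not the answer's method: the name `common_start_vec` is hypothetical, and it gives up the early exit because it compares all characters at once before trimming back to the last matching boundary.

```r
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

# Hypothetical vectorised variant (name is an assumption)
common_start_vec <- function(x, y) {
  xc <- strsplit(x, "")[[1]]
  yc <- strsplit(y, "")[[1]]
  n <- min(length(xc), length(yc))
  same <- xc[seq_len(n)] == yc[seq_len(n)]
  # Number of leading characters that match: one less than the position
  # of the first FALSE, or n when every compared character matches
  n_same <- match(FALSE, same, nomatch = n + 1L) - 1L
  prefix <- xc[seq_len(n_same)]
  # Positions of whitespace or punctuation inside the common prefix
  boundary <- which(grepl("[[:space:][:punct:]]", prefix))
  if (length(boundary) == 0) return(NA_character_)
  paste(prefix[seq_len(max(boundary) - 1)], collapse = "")
}

common_start_vec(x, y)
# A mismatch in the middle of a word falls back to the previous boundary
common_start_vec("Here is another test", "Here is a test")
```

The second call shows why the boundary tracking matters: the characters match up to "Here is a", but the mismatch falls inside the word "another", so only "Here is" is returned. As with the loop version, a final word is kept only if a matching space or punctuation mark follows it.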

huangapple
  • Published on 2023-02-14 04:06:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75440694.html