June 29, 2023 22:14:16

Quanteda and stringr in R: (Correct) regex cannot be parsed

Question

I want to run a regex search using the quanteda and stringr libraries, but I continue to receive errors. My goal is to match the patterns (VP (V.. ...) using the regex \(VP\h+\(V\w*\h+\w*\). Here is a MWE:

library(quanteda)
library(dplyr)
library(stringr)
text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
kwic_regex <- kwic(
  # define text
  text, 
  # define search pattern
  "\(VP\h+\(V\w*\h+\w*\)", 
  window = 20, 
  # define valuetype
  valuetype = "regex") %>%
  # make it a data frame
  as.data.frame()

And this is the error message:

Error: '\(' is an unrecognized escape in character string starting ""\("

I find it puzzling because the regex should be correct (cf. https://regex101.com/r/3hbZ0R/1). I've also tried escaping the escapes (e.g., \\() to no avail. I would really appreciate any ideas on how to improve my query.

Answer 1

Score: 1

R parses a double-quoted string literal by first looking for recognized escape
sequences (such as \n or \t) and substituting the character each one stands for.
An escape it does not recognize, such as \(, is rejected with an error.
Since an escaped backslash (\\) always resolves to a single literal backslash,
doubling every backslash yields, after parsing, the raw regex string that is passed to the function.

So your double quoted string should be "\\(VP\\h+\\(V\\w*\\h+\\w*\\)" which gets parsed to \(VP\h+\(V\w*\h+\w*\) which is handed to the stringr function.

library(stringr)
str_match_all(
  "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)",
  "\\(VP\\h+\\(V\\w*\\h+\\w*\\)")

https://www.mycompiler.io/view/BjjkPXQUNpT

Output

[[1]]
     [,1]                   
[1,] "(VP (VBZ is)"         
[2,] "(VP (VBN transmitted)"
[3,] "(VP (VBG giving)"

Each language enforces different parsing rules.
Some throw an error when they encounter an unknown escape sequence such as \(;
others silently strip the backslash, leaving (, and do not tell you about it.
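
As a minimal sketch of the escaping rule described in this answer, the snippet below compares a doubled-backslash literal with a raw string. It uses base R only; the raw-string syntax assumes R 4.0.0 or later, and the variable names are illustrative.

```r
# In an ordinary string literal, every regex backslash must be doubled.
pattern <- "\\(VP\\h+\\(V\\w*\\h+\\w*\\)"

# writeLines() prints the characters the regex engine actually receives.
writeLines(pattern)

# Raw strings (R >= 4.0.0) do not treat the backslash as an escape character.
# The r"[ ... ]" delimiters are used so the pattern can contain parentheses.
raw_pattern <- r"[\(VP\h+\(V\w*\h+\w*\)]"

# Both spellings describe the same string.
identical(pattern, raw_pattern)
```

Printing a pattern with writeLines() before passing it on is a quick way to check that the escaping came out as intended.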

Answer 2

Score: 0

I've identified the problem: apparently, the kwic() function no longer supports spaces in a pattern (cf. https://stackoverflow.com/questions/60710730/kwic-in-quanteda-r-does-not-identify-more-than-one-word-in-regex-pattern). I've also used the tokens() function before running the search and wrapped the expression in phrase().

Here is the corrected code:

library(quanteda)
library(dplyr)
library(stringr)
library(tidyverse)
rm(list=ls(all=T))
text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
text2 <- tokens(text)
kwic_regex <- kwic(
  text2, 
  phrase("\\( VP \\V\\w* \\w* \\w* \\)"), 
  window = 10, 
  separator = " ",
  case_insensitive = F,
  valuetype = "regex") %>%
  as.data.frame()
kwic_regex

Output:

  docname from to                        pre                  keyword
1   text1   12 17 ROOT ( S ( NP ( PRP It ) )          ( VP ( VBZ is )
2   text1   22 27 ( VP ( VBZ is ) ( RB not ) ( VP ( VBN transmitted )
3   text1   40 45    ( IN from ) ( : : ) ( S      ( VP ( VBG giving )
                                 post                      pattern
1 ( RB not ) ( VP ( VBN transmitted ) \\( VP \\V\\w* \\w* \\w* \\)
2            ( PP ( IN from ) ( : : ) \\( VP \\V\\w* \\w* \\w* \\)
3           ( NP ( NP ( NP ( NP ( NML \\( VP \\V\\w* \\w* \\w* \\)
Answer 3

Score: 0

To get this to work, you have to understand how tokenisation works in quanteda and how pattern works with multi-token sequences.

First, tokenisation (by default) removes the whitespace that you are including in your regex pattern. But for your pattern, this is not the important part; rather, the sequence is the important part. Also, the current default tokeniser will split parentheses from the POS tags and text. So you want to control this by using a different tokeniser that splits on (and removes) whitespace. See ?tokens and ?pattern.

Second, to match sequences of tokens, you need to wrap your multi-token pattern in phrase(), which will split it on whitespace. See ?phrase.

So this will work (and very efficiently):

library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
toks <- tokens(txt, what = "fasterword", remove_separators = TRUE)
print(toks, -1, -1)
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "(ROOT"        "(S"           "(NP"          "(PRP"         "It))"
#>  [6] "(VP"          "(VBZ"         "is)"          "(RB"          "not)"
#> [11] "(VP"          "(VBN"         "transmitted)" "(PP"          "(IN"
#> [16] "from)"        "(:"           ":)"           "(S"           "(VP"
#> [21] "(VBG"         "giving)"      "(NP"          "(NP"          "(NP"
#> [26] "(NP"          "(NML"         "(NN"          "blood)"
kwic(toks, phrase("\\(VP \\(V \\)"), window = 3, valuetype = "regex")
#> Keyword-in-context with 3 matches.
#>    [text1, 6:8] (NP (PRP It)) |     (VP (VBZ is)      | (RB not) (VP 
#>  [text1, 11:13]  is) (RB not) | (VP (VBN transmitted) | (PP (IN from)
#>  [text1, 20:22]       (::) (S |   (VP (VBG giving)    | (NP (NP (NP

Created on 2023-07-03 with reprex v2.0.2

Note how you do need to double-escape the reserved characters in the regular expression pattern.
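
To picture what matching a multi-token pattern means, here is a rough base R sketch. It is not how quanteda implements kwic() or phrase(); it only mimics the idea, under the assumption that both the text and the pattern are split on single spaces and that each pattern part is tested against one token. The names toks, parts, starts, and hit are illustrative.

```r
txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

# Split the text and the pattern on spaces: one regex per token position.
toks  <- strsplit(txt, " ", fixed = TRUE)[[1]]
parts <- strsplit("\\(VP \\(V \\)", " ", fixed = TRUE)[[1]]

# A window of consecutive tokens matches when every token
# matches the regex at the same position in the pattern.
n <- length(parts)
starts <- seq_len(length(toks) - n + 1)
hit <- vapply(starts, function(i) {
  all(mapply(grepl, parts, toks[i:(i + n - 1)]))
}, logical(1))

# Print the start position and tokens of each matching window.
for (i in starts[hit]) cat(i, ":", toks[i:(i + n - 1)], "\n")
```

With this input the matching windows start at tokens 6, 11, and 20, the same positions that kwic() reports above.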
