Quanteda and stringr in R: (Correct) regex cannot be parsed

Question

I want to run a regex search using the quanteda and stringr libraries, but I continue to receive errors. My goal is to match the patterns (VP (V.. ...) using the regex \(VP\h+\(V\w*\h+\w*\). Here is a MWE:

library(quanteda)
library(dplyr)
library(stringr)

text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

kwic_regex <- kwic(
  # define text
  text, 
  # define search pattern
  "\(VP\h+\(V\w*\h+\w*\)", 
  window = 20, 
  # define valuetype
  valuetype = "regex") %>%
  # make it a data frame
  as.data.frame()

And this is the error message:

Error: '\(' is an unrecognized escape in character string starting ""\("

I find it puzzling because the regex should be correct (cf. https://regex101.com/r/3hbZ0R/1). I've also tried escaping the escapes (e.g., \\() to no avail. I would really appreciate any ideas on how to improve my query.

Answer 1

Score: 1

R parses a double-quoted string literal by first checking each backslash escape against the escapes it recognizes (such as \n or \t) and substituting the corresponding character. An unrecognized escape such as \( is rejected with an error. Because an escaped backslash (\\) resolves to a single literal backslash, doubling every backslash leaves exactly the raw regex string after parsing, and that string is what gets passed to the function.

So your double-quoted string should be "\\(VP\\h+\\(V\\w*\\h+\\w*\\)", which is parsed to \(VP\h+\(V\w*\h+\w*\) and then handed to the stringr function.
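To see the difference between what is typed and what the regex engine receives, a small base R sketch can help. It assumes R 4.0.0 or later for the raw-string syntax; the variable names are only illustrative.

```r
# The doubled backslashes exist only in the source code. After parsing,
# the string holds a single backslash per escape.
pattern <- "\\(VP\\h+\\(V\\w*\\h+\\w*\\)"

# writeLines() prints the string as the regex engine will receive it
writeLines(pattern)

# "\\(" is two characters long: one backslash and one parenthesis
nchar("\\(")

# Raw strings (assumption: R >= 4.0.0) avoid the doubling altogether
raw_pattern <- r"[\(VP\h+\(V\w*\h+\w*\)]"
identical(pattern, raw_pattern)
```

If the raw-string form is available, it lets you paste a pattern from a regex tester unchanged.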

library(stringr)
str_match_all(
  "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)",
  "\\(VP\\h+\\(V\\w*\\h+\\w*\\)")

https://www.mycompiler.io/view/BjjkPXQUNpT

Output

[[1]]
     [,1]                   
[1,] "(VP (VBZ is)"         
[2,] "(VP (VBN transmitted)"
[3,] "(VP (VBG giving)"     
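For comparison, the same doubled-backslash pattern should also work without stringr. The sketch below uses base R's gregexpr() and regmatches(). Note that perl = TRUE is needed here, because \h is a PCRE escape that R's default regex engine does not recognise.

```r
text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

# Same pattern as above, written with doubled backslashes
pattern <- "\\(VP\\h+\\(V\\w*\\h+\\w*\\)"

# gregexpr() locates every match; regmatches() extracts the matched text
matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
matches
```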

Each language enforces different parsing rules.
Some throw an error when they encounter an unknown escape sequence such as \(;
others silently strip the backslash, leaving (, and do not tell you about it.
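R falls into the first group. As a rough illustration, the sketch below feeds two pieces of source text to parse() and reports whether R accepts them. The helper try_parse() is a hypothetical name introduced only for this example.

```r
# Source text of a string literal with a single backslash: "\(VP"
bad_source <- '"\\(VP"'
# Source text of the same literal with the backslash doubled: "\\(VP"
good_source <- '"\\\\(VP"'

# Hypothetical helper: returns "parsed" or "parse error" for a piece of source
try_parse <- function(src) {
  tryCatch({
    parse(text = src)
    "parsed"
  }, error = function(e) "parse error")
}

try_parse(bad_source)
try_parse(good_source)
```

The error is raised while the code is being parsed, before any regex function is called, which is why the choice of quanteda or stringr makes no difference to it.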

Answer 2

Score: 0

I've identified the problem: apparently, the kwic() function no longer matches patterns that contain spaces, because a pattern is matched against individual tokens (cf. https://stackoverflow.com/questions/60710730/kwic-in-quanteda-r-does-not-identify-more-than-one-word-in-regex-pattern). I therefore tokenised the text with tokens() before running the search and wrapped the expression in phrase().

Here is the corrected code:

library(quanteda)
library(dplyr)
library(stringr)
library(tidyverse)

rm(list = ls(all = TRUE))

text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

text2 <- tokens(text)

kwic_regex <- kwic(
  text2, 
  phrase("\\( VP \\V\\w* \\w* \\w* \\)"), 
  window = 10, 
  separator = " ",
  case_insensitive = FALSE,
  valuetype = "regex") %>%
  as.data.frame()

kwic_regex

Output:

  docname from to                        pre                  keyword
1   text1   12 17 ROOT ( S ( NP ( PRP It ) )          ( VP ( VBZ is )
2   text1   22 27 ( VP ( VBZ is ) ( RB not ) ( VP ( VBN transmitted )
3   text1   40 45    ( IN from ) ( : : ) ( S      ( VP ( VBG giving )
                                 post                      pattern
1 ( RB not ) ( VP ( VBN transmitted ) \\( VP \\V\\w* \\w* \\w* \\)
2            ( PP ( IN from ) ( : : ) \\( VP \\V\\w* \\w* \\w* \\)
3           ( NP ( NP ( NP ( NP ( NML \\( VP \\V\\w* \\w* \\w* \\)

Answer 3

Score: 0

To get this to work, you have to understand how tokenisation works in quanteda and how pattern works with multi-token sequences.

First, tokenisation (by default) removes the whitespace that you are including in your regex pattern. But for your pattern, this is not the important part; rather, the sequence is the important part. Also, the current default tokeniser will split parentheses from the POS tags and text. So you want to control this by using a different tokeniser that splits on (and removes) whitespace. See ?tokens and ?pattern.
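The difference between the two tokenisers can be checked on a shortened version of the input. This is only a sketch, assuming a quanteda 3.x or later installation; the names toks_default and toks_ws are illustrative.

```r
library("quanteda")

txt_short <- "(ROOT (S (NP (PRP It)) (VP (VBZ is)"

# Default tokeniser: parentheses are split off from the tags and words
toks_default <- tokens(txt_short)

# Whitespace-only splitting keeps "(VP", "(VBZ" and "is)" as single tokens
toks_ws <- tokens(txt_short, what = "fasterword", remove_separators = TRUE)

as.character(toks_default)
as.character(toks_ws)
```

With whitespace splitting, each bracketed tag stays attached to its parenthesis, so a pattern element such as \\(VP can match a whole token.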

Second, to match sequences of tokens, you need to wrap your multi-token pattern in phrase(), which will split it on whitespace. See ?phrase.

So this will work (and very efficiently):

library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.

txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

toks <- tokens(txt, what = "fasterword", remove_separators = TRUE)
print(toks, -1, -1)
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "(ROOT"        "(S"           "(NP"          "(PRP"         "It))"        
#>  [6] "(VP"          "(VBZ"         "is)"          "(RB"          "not)"        
#> [11] "(VP"          "(VBN"         "transmitted)" "(PP"          "(IN"         
#> [16] "from)"        "(:"           ":)"           "(S"           "(VP"         
#> [21] "(VBG"         "giving)"      "(NP"          "(NP"          "(NP"         
#> [26] "(NP"          "(NML"         "(NN"          "blood)"

kwic(toks, phrase("\\(VP \\(V \\)"), window = 3, valuetype = "regex")
#> Keyword-in-context with 3 matches.                                                                     
#>    [text1, 6:8] (NP (PRP It)) |     (VP (VBZ is)      | (RB not) (VP 
#>  [text1, 11:13]  is) (RB not) | (VP (VBN transmitted) | (PP (IN from)
#>  [text1, 20:22]       (::) (S |   (VP (VBG giving)    | (NP (NP (NP

Created on 2023-07-03 with reprex v2.0.2

Note how you do need to double-escape the reserved characters in the regular expression pattern.
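To make the idea of a multi-token pattern concrete, here is a base R sketch of the matching logic. It is not how quanteda implements kwic() or phrase(); it only mimics the behaviour described in this answer. It assumes the text is split on single spaces, the pattern is split into one regex per token, and each regex is matched unanchored against the token in the same position.

```r
txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

# Split the text and the pattern on whitespace
toks_vec <- strsplit(txt, " ", fixed = TRUE)[[1]]
pieces <- strsplit("\\(VP \\(V \\)", " ", fixed = TRUE)[[1]]
n <- length(pieces)

# Candidate start positions for a window of n consecutive tokens
starts <- seq_len(length(toks_vec) - n + 1)

# Keep the positions where every regex piece matches its aligned token
hits <- Filter(function(i) {
  window <- toks_vec[i:(i + n - 1)]
  all(mapply(grepl, pieces, window, MoreArgs = list(perl = TRUE)))
}, starts)

hits
# Print each matched window as a single string
vapply(hits, function(i) paste(toks_vec[i:(i + n - 1)], collapse = " "), character(1))
```

The start positions found this way line up with the 6:8, 11:13 and 20:22 spans reported by kwic() above.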


huangapple
  • Posted on June 29, 2023, 22:14:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76581901.html