Quanteda and stringr in R: (Correct) regex cannot be parsed
Question
I want to run a regex search using the `quanteda` and `stringr` libraries, but I keep getting errors. My goal is to match patterns of the form `(VP (V.. ...)` using the regex `\(VP\h+\(V\w*\h+\w*\)`. Here is a MWE:
library(quanteda)
library(dplyr)
library(stringr)
text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
kwic_regex <- kwic(
# define text
text,
# define search pattern
"\(VP\h+\(V\w*\h+\w*\)",
window = 20,
# define valuetype
valuetype = "regex") %>%
# make it a data frame
as.data.frame()
And this is the error message:
Error: '\(' is an unrecognized escape in character string starting ""\("
I find this puzzling because the regex should be correct (cf. https://regex101.com/r/3hbZ0R/1). I've also tried escaping the escapes (e.g., `\\(`), to no avail. I would really appreciate any ideas on how to improve my query.
Answer 1
Score: 1
R parses a double-quoted string literal before the regex engine ever sees it: the parser scans for backslash escapes and substitutes the character each one stands for, and it raises an error for an escape it does not recognise, such as `\(`.
An escaped backslash (`\\`) always resolves to a single literal backslash, so once every backslash is doubled, the parsed string is exactly the raw regex that gets passed to the function.
So your double-quoted string should be `"\\(VP\\h+\\(V\\w*\\h+\\w*\\)"`, which is parsed to `\(VP\h+\(V\w*\h+\w*\)` and then handed to the stringr function.
library(stringr)
str_match_all(
"(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)",
"\\(VP\\h+\\(V\\w*\\h+\\w*\\)" )
https://www.mycompiler.io/view/BjjkPXQUNpT
Output
[[1]]
[,1]
[1,] "(VP (VBZ is)"
[2,] "(VP (VBN transmitted)"
[3,] "(VP (VBG giving)"
Each language enforces different parsing rules.
Some throw an error when they encounter an unknown escape sequence such as `\(`; others silently strip the backslash, leaving `(`, and do not tell you about it.
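To see what R's parser actually hands to the regex engine, it can help to print the parsed string. The following is a minimal base-R sketch, not part of the original answer. It assumes R 4.0.0 or later for the raw-string syntax `r"[...]"`, and it uses `gregexpr(..., perl = TRUE)` in place of stringr so that it runs without extra packages.

```r
# The doubled backslashes are consumed by R's string parser,
# so the regex engine receives single backslashes.
pattern <- "\\(VP\\h+\\(V\\w*\\h+\\w*\\)"
writeLines(pattern)  # shows the string as the regex engine will see it

# Assumption: R >= 4.0.0. A raw string skips escape processing,
# so the regex can be written exactly as on regex101.
raw_pattern <- r"[\(VP\h+\(V\w*\h+\w*\)]"
same <- identical(pattern, raw_pattern)
print(same)

text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

# perl = TRUE selects PCRE, which understands \h (horizontal whitespace).
hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
print(hits)
```

Both spellings produce the same 21-character string, and the three matches are the same ones `str_match_all()` reports above.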
Answer 2
Score: 0
I've identified the problem: apparently, the `kwic()` function no longer supports whitespace inside a pattern (cf. https://stackoverflow.com/questions/60710730/kwic-in-quanteda-r-does-not-identify-more-than-one-word-in-regex-pattern). I've also used the `tokens()` function before running the search and wrapped the expression in `phrase()`.
Here is the corrected code:
library(quanteda)
library(dplyr)  # provides the %>% pipe
text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
# tokenise first: kwic() matches patterns against tokens
text2 <- tokens(text)
kwic_regex <- kwic(
text2,
# phrase() splits the pattern on whitespace into one regex per token
phrase("\\( VP \\V\\w* \\w* \\w* \\)"),
window = 10,
separator = " ",
case_insensitive = FALSE,
valuetype = "regex") %>%
as.data.frame()
kwic_regex
Output:
docname from to pre keyword
1 text1 12 17 ROOT ( S ( NP ( PRP It ) ) ( VP ( VBZ is )
2 text1 22 27 ( VP ( VBZ is ) ( RB not ) ( VP ( VBN transmitted )
3 text1 40 45 ( IN from ) ( : : ) ( S ( VP ( VBG giving )
post pattern
1 ( RB not ) ( VP ( VBN transmitted ) \\( VP \\V\\w* \\w* \\w* \\)
2 ( PP ( IN from ) ( : : ) \\( VP \\V\\w* \\w* \\w* \\)
3 ( NP ( NP ( NP ( NP ( NML \\( VP \\V\\w* \\w* \\w* \\)
</details>
# Answer 3
**Score**: 0
To get this to work, you have to understand how tokenisation works in **quanteda** and how `pattern` works with multi-token sequences.
First, tokenisation (by default) removes the whitespace that you are including in your regex pattern. But for your pattern, this is not the important part; rather, the sequence is the important part. Also, the current default tokeniser will split parentheses from the POS tags and text. So you want to control this by using a different tokeniser that splits on (and removes) whitespace. See `?tokens` and `?pattern`.
Second, to match sequences of tokens, you need to wrap your multi-token pattern in `phrase()`, which will split it on whitespace. See `?phrase`.
So this will work (and very efficiently):
library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
toks <- tokens(txt, what = "fasterword", remove_separators = TRUE)
print(toks, -1, -1)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "(ROOT" "(S" "(NP" "(PRP" "It))"
#> [6] "(VP" "(VBZ" "is)" "(RB" "not)"
#> [11] "(VP" "(VBN" "transmitted)" "(PP" "(IN"
#> [16] "from)" "(:" ":)" "(S" "(VP"
#> [21] "(VBG" "giving)" "(NP" "(NP" "(NP"
#> [26] "(NP" "(NML" "(NN" "blood)"
kwic(toks, phrase("\\(VP \\(V \\)"), window = 3, valuetype = "regex")
#> Keyword-in-context with 3 matches.
#> [text1, 6:8] (NP (PRP It)) | (VP (VBZ is) | (RB not) (VP
#> [text1, 11:13] is) (RB not) | (VP (VBN transmitted) | (PP (IN from)
#> [text1, 20:22] (::) (S | (VP (VBG giving) | (NP (NP (NP
<sup>Created on 2023-07-03 with reprex v2.0.2</sup>
Note how you do need to double-escape the reserved characters in the regular expression pattern.
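The idea behind matching a whitespace-separated pattern against a token sequence can be sketched in base R. This is an illustration only, not quanteda's implementation: `match_token_sequence()` is a hypothetical helper, and a plain whitespace split stands in for the "fasterword" tokeniser.

```r
# Hypothetical helper: split the pattern on whitespace into one regex per
# token, slide a window over the tokens, and keep the start positions
# where every regex matches its token.
match_token_sequence <- function(tokens, pattern) {
  parts <- strsplit(pattern, "\\s+")[[1]]
  n <- length(parts)
  if (n == 0 || length(tokens) < n) return(integer(0))
  starts <- seq_len(length(tokens) - n + 1)
  hits <- vapply(starts, function(i) {
    window <- tokens[i:(i + n - 1)]
    all(mapply(grepl, parts, window, MoreArgs = list(perl = TRUE)))
  }, logical(1))
  starts[hits]
}

txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"

# Stand-in for the tokeniser: split on (and drop) whitespace.
toks <- strsplit(txt, "\\s+")[[1]]

starts <- match_token_sequence(toks, "\\(VP \\(V \\)")
print(starts)
```

The start positions 6, 11 and 20 correspond to the `6:8`, `11:13` and `20:22` spans in the `kwic()` output above. Each sub-pattern is tested against a single token, so whitespace in the pattern only separates the per-token regexes and is never matched itself.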