Different regex matching results among the ICU library, Rust, and PCRE (https://regexr.com/)
Question
Here is the pattern I used:
"\w+|[^\w\s]+"
When I match the strings "abc.efg" and "戦場のヴァルキュリア3" using PCRE on https://regexr.com/, I get these results:
"abc" "." "efg" => 3 parts
"戦場のヴァルキュリア" "3" => 2 parts
That looks right.
But when I use ICU like this:
//std::string ldata = "abc.efg";
std::string ldata = "戦場のヴァルキュリア3";
std::string m_regex = "\\w+|[^\\w\\s]+";
UErrorCode status = U_ZERO_ERROR;
icu::RegexMatcher matcher(m_regex.c_str(), 0, status);
icu::StringPiece data((char*)ldata.data(), ldata.length());
icu::UnicodeString input = icu::UnicodeString::fromUTF8(data);
matcher.reset(input);
int count = 0;
while (matcher.find(status) && U_SUCCESS(status))
{
    auto start_index = matcher.start(status);
    auto end_index = matcher.end(status);
    count++;
}
the input string "abc.efg" gives me:
"abc" "." "efg" => 3 parts
but the input string "戦場のヴァルキュリア3" gives me:
"戦場のヴァルキュリア3" => 1 part
When I use Rust like this:
impl Pattern for &Regex {
    fn find_matches(&self, inside: &str) -> Result<Vec<(Offsets, bool)>> {
        if inside.is_empty() {
            return Ok(vec![((0, 0), false)]);
        }
        let mut prev = 0;
        let mut splits = Vec::with_capacity(inside.len());
        for m in self.find_iter(inside) {
            if prev != m.start() {
                splits.push(((prev, m.start()), false));
            }
            splits.push(((m.start(), m.end()), true));
            prev = m.end();
        }
        if prev != inside.len() {
            splits.push(((prev, inside.len()), false));
        }
        Ok(splits)
    }
}
the input string "abc.efg" gives me:
"abc" "." "efg" => 3 parts
but the input string "戦場のヴァルキュリア3" gives me:
"戦場のヴァルキュリア3" => 1 part
Why do ICU and Rust give a different result from PCRE (https://regexr.com/) when matching "戦場のヴァルキュリア3"?
It looks as though "戦場のヴァルキュリア3" should be matched as 2 parts.
Answer 1
Score: 5
ICU and Rust's regex crate use Unicode semantics by default, which means that for \w, for example, they use a Unicode-aware definition of "word characters".
For the regex crate it is:
[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}\p{Join_Control}]
For ICU it is:
[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d]
where, per Unicode TR44, Alphabetic is:
Lowercase + Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic
CJK characters are generally categorised as "letter, other" (Lo), hence they are part of \w in a Unicode-aware classification. So is "3", since it is a decimal digit (Nd). The whole string therefore matches \w+ and comes back as a single group.
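To see the classification concretely, here is a minimal sketch using only Rust's standard library. The helper `word_class` is hypothetical, and `char::is_alphabetic` / `char::is_numeric` only approximate the regex crate's \w (they do not cover marks, connector punctuation other than `_`, or join controls), so treat it as an illustration rather than the crate's actual definition.

```rust
/// Report whether `ch` counts as a word character under an approximate
/// Unicode-aware definition and under an ASCII-only definition.
/// (Hypothetical helper; an approximation of `\w`, not the regex crate's table.)
fn word_class(ch: char) -> (bool, bool) {
    // `is_alphabetic` follows the Unicode Alphabetic property;
    // `is_numeric` covers the numeric general categories, including Nd.
    let unicode = ch.is_alphabetic() || ch.is_numeric() || ch == '_';
    // ASCII-only view: just [0-9A-Za-z_].
    let ascii = ch.is_ascii_alphanumeric() || ch == '_';
    (unicode, ascii)
}

fn main() {
    for ch in "戦の3a.".chars() {
        let (unicode, ascii) = word_class(ch);
        println!("{:?}: unicode word = {}, ascii word = {}", ch, unicode, ascii);
    }
}
```

The Japanese characters are word characters only in the Unicode view, while "3" is a word character in both views.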
PCRE does not use Unicode semantics by default [1], hence it does not treat "戦場のヴァルキュリア" as letters.
The regex crate supports non-Unicode matching (using either the bytes-based engines or the (?-u:) flag). I don't know whether ICU does, though I rather doubt it, as that would quite defeat the point.
If you specifically want ASCII matching, just ask for that explicitly.
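As a rough illustration of what "asking for ASCII" changes, the sketch below emulates the `\w+|[^\w\s]+` split by hand with the standard library only, taking the definition of a word character as a parameter. The names `split_runs`, `is_unicode_word` and `is_ascii_word` are hypothetical, and `char::is_alphanumeric` is only an approximation of a Unicode-aware \w. With the regex crate itself, an explicit class such as `[0-9A-Za-z_]` in place of \w would be one way to express the same intent.

```rust
/// Split `text` roughly the way `\w+|[^\w\s]+` would, using `is_word` as the
/// definition of a word character. Whitespace ends a run and is dropped.
fn split_runs(text: &str, is_word: fn(char) -> bool) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut run_start = 0; // byte offset where the current run began
    // None = not inside a run, Some(true) = word run, Some(false) = non-word run
    let mut run_kind: Option<bool> = None;

    for (idx, ch) in text.char_indices() {
        let kind = if ch.is_whitespace() { None } else { Some(is_word(ch)) };
        if kind != run_kind {
            // The character class changed: close the run in progress, if any.
            if run_kind.is_some() {
                parts.push(&text[run_start..idx]);
            }
            run_start = idx;
            run_kind = kind;
        }
    }
    // Close the final run.
    if run_kind.is_some() {
        parts.push(&text[run_start..]);
    }
    parts
}

/// Approximate Unicode-aware word character (assumption: close to `\w`).
fn is_unicode_word(ch: char) -> bool {
    ch.is_alphanumeric() || ch == '_'
}

/// ASCII-only word character: [0-9A-Za-z_].
fn is_ascii_word(ch: char) -> bool {
    ch.is_ascii_alphanumeric() || ch == '_'
}

fn main() {
    for text in ["abc.efg", "戦場のヴァルキュリア3"] {
        println!("unicode: {:?}", split_runs(text, is_unicode_word));
        println!("ascii:   {:?}", split_runs(text, is_ascii_word));
    }
}
```

Under the Unicode predicate the Japanese string stays in one piece, as ICU and the regex crate report; under the ASCII predicate it splits into two, as on regexr.com.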
Or is it that you misunderstand what \w does and thought it did not include digits, and so assumed that PCRE matched "戦場のヴァルキュリア" with \w+ and "3" with [^\w\s]+? What it does is the exact opposite: "3" matches \w+ and the Japanese text matches [^\w\s]+.
1: PCRE2_UCP allows enabling Unicode semantics.