Different results of regex matching among the ICU library, Rust, and PCRE (https://regexr.com/)

Question


Here is the pattern I used:

"\w+|[^\w\s]+"

When I match the strings "abc.efg" and "戦場のヴァルキュリア3" using PCRE on https://regexr.com/,
I get results like this:

"abc" "." "efg" => 3 parts

"戦場のヴァルキュリア" "3" => 2 parts

That looks right.

But when I use ICU like this:

    // std::string ldata = "abc.efg";
    std::string ldata = "戦場のヴァルキュリア3";
    std::string m_regex = "\\w+|[^\\w\\s]+";
    UErrorCode status = U_ZERO_ERROR;

    // Convert the UTF-8 pattern and input to ICU's UTF-16 UnicodeString.
    icu::UnicodeString pattern = icu::UnicodeString::fromUTF8(m_regex);
    icu::UnicodeString input   = icu::UnicodeString::fromUTF8(ldata);
    icu::RegexMatcher  matcher(pattern, 0, status);
    matcher.reset(input);

    // Count the matches; start/end are UTF-16 code unit offsets into input.
    int count = 0;
    while (matcher.find(status) && U_SUCCESS(status))
    {
        auto start_index = matcher.start(status);
        auto end_index   = matcher.end(status);
        count++;
    }

the input string "abc.efg" gives me:

"abc" "." "efg" => 3 parts

but the input string "戦場のヴァルキュリア3" gives me:

"戦場のヴァルキュリア3" => 1 part

When I use Rust like this:

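impl Pattern for &Regex {
    // Returns every span of `inside` as (offsets, is_match), covering both
    // the regex matches and the unmatched gaps between them.
    fn find_matches(&self, inside: &str) -> Result<Vec<(Offsets, bool)>> {
        if inside.is_empty() {
            return Ok(vec![((0, 0), false)]);
        }

        let mut prev = 0;
        let mut splits = Vec::with_capacity(inside.len());
        for m in self.find_iter(inside) {
            // Record the unmatched gap before this match, if any.
            if prev != m.start() {
                splits.push(((prev, m.start()), false));
            }
            splits.push(((m.start(), m.end()), true));
            prev = m.end();
        }
        // Record any unmatched tail after the last match.
        if prev != inside.len() {
            splits.push(((prev, inside.len()), false));
        }
        Ok(splits)
    }
}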

the input string "abc.efg" gives me:

"abc" "." "efg" => 3 parts

but the input string "戦場のヴァルキュリア3" gives me:

"戦場のヴァルキュリア3" => 1 part

Why do ICU and Rust give a different result from PCRE (https://regexr.com/) when matching "戦場のヴァルキュリア3"?

It looks like "戦場のヴァルキュリア3" should be matched as 2 parts.

Answer 1

Score: 5


ICU and the Rust regex crate use Unicode semantics by default, which means that for \w, for example, they use a Unicode-aware definition of "word characters".

For regex it is:

[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}\p{Join_Control}]

For ICU it is:

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d]

where, per Unicode TR44, Alphabetic is:

Lowercase + Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic

CJK characters are generally categorised as "Letter, other" (Lo), so they are part of \w in a Unicode-aware classification. So is "3", since it is a decimal digit (Nd). The result is a single match, because the whole string matches \w+ just fine.
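To make the classification concrete, here is a minimal sketch in Rust using only the standard library. It assumes that `char::is_alphanumeric` plus `_` is a close enough stand-in for the Unicode-aware \w (the real class also includes marks, other connector punctuation, and the join controls). The helper names `is_unicode_word` and `is_ascii_word` are made up for this illustration and are not part of any regex library.

```rust
/// Rough stand-in for a Unicode-aware `\w`: Alphabetic or numeric, plus '_'.
/// Assumption: this ignores marks, other connector punctuation, and join controls.
fn is_unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// ASCII-only `\w`, i.e. [A-Za-z0-9_].
fn is_ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

fn main() {
    let text = "戦場のヴァルキュリア3";

    // Print how each character is classified under both definitions.
    for c in text.chars() {
        println!(
            "{} U+{:04X} unicode_word={} ascii_word={}",
            c,
            c as u32,
            is_unicode_word(c),
            is_ascii_word(c)
        );
    }

    // Every character is a word character under the Unicode-aware definition.
    assert!(text.chars().all(is_unicode_word));

    // Only the digit is a word character under the ASCII definition.
    let ascii_words = text.chars().filter(|&c| is_ascii_word(c)).count();
    assert_eq!(ascii_words, 1);
}
```

Under the Unicode-aware predicate every character of the string qualifies, so \w+ can consume it in one go; under the ASCII predicate only "3" qualifies.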

PCRE does not use Unicode semantics by default [1], so it does not treat "戦場のヴァルキュリア" as letters.

The regex crate supports non-Unicode matching (using either the bytes-based engines or the (?-u:) flag). I don't know whether ICU does, though I rather doubt it, as that would quite defeat the point.

If you specifically want ASCII matching, just ask for that.

Or is it that you misunderstood what \w does and thought it didn't include digits, and so assumed that PCRE matched "戦場のヴァルキュリア" with \w+ and "3" with [^\w\s]+? What it does is the exact opposite: "3" matches \w+, and "戦場のヴァルキュリア" matches [^\w\s]+.
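The two outcomes can be reproduced without any regex engine. The sketch below hand-rolls the splitting that the alternation \w+|[^\w\s]+ performs and runs it with two different definitions of "word character". Everything here is an assumption made for illustration: `split_runs` and the four predicates are hypothetical helpers, the Unicode predicates only approximate the real \w and \s classes, and the ASCII whitespace predicate may differ from PCRE's \s on rarely used control characters.

```rust
/// Splits `text` into maximal runs of word characters and maximal runs of
/// characters that are neither word characters nor whitespace, which is what
/// repeatedly matching `\w+|[^\w\s]+` yields. Whitespace is skipped.
fn split_runs<'a>(
    text: &'a str,
    is_word: fn(char) -> bool,
    is_space: fn(char) -> bool,
) -> Vec<&'a str> {
    let mut parts = Vec::new();
    let mut start = 0; // byte offset where the current run began
    let mut current: Option<bool> = None; // Some(true) = word run, Some(false) = other run

    for (i, c) in text.char_indices() {
        let kind = if is_space(c) { None } else { Some(is_word(c)) };
        if kind != current {
            // The run type changed: close the previous run, if there was one.
            if current.is_some() {
                parts.push(&text[start..i]);
            }
            start = i;
            current = kind;
        }
    }
    if current.is_some() {
        parts.push(&text[start..]);
    }
    parts
}

// ASCII semantics (what PCRE does by default).
fn ascii_word(c: char) -> bool { c.is_ascii_alphanumeric() || c == '_' }
fn ascii_space(c: char) -> bool { c.is_ascii_whitespace() }

// Approximate Unicode semantics (what ICU and the regex crate do by default).
fn unicode_word(c: char) -> bool { c.is_alphanumeric() || c == '_' }
fn unicode_space(c: char) -> bool { c.is_whitespace() }

fn main() {
    for text in ["abc.efg", "戦場のヴァルキュリア3"] {
        println!("{:?}", text);
        println!("  ASCII:   {:?}", split_runs(text, ascii_word, ascii_space));
        println!("  Unicode: {:?}", split_runs(text, unicode_word, unicode_space));
    }
}
```

With the ASCII predicates the Japanese text forms a single "not word, not space" run followed by the word run "3", giving 2 parts; with the Unicode predicates the whole string is one word run.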

[1] The PCRE2_UCP option enables Unicode semantics.

huangapple
  • Posted on June 26, 2023 at 14:47:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76554148.html