2023-06-26 14:47:35

Different regex matching results among the ICU library, Rust, and PCRE (https://regexr.com/)

Question

Here is the pattern I used:

"\w+|[^\w\s]+"

When I match the strings "abc.efg" and "戦場のヴァルキュリア3" using PCRE at https://regexr.com/, I get these results:

"abc" "." "efg" => 3 parts

"戦場のヴァルキュリア" "3" => 2 parts

That looks right.

But when I use ICU like this:

    //std::string ldata = "abc.efg";
    std::string ldata = "戦場のヴァルキュリア3";
    std::string m_regex = "\\w+|[^\\w\\s]+";
    UErrorCode         status = U_ZERO_ERROR;
    icu::RegexMatcher  matcher(m_regex.c_str(), 0, status);
    // Convert the UTF-8 input to a UnicodeString before matching.
    icu::StringPiece   data((char*)ldata.data(), ldata.length());
    icu::UnicodeString input = icu::UnicodeString::fromUTF8(data);
    matcher.reset(input);

    int count = 0;
    while (matcher.find(status) && U_SUCCESS(status))
    {
        // Offsets are in UTF-16 code units of `input`.
        auto start_index = matcher.start(status);
        auto end_index   = matcher.end(status);
        count++;
    }

the input string "abc.efg" gives me:

"abc" "." "efg" => 3 parts

but the input string "戦場のヴァルキュリア3" gives me:

"戦場のヴァルキュリア3" => 1 part

When I use Rust like this:

    impl Pattern for &Regex {
        fn find_matches(&self, inside: &str) -> Result<Vec<(Offsets, bool)>> {
            if inside.is_empty() {
                return Ok(vec![((0, 0), false)]);
            }

            let mut prev = 0;
            let mut splits = Vec::with_capacity(inside.len());
            for m in self.find_iter(inside) {
                if prev != m.start() {
                    splits.push(((prev, m.start()), false));
                }
                splits.push(((m.start(), m.end()), true));
                prev = m.end();
            }
            if prev != inside.len() {
                splits.push(((prev, inside.len()), false))
            }
            Ok(splits)
        }
    }

the input string "abc.efg" gives me:

"abc" "." "efg" => 3 parts

but the input string "戦場のヴァルキュリア3" gives me:

"戦場のヴァルキュリア3" => 1 part

Why do ICU and Rust give a different result from PCRE (https://regexr.com/) when matching "戦場のヴァルキュリア3"?

It looks like "戦場のヴァルキュリア3" should be matched as 2 parts.

Answer 1

Score: 5


ICU and Rust's regex crate use Unicode semantics by default, which means that, e.g., for \w they use a Unicode-aware definition of "word character".

For regex it's

[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}\p{Join_Control}]

For ICU it's

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d]

where, per UAX #44 (tr44), Alphabetic is:

Lowercase + Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic

CJK characters are generally categorised as "Letter, other" (Lo), hence they are part of \w in a Unicode-aware classification. So is "3", which is a decimal number (Nd). The whole string therefore matches \w+ just fine and comes back as a single group.
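To make the classification concrete, here is a minimal sketch using only the Rust standard library. It is an approximation, not what either engine runs: `unicode_word` and `ascii_word` are hypothetical helper names, and `char::is_alphanumeric` (Alphabetic plus the numeric categories) only roughly stands in for a Unicode-aware \w, since it ignores marks, connector punctuation other than `_`, and the join controls.

```rust
// Rough stand-in for a Unicode-aware `\w` (assumption: good enough for this input).
fn unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

// ASCII-only `\w`, i.e. [0-9A-Za-z_].
fn ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

fn main() {
    let text = "戦場のヴァルキュリア3";
    // Print how each character is classified under the two definitions.
    for c in text.chars() {
        println!("{} unicode={} ascii={}", c, unicode_word(c), ascii_word(c));
    }
}
```

Under the Unicode-style predicate every character of the string counts as a word character; under the ASCII predicate only "3" does.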

PCRE does not use Unicode semantics by default [1], hence it does not treat "戦場のヴァルキュリア" as letters.

The regex crate supports non-Unicode matching (using either the bytes-based engines or the (?-u:) flag). I don't know whether ICU does, though I rather doubt it, as it would quite defeat the point.

If you specifically want ASCII matching, just ask for that.

Or is it that you misunderstood what \w does and thought it didn't include digits, and thus that PCRE matched "戦場のヴァルキュリア" with \w+ and "3" with [^\w\s]+? What it actually does is the exact opposite: "3" matches \w+, and "戦場のヴァルキュリア" matches [^\w\s]+.
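The difference between the two results can be reproduced without any regex engine. The sketch below is a hand-rolled approximation of `\w+|[^\w\s]+` built on the standard library only; `split_runs`, `unicode_word`, and `ascii_word` are hypothetical names introduced for illustration, and whitespace is always judged by `char::is_whitespace`, which is a simplification for the ASCII case.

```rust
// Splits `text` into maximal runs, roughly as `\w+|[^\w\s]+` would, given a
// caller-supplied definition of "word character". Whitespace is skipped.
// Each piece is paired with `true` if it corresponds to the `\w+` alternative.
fn split_runs(text: &str, is_word: fn(char) -> bool) -> Vec<(String, bool)> {
    let mut parts: Vec<(String, bool)> = Vec::new();
    // Class of the previous char; None means start of input or whitespace.
    let mut prev: Option<bool> = None;
    for c in text.chars() {
        if c.is_whitespace() {
            prev = None;
            continue;
        }
        let word = is_word(c);
        if prev == Some(word) {
            // Same class as the previous char: extend the current piece.
            parts.last_mut().unwrap().0.push(c);
        } else {
            // Class changed (or a new run started): open a new piece.
            parts.push((c.to_string(), word));
        }
        prev = Some(word);
    }
    parts
}

// Rough stand-in for a Unicode-aware `\w` (assumption, see lead-in).
fn unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

// ASCII-only `\w`, i.e. [0-9A-Za-z_].
fn ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

fn main() {
    let text = "戦場のヴァルキュリア3";
    println!("{:?}", split_runs(text, unicode_word));
    println!("{:?}", split_runs(text, ascii_word));
}
```

With the Unicode-style predicate the string comes back as one piece from the \w+ side. With the ASCII predicate it comes back as two pieces, and it is "3" that falls on the \w+ side.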

[1] PCRE2_UCP allows enabling Unicode semantics.
