2021-10-11 19:23:56 · go

Czech characters in regexp search

Question

I am trying to implement a very simple text matcher for Czech words. Since Czech is very suffix-heavy, I want to define the start of the word and then greedily match the rest of it. This is my implementation so far:

    r := regexp.MustCompile("(?i)\\by\\w+\\b")
    text := "x yž z"
    matches := r.FindAllString(text, -1)
    fmt.Println(matches) // have [], want [yž]

I studied Go's regexp syntax:
https://github.com/google/re2/wiki/Syntax

but I don't know how to define Czech characters there. Using \w matches only ASCII characters, not Czech Unicode characters.

Can you please help me?

Answer 1

Score: 1

In RE2, neither \w nor \b is Unicode-aware:

> \b at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
>
> \w word characters (== [0-9A-Za-z_])
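To make the difference concrete, here is a minimal sketch comparing \w with the Unicode letter class \p{L} on the string from the question. The helper names `asciiWords` and `letterRuns` are made up for illustration and are not part of the original answer.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiWords (hypothetical helper) uses \w, which in Go's regexp
// package covers only [0-9A-Za-z_].
func asciiWords(s string) []string {
	return regexp.MustCompile(`\w+`).FindAllString(s, -1)
}

// letterRuns (hypothetical helper) uses \p{L}, the Unicode "letter"
// class, so a character such as ž counts as part of a word.
func letterRuns(s string) []string {
	return regexp.MustCompile(`\p{L}+`).FindAllString(s, -1)
}

func main() {
	text := "x yž z"
	fmt.Println(asciiWords(text)) // [x y z]
	fmt.Println(letterRuns(text)) // [x yž z]
}
```

With \w the match stops right before ž, which is why `y\w+` in the question finds nothing: there is no ASCII word character after the y.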

A more general approach is to split on any run of one or more non-letter characters, and then collect only the items that meet your criteria:

    package main

    import (
    	"fmt"
    	"regexp"
    	"strings"
    )

    func main() {
    	output := []string{}
    	// \P{L}+ matches any run of one or more non-letter characters.
    	r := regexp.MustCompile(`\P{L}+`)
    	str := "x--++yž,,,.z..00"
    	words := r.Split(str, -1)
    	for _, w := range words {
    		// Keep only the words that start with "y" or "Y".
    		if strings.HasPrefix(w, "y") || strings.HasPrefix(w, "Y") {
    			output = append(output, w)
    		}
    	}
    	fmt.Println(output)
    }

See the Go demo.

Note that a naive approach like

    package main

    import (
    	"fmt"
    	"regexp"
    )

    func main() {
    	output := []string{}
    	// A non-letter (or start of text), the word itself in group 1,
    	// then a non-letter (or end of text).
    	r := regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
    	str := "x--++yž,,,.z..00..."
    	matches := r.FindAllStringSubmatch(str, -1)
    	for _, v := range matches {
    		output = append(output, v[1])
    	}
    	fmt.Println(output)
    }

won't work if the string contains consecutive matches such as match1,match2 match3. It will only get every other occurrence (the odd-numbered ones), because the trailing non-capturing group consumes the character that the leading non-capturing group needs in order to start the next match.
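A small sketch of that failure mode, using the same naive pattern on a made-up input in which three words starting with y are separated by single delimiters. The wrapper `naiveMatches` is a hypothetical helper added only for this demonstration.

```go
package main

import (
	"fmt"
	"regexp"
)

// naive is the pattern from the naive approach; its trailing
// (?:\P{L}|$) consumes the delimiter that follows each word.
var naive = regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)

// naiveMatches (hypothetical helper) collects capture group 1
// of every match.
func naiveMatches(s string) []string {
	var out []string
	for _, m := range naive.FindAllStringSubmatch(s, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// "yb" is skipped: the comma before it was already consumed by
	// the first match, so nothing is left for the leading group.
	fmt.Println(naiveMatches("ya,yb yc")) // [ya yc]
}
```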

A workaround for the naive pattern is to append a non-letter character to the end of each run of non-letters, for example:

    package main

    import (
    	"fmt"
    	"regexp"
    )

    func main() {
    	output := []string{}
    	r := regexp.MustCompile(`(?i)(?:\P{L}|^)(u\p{L}*)(?:\P{L}|$)`)
    	str := "uhličitá,uhličité,uhličitou,uhličitého,yz,my"
    	// Append a space to every run of non-letters before matching,
    	// so each word has its own leading delimiter.
    	matches := r.FindAllStringSubmatch(regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `), -1)
    	for _, v := range matches {
    		output = append(output, v[1])
    	}
    	fmt.Println(output)
    }
    // => [uhličitá uhličité uhličitou uhličitého]

See this Go demo.

Here, regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `) appends a space to every run of non-letter characters.
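As an alternative to padding the input, one could match whole letter runs with \p{L}+ and filter them by prefix afterwards, which never consumes a delimiter. This is a different technique from the workaround above and only a sketch: `wordsWithPrefix` is a hypothetical helper, and it assumes the text uses precomposed characters (combining marks belong to \p{M}, not \p{L}, so decomposed text would be split at the accent).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// letters matches a maximal run of Unicode letters.
var letters = regexp.MustCompile(`\p{L}+`)

// wordsWithPrefix (hypothetical helper) returns every letter run in
// text that starts with prefix, compared case-insensitively.
func wordsWithPrefix(text, prefix string) []string {
	var out []string
	p := strings.ToLower(prefix)
	for _, w := range letters.FindAllString(text, -1) {
		if strings.HasPrefix(strings.ToLower(w), p) {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	str := "uhličitá,uhličité,Uhličitou,yz,my"
	fmt.Println(wordsWithPrefix(str, "u")) // [uhličitá uhličité Uhličitou]
}
```

Because each word is matched on its own, adjacent words separated by a single comma or space are all found without modifying the input.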
