Czech characters in regexp search
Question
I am trying to implement a very simple text matcher for Czech words. Since Czech is very suffix-heavy, I want to define the start of a word and then greedily match the rest of it. This is my implementation so far:
r := regexp.MustCompile("(?i)\\by\\w+\\b")
text := "x yž z"
matches := r.FindAllString(text, -1)
fmt.Println(matches) //have [], want [yž]
I studied Go's regexp syntax:
https://github.com/google/re2/wiki/Syntax
but I don't know how to define Czech characters there. Using \w matches only ASCII characters, not Czech UTF-8 characters.
Can you please help me?
Answer 1
Score: 1
In RE2, neither \w nor \b is Unicode-aware:
> \b at ASCII word boundary («\w» on one side and «\W», «\A», or «\z» on the other)
> \w word characters (== [0-9A-Za-z_])
A more general approach is to split on any run of one or more non-letter characters and then collect only the items that meet your criteria:
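package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	output := []string{}
	// \P{L}+ matches one or more characters that are not Unicode letters.
	r := regexp.MustCompile(`\P{L}+`)
	str := "x--++yž,,,.z..00"
	// Split on the non-letter runs so that only words remain.
	words := r.Split(str, -1)
	for i := range words {
		// Keep non-empty words that start with "y" or "Y".
		if len(words[i]) > 0 && (strings.HasPrefix(words[i], `y`) || strings.HasPrefix(words[i], `Y`)) {
			output = append(output, words[i])
		}
	}
	fmt.Println(output)
}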
See the Go demo.
Note that a naive approach like
package main

import (
	"fmt"
	"regexp"
)

func main() {
	output := []string{}
	r := regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
	str := "x--++yž,,,.z..00..."
	matches := r.FindAllStringSubmatch(str, -1)
	for _, v := range matches {
		output = append(output, v[1])
	}
	fmt.Println(output)
}
won't work when the string contains consecutive matches such as match1,match2 match3: it will only fetch the odd-numbered occurrences, because the trailing non-capturing group consumes the character that the leading non-capturing group needs in order to start the next match.
A workaround for the above code is to append an extra non-letter character to each run of non-letter characters, for example:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	output := []string{}
	r := regexp.MustCompile(`(?i)(?:\P{L}|^)(u\p{L}*)(?:\P{L}|$)`)
	str := "uhličitá,uhličité,uhličitou,uhličitého,yz,my"
	matches := r.FindAllStringSubmatch(regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `), -1)
	for _, v := range matches {
		output = append(output, v[1])
	}
	fmt.Println(output)
}
// => [uhličitá uhličité uhličitou uhličitého]
See this Go demo.
Here, regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `) adds a space after every run of non-letter characters.