Czech characters in regexp search

Question

I am trying to implement a very simple text matcher for Czech words. Since Czech is very suffix-heavy, I want to define the start of the word and then greedily match the rest of it. This is my implementation so far:

    r := regexp.MustCompile("(?i)\\by\\w+\\b")
    text := "x yž z"
    matches := r.FindAllString(text, -1)
    fmt.Println(matches) // have [], want [yž]

I studied Go's regexp syntax:
https://github.com/google/re2/wiki/Syntax

but I don't know how to define Czech characters there. Using \w matches only ASCII characters, not the non-ASCII (UTF-8 encoded) Czech characters.

Can you please help me?

Answer 1

Score: 1

In RE2, neither \w nor \b is Unicode-aware:

> \b at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
> \w word characters (== [0-9A-Za-z_])

A more general approach is to split on any chunk of one or more non-letter characters, and then collect only the items that meet your criteria:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	output := []string{}
	// \P{L}+ matches one or more characters that are not Unicode letters.
	r := regexp.MustCompile(`\P{L}+`)
	str := "x--++yž,,,.z..00"
	words := r.Split(str, -1)
	for i := range words {
		// Keep only the non-empty words that start with "y" or "Y".
		if len(words[i]) > 0 && (strings.HasPrefix(words[i], `y`) || strings.HasPrefix(words[i], `Y`)) {
			output = append(output, words[i])
		}
	}
	fmt.Println(output)
}
// => [yž]

See the Go demo.

Note that a naive approach like

package main

import (
	"fmt"
	"regexp"
)

func main() {
	output := []string{}
	// A non-letter (or the start of the string), then a word starting
	// with "y", then a non-letter (or the end of the string).
	r := regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
	str := "x--++yž,,,.z..00..."
	matches := r.FindAllStringSubmatch(str, -1)
	for _, v := range matches {
		output = append(output, v[1])
	}
	fmt.Println(output)
}

won't work when the string contains consecutive matches such as match1,match2 match3: it will only fetch the odd occurrences, because the last non-capturing group consumes the character that the first non-capturing group needs to match at the start of the next match.

A workaround for the above code is to add a non-letter character to the end of each streak of non-letter characters, say

package main

import (
	"fmt"
	"regexp"
)

func main() {
	output := []string{}
	r := regexp.MustCompile(`(?i)(?:\P{L}|^)(u\p{L}*)(?:\P{L}|$)`)
	str := "uhličitá,uhličité,uhličitou,uhličitého,yz,my"
	// Pad every run of non-letters with a space so that consecutive
	// matches no longer compete for the same separator character.
	matches := r.FindAllStringSubmatch(regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `), -1)
	for _, v := range matches {
		output = append(output, v[1])
	}
	fmt.Println(output)
}
// => [uhličitá uhličité uhličitou uhličitého]

See this Go demo.

Here, regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `) adds a space after every chunk of non-letter characters.

huangapple
  • Published on October 11, 2021, 19:23:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/69525414.html