Matching multiple Unicode characters in Golang Regexp

Question

As a simplified example, I want to match the pattern ^⬛+$ against ⬛⬛⬛ and get ⬛⬛⬛ back from FindString.

	r := regexp.MustCompile("^⬛+$")
	matches := r.FindString("⬛️⬛️⬛️")
	fmt.Println(matches)

But it doesn't match successfully even though this would work with regular ASCII characters.

I'm guessing there's something I don't know about Unicode matching, but I haven't found any decent explanation in documentation yet.

Can someone explain the problem?

Answer 1

Score: 4

You need to account for every code point in the string. If you analyze the subject string, you will see that each visible square is actually two code points:

U+2B1B (BLACK LARGE SQUARE) followed by U+FE0F (VARIATION SELECTOR-16)
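One way to do that analysis yourself is to range over the string and print each rune. This is only a minimal sketch: the helper name codePoints is made up for illustration, and the subject string is written with explicit escapes on the assumption that it is exactly three U+2B1B/U+FE0F pairs.

```go
package main

import "fmt"

// codePoints is a hypothetical helper: it returns every Unicode code point
// of s formatted as "U+XXXX".
func codePoints(s string) []string {
	var out []string
	for _, r := range s { // ranging over a string yields runes, not bytes
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	// Assumed to be equivalent to the literal in the question; escapes make
	// the otherwise invisible U+FE0F visible in the source.
	s := "\u2B1B\uFE0F\u2B1B\uFE0F\u2B1B\uFE0F"

	fmt.Println(codePoints(s)) // [U+2B1B U+FE0F U+2B1B U+FE0F U+2B1B U+FE0F]
	fmt.Printf("%+q\n", s)     // %+q escapes all non-ASCII characters
}
```

Printing with %+q is a quick check whenever a string looks right on screen but refuses to match.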

So you need a regex that matches one or more \x{2B1B}\x{FE0F} pairs, anchored from the start to the end of the string:

^(?:\x{2B1B}\x{FE0F})+$

See the regex demo.

Note that you can use \p{M} to match any Unicode mark (category M), which covers the variation selector as well as combining diacritics:

^(?:\x{2B1B}\p{M})+$

See the Go demo:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	r := regexp.MustCompile(`^(?:\x{2B1B}\x{FE0F})+$`)
	// Three pairs of U+2B1B and U+FE0F, as in the question.
	matches := r.FindString("\u2B1B\uFE0F\u2B1B\uFE0F\u2B1B\uFE0F")
	fmt.Println(matches)
}
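
The \p{M} pattern can be tried the same way. The sketch below compares it with a relaxed variant that makes the mark optional. The relaxed pattern is my own assumption and not part of the answer; it may be useful if the input sometimes carries the selector and sometimes does not.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// strict is the \p{M} pattern from the answer: every square must be
	// followed by exactly one mark.
	strict = regexp.MustCompile(`^(?:\x{2B1B}\p{M})+$`)

	// relaxed is a hypothetical variant: each square may be followed by
	// zero or more marks.
	relaxed = regexp.MustCompile(`^(?:\x{2B1B}\p{M}*)+$`)
)

func main() {
	withVS := "\u2B1B\uFE0F\u2B1B\uFE0F\u2B1B\uFE0F" // squares with U+FE0F
	plain := "\u2B1B\u2B1B\u2B1B"                    // squares only

	fmt.Println(strict.MatchString(withVS))  // true
	fmt.Println(strict.MatchString(plain))   // false
	fmt.Println(relaxed.MatchString(withVS)) // true
	fmt.Println(relaxed.MatchString(plain))  // true
}
```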

Answer 2

Score: 0

The regular expression matches a string consisting solely of one or more ⬛ (U+2B1B, black large square).

The subject string is three pairs of black large square and variation selector-16 (U+FE0F). The variation selectors are invisible (on my terminal) and prevent a match.

Fix this by either removing the variation selectors from the subject string or adding the variation selector to the pattern.

Here's the first fix: https://go.dev/play/p/oKIVnkC7TZ1
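
If the subject string comes from outside your program, the first fix can also be applied at run time. This is a rough sketch and not necessarily what the linked playground does: the helper name stripVS16 is made up here, and it simply drops every U+FE0F before matching.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// squares is the pattern from the question, written with an escape
// instead of a literal ⬛.
var squares = regexp.MustCompile(`^\x{2B1B}+$`)

// stripVS16 is a hypothetical helper that removes every
// VARIATION SELECTOR-16 (U+FE0F) from s.
func stripVS16(s string) string {
	return strings.ReplaceAll(s, "\uFE0F", "")
}

func main() {
	subject := "\u2B1B\uFE0F\u2B1B\uFE0F\u2B1B\uFE0F"

	fmt.Printf("%q\n", squares.FindString(subject))            // "" (no match)
	fmt.Printf("%q\n", squares.FindString(stripVS16(subject))) // three squares, no selectors
}
```

Stripping the selector changes how the squares may be rendered (text style instead of emoji style), so adding it to the pattern is the safer choice when the original string has to be preserved.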

huangapple
  • Posted on May 22, 2022 at 12:03:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/72334719.html