Golang regexp with non-latin characters

Question

I am parsing words from some sentences and my \w+ regexp works fine with Latin characters. However, it totally fails with some Cyrillic characters.

Here is a sample app:

package main

import (
	"fmt"
	"regexp"
)

func get_words_from(text string) []string {
	words := regexp.MustCompile("\\w+")
	return words.FindAllString(text, -1)
}

func main() {
	text := "One, two three!"
	text2 := "Раз, два три!"
	text3 := "Jedna, dva tři čtyři pět!"
	fmt.Println(get_words_from(text))
	fmt.Println(get_words_from(text2))
	fmt.Println(get_words_from(text3))
}

It yields the following results:

[One two three]
[]
[Jedna dva t i ty i p t]

It returns an empty result for Russian, and for Czech it breaks the words apart at every accented letter.
I have no idea how to solve this issue. Could someone give me a piece of advice?

Or maybe there is a better way to split a sentence into words without punctuation?

Answer 1

Score: 31

The \w shorthand class matches only ASCII letters, digits, and the underscore in Go's regexp package, so you need the Unicode character class \p{L} to match letters from other scripts.

> \w word characters (== [0-9A-Za-z_])

Use a character class to include the digits and underscore:

    regexp.MustCompile("[\\p{L}\\d_]+")

Output of the demo:

[One two three]
[Раз два три]
[Jedna dva tři čtyři pět]

huangapple
  • Posted on May 27, 2015 at 20:42:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/30482793.html