Golang regexp with non-Latin characters
Question
I am parsing words from some sentences, and my \w+ regexp works fine with Latin characters. However, it fails completely with Cyrillic characters.
Here is a sample app:
package main

import (
	"fmt"
	"regexp"
)

func get_words_from(text string) []string {
	words := regexp.MustCompile("\\w+")
	return words.FindAllString(text, -1)
}

func main() {
	text := "One, two three!"
	text2 := "Раз, два три!"
	text3 := "Jedna, dva tři čtyři pět!"
	fmt.Println(get_words_from(text))
	fmt.Println(get_words_from(text2))
	fmt.Println(get_words_from(text3))
}
It yields the following results:
[One two three]
[]
[Jedna dva t i ty i p t]
It returns an empty result for Russian, and it breaks the Czech words into fragments at every accented letter.
I have no idea how to solve this issue. Could someone give me some advice?
Or maybe there is a better way to split a sentence into words without punctuation?
Answer 1
Score: 31
The \w shorthand class only matches ASCII word characters (letters, digits, and the underscore) in Go's regexp package, so you need the Unicode letter class \p{L}.
> \w word characters (== [0-9A-Za-z_])
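To see the difference directly, here is a small sketch that checks whole strings against both classes. The variable names asciiWord and unicodeLetter are made up for this illustration; only the regexp calls come from the standard library.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical names for illustration; both patterns are anchored
// so that the whole string has to match.
var (
	asciiWord     = regexp.MustCompile(`^\w+$`)    // ASCII only: [0-9A-Za-z_]
	unicodeLetter = regexp.MustCompile(`^\p{L}+$`) // any Unicode letter
)

func main() {
	for _, s := range []string{"three", "три", "tři"} {
		fmt.Printf("%q: \\w+ matches=%v, \\p{L}+ matches=%v\n",
			s, asciiWord.MatchString(s), unicodeLetter.MatchString(s))
	}
}
```

Only "three" should satisfy \w+, while all three words satisfy \p{L}+.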
Use a character class to include the digits and underscore:
regexp.MustCompile("[\\p{L}\\d_]+")
Output of the demo:
[One two three]
[Раз два три]
[Jedna dva tři čtyři pět]