Golang regexp Boundary with Latin Character

Question

I have a small, tricky issue with Go's regexp package.
It seems the \b word-boundary assertion doesn't work
when the input contains Latin characters such as é.

I expected é to be treated as a regular word character,
but it is treated as a word boundary.

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// MustCompile panics on an invalid pattern instead of discarding the error.
	r := regexp.MustCompile(`\b(vis)\b`)
	fmt.Println(r.MatchString("re vis e"))
	fmt.Println(r.MatchString("revise"))
	fmt.Println(r.MatchString("révisé"))
}

The result was:

true
false
true

How can I make r.MatchString("révisé") return false?

Thank you

Answer 1

Score: 6

The issue is that \b only matches at ASCII word boundaries, as stated in the docs:

> at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
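
A quick way to see the quoted rule in action is to check what \w itself accepts. The sketch below is only an illustration; the variable names asciiWord and visBoundary are made up for this example and do not come from the original post.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiWord matches a string that is exactly one \w character.
// In Go's regexp package (RE2 syntax), \w is the ASCII class [0-9A-Za-z_].
var asciiWord = regexp.MustCompile(`^\w$`)

// visBoundary is the pattern from the question.
var visBoundary = regexp.MustCompile(`\b(vis)\b`)

func main() {
	fmt.Println(asciiWord.MatchString("e")) // true
	fmt.Println(asciiWord.MatchString("é")) // false: é is outside the ASCII class

	// é counts as \W, so \b sees a boundary between "é" and "v" and again
	// between "s" and "é". The match is reported as byte offsets.
	fmt.Println(visBoundary.FindStringIndex("révisé")) // [3 6]
}
```

The offsets are byte positions: é occupies two bytes in UTF-8, so "vis" starts at byte 3.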

é is not ASCII, so it does not count as \w, and \b sees a boundary on both sides of vis in "révisé". However, you can build your own \b replacement by combining other regex constructs. Here is a simple solution that handles the cases given in the question, though you may want to add more thorough matching:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // MustCompile panics on an invalid pattern instead of discarding the error.
    r := regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)
    fmt.Println(r.MatchString("vis")) // added this case
    fmt.Println(r.MatchString("re vis e"))
    fmt.Println(r.MatchString("revise"))
    fmt.Println(r.MatchString("révisé"))
}

Running this gives:

true
true
false
false

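What this solution does is essentially replace the leading \b with (?:\A|\s) and the trailing \b with (?:\s|\z): non-capturing groups that match either the start (or end) of the string or a whitespace character. Unlike \b, \s consumes a character, so the overall match can include the surrounding whitespace; the word itself is in capture group 1. You may want to add other possibilities here, such as punctuation.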
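
One way to cover punctuation and other scripts at once might be to describe the neighbouring character with Unicode classes rather than listing separators by hand. The following is a minimal sketch under that assumption; unicodeWordPattern is a hypothetical helper written for this example, not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// unicodeWordPattern is a hypothetical helper (not a standard library function).
// It builds a pattern that matches word only when each neighbour is either the
// edge of the text or a character that is not a Unicode letter, combining
// mark, digit, or underscore.
func unicodeWordPattern(word string) *regexp.Regexp {
	const nonWord = `[^\p{L}\p{M}\p{N}_]`
	return regexp.MustCompile(
		`(?:^|` + nonWord + `)(` + regexp.QuoteMeta(word) + `)(?:$|` + nonWord + `)`)
}

func main() {
	r := unicodeWordPattern("vis")
	fmt.Println(r.MatchString("vis"))      // true
	fmt.Println(r.MatchString("re vis e")) // true
	fmt.Println(r.MatchString("(vis)"))    // true: punctuation acts as a boundary
	fmt.Println(r.MatchString("revise"))   // false
	fmt.Println(r.MatchString("révisé"))   // false

	// The neighbouring character is consumed by the match, so read the word
	// itself from capture group 1.
	fmt.Println(r.FindStringSubmatch("re vis e")[1]) // vis
}
```

\p{M} is included so that an é written as e plus a combining accent is still treated as part of a word. Because Go's regexp has no lookaround, the neighbouring characters are consumed, so FindAllString can miss a second occurrence that is separated from the first by a single character.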

huangapple
  • Posted on February 4, 2016 at 12:50:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/35192744.html