regexp.FindSubmatch with hex character codes

Question

I cannot get regexp.FindSubmatch to work in certain simple cases. For example, the following code works properly:

assigned := regexp.MustCompile(`\x7f`)
group := assigned.FindSubmatch([]byte{0x7f})
fmt.Println(group)

(In the playground it prints [[127]].)

But if I change the byte to 0x80, it no longer matches. Why?

Answer 1

Score: 2

As mentioned in the package documentation:

> All characters are UTF-8-encoded code points.

So the regular expression \x80 does not match the byte value 0x80, but rather the UTF-8 representation of the character U+0080. This is evident if we change your test program to:

func main() {
	assigned := regexp.MustCompile(`\x80`)
	group := assigned.FindSubmatch([]byte{1, 2, 3, 0xc2, 0x80})
	fmt.Println(group)
}

We now get a match on the two-byte sequence [[194 128]], which is the UTF-8 encoding of the character in question.
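To make the difference between 0x7f and 0x80 concrete, here is a minimal sketch that prints the UTF-8 encoding of both code points and then tries the pattern against the raw byte and against the encoded form. The helper names encodeRune and matchBytes are made up for this illustration; they are not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// encodeRune (hypothetical helper) returns the UTF-8 encoding of r.
func encodeRune(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

// matchBytes (hypothetical helper) runs the pattern `\x80` against input
// and returns the submatches, or nil if there is no match.
func matchBytes(input []byte) [][]byte {
	re := regexp.MustCompile(`\x80`)
	return re.FindSubmatch(input)
}

func main() {
	fmt.Println(encodeRune(0x7f)) // [127]: code points below 0x80 are a single byte
	fmt.Println(encodeRune(0x80)) // [194 128]: U+0080 needs two bytes

	fmt.Println(matchBytes([]byte{0x80}))       // []: a lone 0x80 is not valid UTF-8
	fmt.Println(matchBytes([]byte{0xc2, 0x80})) // [[194 128]]
}
```

This is why the 0x7f case in the question works: for code points up to U+007F the byte value and the UTF-8 encoding coincide, so the mismatch only shows up from 0x80 onward.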

There is no way to switch the regexp package into a binary mode, so you will either need to convert your inputs to valid UTF-8, or use a different package to match your data.

huangapple
  • Posted on 2015-05-27 03:51:16