regexp.FindSubmatch with hex character codes
Question
I cannot get regexp.FindSubmatch to match in certain simple cases. For example, the following code works properly:
assigned := regexp.MustCompile(`\x7f`)
group := assigned.FindSubmatch([]byte{0x7f})
fmt.Println(group)
(In the playground it prints [[127]].)
But if I change the byte to 0x80, it no longer matches. Why?
Answer 1
Score: 2
As mentioned in the package documentation:
> All characters are UTF-8-encoded code points.
So the regular expression \x80 does not match the byte value 0x80, but rather the UTF-8 representation of the character U+0080. This is evident if we change your test program to:
func main() {
	assigned := regexp.MustCompile(`\x80`)
	// The input ends with 0xc2 0x80, the UTF-8 encoding of U+0080.
	group := assigned.FindSubmatch([]byte{1, 2, 3, 0xc2, 0x80})
	fmt.Println(group)
}
We now get a match for the two-byte sequence [[194 128]], which represents the character in question.
There is no way to switch the regexp package into a binary mode, so you will either need to convert your inputs to valid UTF-8, or use a different package to match your data.