runes within strings
Question
I am going through Go By Example, and the strings and runes section is terribly confusing.
Running this:
sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
fmt.Println(sample)
fmt.Printf("%%q: %q\n", sample)
fmt.Printf("%%+q: %+q\n", sample)
yields this:
��=� ⌘
%q: "\xbd\xb2=\xbc ⌘"
%+q: "\xbd\xb2=\xbc \u2318"
...which is fine. The 1st, 2nd and 4th runes seem to be non-printable, which I guess means that \xbd, \xb2 and \xbc are simply not supported by Unicode or something (correct me if I'm wrong), and so they show up as �. Both %q and %+q also correctly escape those 3 non-printable runes.
But now when I iterate over the string like so:
for _, runeValue := range sample {
	fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
}
suddenly the 3 non-printable runes are not escaped by %q and remain as �, and %+q attempts to reveal their underlying code point, which is obviously incorrect:
fffd, '�', '\ufffd'
fffd, '�', '\ufffd'
3d, '=' , '='
fffd, '�', '\ufffd'
20, ' ' , ' '
2318, '⌘', '\u2318'
Even more strangely, if I iterate over the string as a byte slice:
for _, runeValue := range []byte(sample) {
	fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
}
suddenly, these runes are no longer non-printable, and their underlying code points are correct:
bd, '½', '\u00bd'
b2, '²', '\u00b2'
3d, '=', '='
bc, '¼', '\u00bc'
20, ' ', ' '
e2, 'â', '\u00e2'
8c, '\u008c', '\u008c'
98, '\u0098', '\u0098'
Can someone explain what's happening here?
Answer 1
Score: 0
fmt.Printf does a lot of magic under the covers (type inspection and so on) to render as much useful information as it can. If you want to verify whether a string (or a byte slice) is valid UTF-8, use the standard library package unicode/utf8.
For example:
import "unicode/utf8"
var sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
fmt.Printf("%q valid? %v\n", sample, utf8.ValidString(sample)) // reports "false"
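The same check also works on a byte slice via utf8.Valid, and on parts of the string. Below is a minimal runnable sketch of that idea. The helper name checkBoth is made up for illustration and is not part of unicode/utf8. The slice offsets assume the sample string from the question.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// checkBoth is a hypothetical helper for this sketch (not a library function).
// It validates the same data once as a string and once as a byte slice.
func checkBoth(s string) (stringOK, bytesOK bool) {
	return utf8.ValidString(s), utf8.Valid([]byte(s))
}

func main() {
	sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

	// Check the whole string, the single byte 0x3d ('='), and the
	// trailing three bytes e2 8c 98 on their own.
	for _, part := range []string{sample, sample[2:3], sample[5:]} {
		stringOK, bytesOK := checkBoth(part)
		fmt.Printf("%+q: ValidString=%v Valid=%v\n", part, stringOK, bytesOK)
	}
}
```

Only the whole string should be reported as invalid: the '=' byte and the three-byte tail are each well-formed UTF-8 on their own.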
Scanning the individual runes of the string, we can identify what makes this string invalid (from a UTF-8 encoding perspective). Note: the hex value 0xfffd indicates that a bad rune was encountered. This error value is defined as the package constant utf8.RuneError:
for _, r := range sample {
	validRune := r != utf8.RuneError // is 0xfffd? i.e. bad rune?
	if validRune {
		fmt.Printf("'%c' validRune: true hex: %4x\n", r, r)
	} else {
		fmt.Printf("'%c' validRune: false\n", r)
	}
}
https://go.dev/play/p/9NO9xMvcxCp
produces:
'�' validRune: false
'�' validRune: false
'=' validRune: true hex: 3d
'�' validRune: false
' ' validRune: true hex: 20
'⌘' validRune: true hex: 2318
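As for why the loops in the question disagree: ranging over a string decodes UTF-8 as it goes. The bytes 0xbd, 0xb2 and 0xbc are continuation bytes, which cannot start a UTF-8 sequence, so each one decodes to U+FFFD and only one byte is consumed. Ranging over []byte does no decoding at all. Each element is a plain uint8, and %q formats that integer as if it were a code point, which is how 0xbd ends up printed as '½' (U+00BD). The following is a rough sketch of the decoding step using utf8.DecodeRuneInString. The decodeStep type and decodeSteps function are illustrative names, not library APIs.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeStep describes one step of decoding a string as UTF-8.
// It is a hypothetical type used only in this sketch.
type decodeStep struct {
	Offset  int  // byte offset where decoding started
	Rune    rune // decoded code point (utf8.RuneError on failure)
	Size    int  // number of bytes consumed
	Invalid bool // true when the bytes were not valid UTF-8
}

// decodeSteps walks s one rune at a time, as a range loop over a string does.
func decodeSteps(s string) []decodeStep {
	var steps []decodeStep
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// An invalid encoding yields (utf8.RuneError, 1). A genuine U+FFFD
		// stored in the string decodes to the same rune but consumes 3 bytes.
		invalid := r == utf8.RuneError && size == 1
		steps = append(steps, decodeStep{Offset: i, Rune: r, Size: size, Invalid: invalid})
		i += size
	}
	return steps
}

func main() {
	sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
	for _, st := range decodeSteps(sample) {
		fmt.Printf("offset %d: %+q size=%d invalid=%v\n", st.Offset, st.Rune, st.Size, st.Invalid)
	}

	// No decoding happens here: b is a uint8, and %+q treats its numeric
	// value as a code point.
	b := []byte(sample)[0]
	fmt.Printf("%T %+q\n", b, b)
}
```

Checking the size alongside the rune matters because comparing against utf8.RuneError alone cannot tell an invalid byte apart from a real U+FFFD character that happens to be in the input.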