2023年3月26日 01:24:11go评论123阅读模式

英文:

runes within strings

问题

我正在阅读Go By Example，其中的字符串和符文部分非常令人困惑。

运行以下代码：

    sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
    fmt.Println(sample)
    fmt.Printf("%%q: %q\n", sample)
    fmt.Printf("%%+q: %+q\n", sample)

输出结果为：

��=� ⌘
%q: "\xbd\xb2=\xbc ⌘"
%+q: "\xbd\xb2=\xbc \u2318"

这是正常的。第一个、第二个和第四个符文似乎是不可打印的，我猜这意味着\xbd、\xb2和\xbc在Unicode中不被支持，所以它们显示为�。%q和%+q都正确地转义了这三个不可打印的符文。

但是当我像这样迭代字符串时：

    for _, runeValue := range sample {
        fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
    }

突然间，%q没有转义这三个不可打印的符文，仍然显示为�，而%+q试图显示它们的底层代码点，这显然是不正确的：

 fffd, '�', '\ufffd'
 fffd, '�', '\ufffd'
 3d,   '=',  '='
 fffd, '�', '\ufffd'
 20,   ' ' ,  ' '
 2318, '⌘', '\u2318'

更奇怪的是，如果我将字符串作为字节切片进行迭代：

    for _, runeValue := range []byte(sample) {
        fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
    }

突然间，这些符文不再是不可打印的，它们的底层代码点也是正确的：

 bd, '½', '\u00bd'
 b2, '²', '\u00b2'
 3d, '=',  '='
 bc, '¼', '\u00bc'
 20, ' ',  ' '
 e2, 'â', '\u00e2'
 8c, '\u008c', '\u008c'
 98, '\u0098', '\u0098'

有人能解释一下这里发生了什么吗？

英文:

I am going through Go By Example, and the strings and runes section is terribly confusing.

Running this:

    sample := &quot;\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98&quot;
    fmt.Println(sample)
    fmt.Printf(&quot;%%q: %q\n&quot;, sample)
    fmt.Printf(&quot;%%+q: %+q\n&quot;, sample)

yields this:

��=� ⌘
%q: &quot;\xbd\xb2=\xbc ⌘&quot;
%+q: &quot;\xbd\xb2=\xbc \u2318&quot;

..which is fine. The 1st, 2nd and 4th rune seem to be non-printable, which I guess means that \xbd, \xb2 and \xbc are simply not supported by Unicode or something (correct me if im wrong), and so they show up as �. Both %q and %+q also correctly escape those 3 non-printable runes.

But now when I iterate over the string like so:

    for _, runeValue := range sample {
        fmt.Printf(&quot;% x, %q, %+q\n&quot;, runeValue, runeValue, runeValue)
    }

suddenly the 3 non-printable runes are not escaped by %q and remain as �, and %+q attempts to reveal their underlying code point, which is obviously incorrect:

 fffd, &#39;�&#39;, &#39;\ufffd&#39;
 fffd, &#39;�&#39;, &#39;\ufffd&#39;
 3d,   &#39;=&#39; ,  &#39;=&#39;
 fffd, &#39;�&#39;, &#39;\ufffd&#39;
 20,   &#39; &#39; ,  &#39; &#39;
 2318, &#39;⌘&#39;, &#39;\u2318&#39;

Even strangely, if I iterate over the string as a byte slice:

    for _, runeValue := range []byte(sample) {
        fmt.Printf(&quot;% x, %q, %+q\n&quot;, runeValue, runeValue, runeValue)
    }

suddenly, these runes are no longer non-printable, and their underlying code points are correct:

 bd, &#39;&#189;&#39;, &#39;\u00bd&#39;
 b2, &#39;&#178;&#39;, &#39;\u00b2&#39;
 3d, &#39;=&#39;, &#39;=&#39;
 bc, &#39;&#188;&#39;, &#39;\u00bc&#39;
 20, &#39; &#39;, &#39; &#39;
 e2, &#39;&#226;&#39;, &#39;\u00e2&#39;
 8c, &#39;\u008c&#39;, &#39;\u008c&#39;
 98, &#39;\u0098&#39;, &#39;\u0098&#39;

Can someone explain whats happening here?

答案1

得分: 0

fmt.Printf会在内部执行很多操作，通过类型检查等方式来呈现尽可能多的有用信息。如果你想验证一个字符串（或字节切片）是否是有效的UTF-8编码，可以使用标准库包encoding/utf8。

例如：

import "unicode/utf8"
var sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
fmt.Printf("%q valid? %v\n", sample, utf8.ValidString(sample)) // 输出 "false"

通过扫描字符串的每个符文，我们可以确定导致该字符串无效（从UTF-8编码角度）的原因。注意：十六进制值0xfffd表示遇到了无效的符文。这个错误值被定义为一个包常量utf8.RuneError：

for _, r := range sample {
    validRune := r != utf8.RuneError // 是否为0xfffd？即无效的符文？
    if validRune {
        fmt.Printf("'%c' validRune: true   hex: %4x\n", r, r)
    } else {
        fmt.Printf("'%c' validRune: false\n", r)
    }
}

输出结果为：

'�' validRune: false
'�' validRune: false
'=' validRune: true   hex:   3d
'�' validRune: false
' ' validRune: true   hex:   20
'⌘' validRune: true   hex: 2318

你可以在这里查看完整的示例代码。

英文:

fmt.Printf will do lot of magic under the covers to render as much useful information via type inspection etc. If you want to verify if a string (or a byte slice) is valid UTF-8 use the standard library package encoding/utf8.

For example:

import &quot;unicode/utf8&quot;
var sample = &quot;\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98&quot;
fmt.Printf(&quot;%q valid? %v\n&quot;, sample, utf8.ValidString(sample)) // reports &quot;false&quot;

Scanning the individual runes of the string we can identify what makes this string invalid (from a UTF-8 encoding perspective). Note: the hex value 0xfffd indicates a bad rune was encounter. This error value is defined as a package constant utf8.RuneError:

for _, r := range sample {
	validRune := r != utf8.RuneError // is 0xfffd? i.e. bad rune?
	if validRune {
		fmt.Printf(&quot;&#39;%c&#39; validRune: true   hex: %4x\n&quot;, r, r)
	} else {
		fmt.Printf(&quot;&#39;%c&#39; validRune: false\n&quot;, r)
	}
}

https://go.dev/play/p/9NO9xMvcxCp

produces:

&#39;�&#39; validRune: false
&#39;�&#39; validRune: false
&#39;=&#39; validRune: true   hex:   3d
&#39;�&#39; validRune: false
&#39; &#39; validRune: true   hex:   20
&#39;⌘&#39; validRune: true   hex: 2318

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

字符串中的符文

问题

答案1

在Go语言中的对象工厂（Object Factory）

如何在Revel控制器中添加新字段？

Go语言中关于上下文取消函数的最佳实践方法

What should I do if I do not want to use '{{ }}' in template of golang?

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。