golang, £ char causing weird Â character

Question

I have a function that generates a random string from a string of valid characters. I'm occasionally getting weird results when it selects a £.

I've reduced it to the following minimal example:

func foo() string {
	validChars := "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~@:!£$%^&*"
	var result strings.Builder

	for i := 0; i < len(validChars); i++ {

		currChar := validChars[i]
		result.WriteString(string(currChar))
	}
	return result.String()
}

I would expect this to return:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~@:!£$%^&*

But it doesn't. It produces:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~@:!Â£$%^&*
                                                                  ^
                                             where did you come from?

If I take the £ sign out of the original validChars string, that weird Â goes away.

func foo() string {
	validChars := "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~@:!$%^&*"
	var result strings.Builder

	for i := 0; i < len(validChars); i++ {

		currChar := validChars[i]
		result.WriteString(string(currChar))
	}
	return result.String()
}

This produces:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~@:!$%^&*

Answer 1

Score: 9

A string in Go is, in effect, a read-only sequence of bytes. Your mental model of a string is probably that it consists of a slice of characters or, as we call them in Go, a slice of runes.

For many runes in your validChars string this is fine, as they are ASCII characters and can therefore be represented in a single byte in UTF-8. However, the £ rune is represented as 2 bytes.

Now if we consider the string "£", it consists of 1 rune but 2 bytes (0xC2 0xA3 in UTF-8). As I've mentioned, a string is really just a sequence of bytes. If we grab the first element, as you are effectively doing in your sample, we only get the first of the two bytes that represent £. When you convert that byte back to a string, Go treats its value as a code point, so 0xC2 becomes Â (U+00C2). The second byte, 0xA3, becomes £ (U+00A3), which is why the output reads Â£.
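
To make the byte-versus-rune distinction concrete, here is a minimal sketch of what happens to each byte of "£". The helper name byteAsString is hypothetical and only mirrors the string(currChar) conversion from the question.

```go
package main

import "fmt"

// byteAsString (hypothetical helper) mirrors string(currChar) in the question:
// converting a byte to a string yields the UTF-8 encoding of the code point
// whose numeric value equals that byte.
func byteAsString(b byte) string {
	return string(rune(b))
}

func main() {
	s := "£"

	// The single rune U+00A3 is stored as two bytes in UTF-8.
	fmt.Printf("% x\n", []byte(s)) // c2 a3

	// Indexing a string yields bytes, so each byte turns into its own code point.
	fmt.Printf("%q\n", byteAsString(s[0])) // "Â" (U+00C2)
	fmt.Printf("%q\n", byteAsString(s[1])) // "£" (U+00A3)
}
```

The second byte only looks correct by coincidence: its value, 0xA3, happens to equal the code point of £.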


The fix for your problem is to first convert string validChars to a []rune. Then, you can access its individual runes (rather than bytes) by index, and foo will work as expected. You can see it in action in this playground.
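
A minimal sketch of that fix might look like the following. The names fooFixed and fooRange are hypothetical, and this is not necessarily the code from the linked playground. The second variant is an alternative that assumes you only need to iterate, not index.

```go
package main

import (
	"fmt"
	"strings"
)

const validChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~@:!£$%^&*"

// fooFixed (hypothetical name) indexes a []rune instead of the string,
// so each element is a whole code point rather than a single byte.
func fooFixed() string {
	runes := []rune(validChars)
	var result strings.Builder

	for i := 0; i < len(runes); i++ {
		result.WriteRune(runes[i])
	}
	return result.String()
}

// fooRange (hypothetical name) relies on range over a string, which
// decodes one UTF-8 encoded rune per iteration.
func fooRange() string {
	var result strings.Builder
	for _, r := range validChars {
		result.WriteRune(r)
	}
	return result.String()
}

func main() {
	fmt.Println(fooFixed())
	fmt.Println(fooRange())
}
```

For the random-string case from the question, the same idea applies: pick a random index into the []rune, not into the string.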

Also note that len(validChars) will give you the count of bytes in the string. To get the count of runes, use utf8.RuneCountInString instead.
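
As a small illustration of the difference, the sketch below compares the two counts. The helper name counts is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts (hypothetical helper) returns the byte length and the rune count of s.
func counts(s string) (byteLen, runeCount int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := counts("a£b")
	fmt.Println(b, r) // 4 3: £ takes two bytes but is one rune
}
```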

Finally, here's a blog post from Rob Pike on the subject that you may find interesting.

huangapple
  • Posted on July 23, 2021, 19:00:10
  • Please keep this link when reposting: https://go.coder-hub.com/68498123.html