To make Golang rune to utf-8 result same as js string.fromCharCode
Question
go
var int32s = []int32{
	8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
}
var result string
for _, val := range int32s {
	result += string(rune(val))
}
fmt.Println("word:", result)
The Go code above produces the same result as the JavaScript code. It walks the int32s slice, converts each element to the corresponding character and appends it to the result string, then prints "word:" followed by the value of result.
English:
go
var int32s = []int32{
	8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
}
fmt.Println("word: ", string(int32s))
js
let int32s = [8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26]
str = String.fromCharCode.apply(null, int32s);
console.log("word: " + String.fromCharCode.apply(null, int32s))
2 results above are not the same for some empty characters.
Is there a way to modify the Go code so that it generates the same result as the JS one?
Answer 1
Score: 2
To cite the docs on String.fromCharCode:
> The static String.fromCharCode() method returns a string created from the specified sequence of UTF-16 code units.
So each number in your int32s array is interpreted as a 16-bit integer providing a UTF-16 code unit, and the whole sequence is interpreted as a series of code units forming a UTF-16-encoded string.
I'd stress the last point because, judging from the name of the variable (int32s), whoever wrote the JS code appears to have an incorrect idea about what is happening there.
Now back to the Go counterpart. Go does not have built-in support for UTF-16 encodings; its strings are normally encoded using UTF-8 (though they are not required to be, but let's not digress), and Go also provides the rune data type, which is an alias for int32.
A rune is a Unicode code point, that is, a number which is able to contain a complete Unicode character.
(I'll get back to this fact and its relation to the JS code in a moment.)
Now, what's wrong with your string(int32s) is that it interprets your slice of int32s in the same way as []rune (remember that rune is an alias for int32), so it takes each number in the slice to represent a single Unicode character and produces a string of them.
(This string is internally encoded as UTF-8 but this fact is not really relevant to the problem.)
In other words, the difference is this:
- The JS code interprets the array as a sequence of 16-bit values representing a UTF-16-encoded string and converts it to some internal string representation.
- The Go code interprets the slice as a sequence of 32-bit Unicode code points and produces a string containing these code points.
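The two interpretations only diverge for values that are not code points in their own right, such as UTF-16 surrogates. A minimal sketch of that difference follows; the helper names fromCodePoints and fromCodeUnits are made up for this illustration and are not part of any library.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// fromCodePoints (hypothetical helper) treats every value as a complete
// Unicode code point, which is what string(int32s) does.
func fromCodePoints(vals []int32) string {
	return string(vals)
}

// fromCodeUnits (hypothetical helper) treats every value as a UTF-16
// code unit and decodes the sequence into code points first.
func fromCodeUnits(units []uint16) string {
	return string(utf16.Decode(units))
}

func main() {
	// 0xD83D 0xDE00 is the UTF-16 surrogate pair for U+1F600.
	// Read as two code points, both values are invalid and become U+FFFD.
	fmt.Printf("%+q\n", fromCodePoints([]int32{0xD83D, 0xDE00})) // "\ufffd\ufffd"
	// Read as code units, the pair decodes to a single code point.
	fmt.Printf("%+q\n", fromCodeUnits([]uint16{0xD83D, 0xDE00})) // "\U0001f600"
}
```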
The Go standard library provides a package to deal with UTF-16 encoding, unicode/utf16, and we can use it to do what the JS code does: decode a UTF-16-encoded string into a sequence of Unicode code points, which we can then convert to a Go string:
package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	var uint16s = []uint16{
		8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
	}
	runes := utf16.Decode(uint16s)
	fmt.Println("word: ", string(runes))
}
(Note that I've changed the type of the slice to []uint16 and renamed it accordingly. Also, I've decoded the source slice into an explicitly named variable; this is done for clarity, to highlight what's happening.)
This code produces the same gibberish as the JS code does in the Firefox console.
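If the numbers arrive as plain integers, as they do in the question, the same idea can be wrapped in a small function. This is only a sketch: fromCharCode is a name chosen here for illustration, and it assumes that keeping the low 16 bits of each argument is an acceptable stand-in for the way JS reduces its arguments to 16-bit values. One difference remains: a JS string can hold an unpaired surrogate, whereas utf16.Decode replaces it with U+FFFD.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// fromCharCode (hypothetical name) builds a Go string from integers that
// are read as UTF-16 code units, in the spirit of String.fromCharCode.
func fromCharCode(codes ...int) string {
	units := make([]uint16, len(codes))
	for i, c := range codes {
		units[i] = uint16(c) // keep only the low 16 bits of each value
	}
	// Decode the code units into code points, then encode them as UTF-8.
	return string(utf16.Decode(units))
}

func main() {
	fmt.Printf("%q\n", fromCharCode(8, 253, 80, 56))    // "\býP8"
	fmt.Printf("%+q\n", fromCharCode(0xD83D, 0xDE00)) // "\U0001f600"
}
```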
Update on the
> 2 results above are not the same for some empty characters.
bit which I did not touch.
The problem, as I understand it, is that your Go code prints something like
ýP8ÜÙ*ë!ÓçØê
while the JS code prints
�ýP8�ÜÙ*ë!Ó�çØê�
right?
The problem here is that fmt.Println and console.log interpret the resulting string differently.
Let me first state that your Go code happens to work correctly without using proper decoding as I've suggested—because all the integers in the slice are UTF-16 code units in the "basic" range, so "dumb" conversion works, and produces the same string as the JS code does.
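To make the "basic range" condition concrete, here is a rough check that can be run before choosing between the two conversions. The function name needsUTF16Decoding is invented for this sketch; it relies only on utf16.IsSurrogate from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// needsUTF16Decoding (hypothetical helper) reports whether string(vals)
// could differ from decoding vals as UTF-16 code units: that is the case
// when a value does not fit in 16 bits or is a surrogate.
func needsUTF16Decoding(vals []int32) bool {
	for _, v := range vals {
		if v < 0 || v > 0xFFFF || utf16.IsSurrogate(rune(v)) {
			return true
		}
	}
	return false
}

func main() {
	sample := []int32{
		8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
	}
	fmt.Println(needsUTF16Decoding(sample))                   // false
	fmt.Println(needsUTF16Decoding([]int32{0xD83D, 0xDE00})) // true
}
```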
To see both strings "as is" you could do this:
- For Go, use fmt.Printf with the %q verb to see "special" Unicode (and ASCII) characters "escaped" using the Go rules in the printout:
      fmt.Printf("%q\n", string(int32s))
  produces
      "\býP8\x1eÜÙ*ë!Ó\x17çØê\x1a"
  Notice these '\b', '\x1e' and other escapes:
  - '\b' is the ASCII BS (backspace) control character, code 0x08 (see <http://man-ascii.com/>).
  - '\x1e' is a byte with the code 0x1E, which is ASCII RS (record separator).
  - …and so on.
  As you can see, these are control characters, which are not printable.
- For JS, print the value of the resulting string without using console.log (just save its value in a variable, then enter its name at the console and hit Enter) to have its value printed "as is":
      > let int32s = [8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26]
      > str = String.fromCharCode.apply(null, int32s);
      > str
      "\u0008ýP8\u001eÜÙ*ë!Ó\u0017çØê\u001a"
  Note that the string contains the "\uXXXX" escapes. They define Unicode code points (BTW Go supports the same syntax), and these escapes define the same code points as can be seen in the Go example:
  - "\u0008" is a character with code 8, or 0x08.
  - "\u001e" is a character with code 0x1E.
  - …and so on.
As you can see, the strings produced are the same. The only difference is that Go's string is encoded in UTF-8, and fmt.Printf with %q prints each non-printable character using the shortest escape Go has for it ('\b', '\x1e' and so on), but we could use the escapes from the JS example as well: you can check that running
    fmt.Println("\býP8\x1eÜÙ*ë!Ó\x17çØê\x1a" == "\u0008ýP8\u001eÜÙ*ë!Ó\u0017çØê\u001a")
prints true.
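For anyone who wants to look at the Go string from more than one angle, the sketch below prints it with a few fmt verbs. The wrapper functions quoted and asciiQuoted are named here only for illustration; %q, %+q and % x are standard fmt verbs.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// quoted returns s as a Go string literal; printable non-ASCII
// characters are kept as they are.
func quoted(s string) string { return fmt.Sprintf("%q", s) }

// asciiQuoted returns s as a Go string literal using only ASCII, so
// non-ASCII characters show up as \uXXXX escapes.
func asciiQuoted(s string) string { return fmt.Sprintf("%+q", s) }

func main() {
	s := string([]rune{
		8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
	})
	fmt.Println(quoted(s))
	fmt.Println(asciiQuoted(s))
	fmt.Printf("% x\n", s) // the UTF-8 bytes, in hex
	// 16 characters, 8 of which take two bytes in UTF-8.
	fmt.Println(utf8.RuneCountInString(s), len(s)) // 16 24
}
```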
So, as you can see by now, console.log replaces each non-printable character with the special Unicode code point U+FFFD, which is called the Unicode replacement character and is usually rendered as a black rhombus with a white question mark in it.
Go's fmt.Println does not do that: it merely sends these bytes "as is" to the output.
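If the goal is simply to make the Go printout look like the console output quoted above, one option is to do the substitution by hand before printing. The sketch below assumes that "non-printable" can be approximated by unicode.IsControl; that is a guess about what the console does, not documented behaviour, and consoleLike is a name made up for the example. Note that it also replaces tabs and newlines, which are control characters too.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// consoleLike (hypothetical helper) returns a copy of s in which every
// control character is replaced with U+FFFD. It is meant for display
// only: the original characters are lost.
func consoleLike(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return '\uFFFD'
		}
		return r
	}, s)
}

func main() {
	s := string([]rune{
		8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
	})
	fmt.Println("word: " + consoleLike(s))
}
```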
Hope this explains the observed difference.