Unexpected code point conversion

Question

Why is one byte converted to a rune with value 65533 instead of 132 in the following application?

I have an ASCII code conversion table (old code -> new code) that I need to implement, so I need the original byte values (132 in this case) in the converter.

Sample program:

package main

import (
    "io/ioutil"
    "flag"
    "bytes"
    "fmt"
)

func converter(r rune) rune {
    fmt.Printf("%v ", int(r))
    return r
}

func main() {

    // parse the command line
    var infile string
    flag.StringVar(&infile, "in", "", "input file")
    flag.Parse()

    // read the whole file at once
    b, err := ioutil.ReadFile(infile)
    if err != nil {
        panic(err)
    }

    fmt.Printf("%v\n", b)

    // convert charset
    converted := bytes.Map(converter, b)

    fmt.Printf("\n%v\n", converted)
}

Sample input file (in hex):

4A 84 6C 6B 0D 0A

Sample output from the application:

[74 132 108 107 13 10]
74 65533 108 107 13 10
[74 239 191 189 108 107 13 10]

Answer 1

Score: 1

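A rune is a Unicode code point, not an ASCII code, and bytes.Map interprets your byte slice as UTF-8.

If we look at the function that you are using:
https://golang.org/src/bytes/bytes.go?s=9029:9081#L344

We can see that every byte in the slice is first converted to a rune:

r := rune(s[i])

This line only widens the byte to a rune. For any byte of 0x80 or above, bytes.Map then decodes a UTF-8 sequence starting at s[i]. The byte 0x84 (132) is a continuation byte and cannot start a valid sequence, so it decodes to the replacement character U+FFFD (65533), which is written back as the three bytes 239 191 189.

In UTF-8, one character can occupy more than one byte. This is unlike ASCII, where one character always takes exactly one byte.

You can read more about UTF-8 here: https://en.wikipedia.org/wiki/UTF-8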

This is the reason you have the wrong result.
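
The numbers in the question's output can be reproduced with the standard unicode/utf8 package. The following is a minimal sketch, not the code of bytes.Map itself; the helper names decodeFirst and encode are made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeFirst decodes the first UTF-8 sequence in b and returns
// the rune together with the number of bytes it consumed.
func decodeFirst(b []byte) (rune, int) {
	return utf8.DecodeRune(b)
}

// encode returns the UTF-8 encoding of r as a fresh slice.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	// 0x84 is a UTF-8 continuation byte, so it cannot start a sequence.
	r, width := decodeFirst([]byte{0x84, 0x6C})
	fmt.Println(r, width, r == utf8.RuneError) // 65533 1 true

	// Writing the replacement character back takes three bytes.
	fmt.Println(encode(r)) // [239 191 189]
}
```

The single input byte 132 therefore turns into 65533 on the way in and into three bytes on the way out, which is why the converted slice is two bytes longer than the input.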

To fix it, iterate over your bytes with a for range loop and save the output to a new slice.

func converter(b byte) byte {
    // print the raw byte value; no UTF-8 decoding is involved
    fmt.Printf("%v ", int(b))
    return b
}

...

converted := make([]byte, len(b))

for i, v := range b {
   // v is your byte value - convert it here
   converted[i] = converter(v)
}
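
Since the goal is a table-driven conversion, one way to fill in the converter is a 256-entry lookup array indexed by the old byte value. This is only a sketch: the names newTable and convert are hypothetical, and the single mapping 0x84 -> 0xE4 is an invented placeholder, not the asker's real table.

```go
package main

import "fmt"

// newTable builds a 256-entry lookup table that maps every byte to
// itself, then applies the given overrides (old code -> new code).
func newTable(overrides map[byte]byte) [256]byte {
	var t [256]byte
	for i := range t {
		t[i] = byte(i)
	}
	for old, repl := range overrides {
		t[old] = repl
	}
	return t
}

// convert returns a copy of in with every byte translated through table.
// It works on raw bytes, so nothing is interpreted as UTF-8.
func convert(in []byte, table *[256]byte) []byte {
	out := make([]byte, len(in))
	for i, v := range in {
		out[i] = table[v]
	}
	return out
}

func main() {
	// Placeholder mapping (an assumption); substitute the real table.
	table := newTable(map[byte]byte{0x84: 0xE4})

	in := []byte{0x4A, 0x84, 0x6C, 0x6B, 0x0D, 0x0A} // sample file from the question
	fmt.Println(convert(in, &table))                 // [74 228 108 107 13 10]
}
```

Bytes without an override pass through unchanged, and the output always has the same length as the input.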

Answer 2

Score: 1

Read the bytes from the file, and then you can use something along these lines. The last column of the output is the code point, which equals the ASCII value for ASCII characters.

package main

import (
	"encoding/hex"
	"fmt"
	"unicode/utf8"
)

func main() {
	//s := "Hello, 世界"
	//Assuming the following is the hex you have read in from the file..
	b, err := hex.DecodeString("48656c6c6f2c20e4b896e7958c")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(b)
	s := string(b)
	// decode one rune at a time and advance by its width in bytes
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		fmt.Printf("%d\t%c\t%d\n", i, r, r)
		i += size
	}
	anotherWay(s)

}

// anotherWay does the same decoding with a range loop over the string
func anotherWay(s string) {
	fmt.Println("\nAnother way")
	for i, r := range s {
		fmt.Printf("%d\t%c\t%d\n", i, r, r)
	}
}

On playground : https://play.golang.org/p/9WusGxWv8w
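
Note that both loops above decode UTF-8, so they help only when the input is valid UTF-8. As a rough check against the bytes from the question, the sketch below compares ranging over a string with indexing it byte by byte; runeValues and byteValues are hypothetical helper names.

```go
package main

import "fmt"

// runeValues collects the values produced by ranging over s,
// which decodes s as UTF-8.
func runeValues(s string) []int {
	var out []int
	for _, r := range s {
		out = append(out, int(r))
	}
	return out
}

// byteValues collects the raw byte values of s by indexing.
func byteValues(s string) []int {
	out := make([]int, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, int(s[i]))
	}
	return out
}

func main() {
	// First four bytes of the question's sample file.
	s := string([]byte{0x4A, 0x84, 0x6C, 0x6B})

	fmt.Println(runeValues(s)) // [74 65533 108 107]
	fmt.Println(byteValues(s)) // [74 132 108 107]
}
```

Only the indexed version keeps 132, so a single-byte conversion table should be applied to the bytes rather than to decoded runes.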

Posted by huangapple on September 7, 2017, 18:16:11
Link: https://go.coder-hub.com/46093801.html