Convert a UTF-8 string to ISO-8859-1

Question

How do I convert a UTF-8 string to ISO-8859-1 in Go?

I have tried searching, but I can only find conversions the other way, and the few solutions I found didn't work.

I need to convert strings containing special Danish characters such as æ, ø and å

ø => ø
etc.

Answer 1

Score: 4

Keep in mind that ISO-8859-1 supports only a tiny subset of the characters in Unicode. If you know for certain that your UTF-8 encoded string contains only characters covered by ISO-8859-1, you can use the following code.

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

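func main() {
	str := "Räv"

	// The encoder converts UTF-8 input to ISO-8859-1 bytes.
	encoder := charmap.ISO8859_1.NewEncoder()
	out, err := encoder.Bytes([]byte(str))
	if err != nil {
		// Returned when str contains a rune outside ISO-8859-1.
		panic(err)
	}

	// Print the encoded bytes as hex.
	fmt.Printf("%x\n", out)
}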

The above prints:

52e476

That is 0x52, 0xE4, 0x76, which matches the table at https://en.wikipedia.org/wiki/ISO/IEC_8859-1. The second character is the one to note: in UTF-8 it would be encoded as the two bytes 0xC3, 0xA4.
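To see the difference between the two encodings side by side, here is a minimal sketch that uses only the standard library instead of the charmap package. It assumes that ISO-8859-1 byte values coincide with the Unicode code points U+0000 to U+00FF, so each rune in that range can be cast directly to a byte. The helper name toLatin1 is hypothetical and not part of golang.org/x/text.

```go
package main

import (
	"fmt"
)

// toLatin1 is a hypothetical helper (not from golang.org/x/text).
// Assumption: ISO-8859-1 byte values equal the Unicode code points
// U+0000..U+00FF, so a rune in that range converts with a plain cast.
func toLatin1(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	// Ranging over a string decodes it rune by rune from UTF-8.
	for i, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("rune %q at byte offset %d is not representable in ISO-8859-1", r, i)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	str := "Räv"

	// Raw UTF-8 bytes of the Go string: ä takes two bytes.
	fmt.Printf("UTF-8:      % x\n", []byte(str))

	out, err := toLatin1(str)
	if err != nil {
		panic(err)
	}
	// ISO-8859-1 bytes: ä takes a single byte.
	fmt.Printf("ISO-8859-1: % x\n", out)
}
```

The first line should show four bytes (52 c3 a4 76) and the second three (52 e4 76). For anything beyond a quick experiment, the charmap encoder shown above is probably the safer choice, since it also covers the other single-byte encodings.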

If the string contains characters that aren't supported, e.g. if str is changed to "Räv💩v", then encoder.Bytes([]byte(str)) returns an error, which the code above turns into a panic:

panic: encoding: rune not supported by encoding.

goroutine 1 [running]:
main.main()
/Users/nj/Dev/scratch/main.go:15 +0x109
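If you would rather detect the problem before encoding than handle the error afterwards, one option is to scan the string first. The sketch below is standard library only and rests on the assumption that every rune up to U+00FF is representable in ISO-8859-1; unsupportedRunes is a hypothetical helper name, not an API of the charmap package.

```go
package main

import "fmt"

// unsupportedRunes is a hypothetical helper that returns every rune in s
// that lies above U+00FF. Assumption: ISO-8859-1 covers exactly the code
// points U+0000..U+00FF. Invalid UTF-8 decodes to U+FFFD and is reported too.
func unsupportedRunes(s string) []rune {
	var bad []rune
	for _, r := range s {
		if r > 0xFF {
			bad = append(bad, r)
		}
	}
	return bad
}

func main() {
	for _, s := range []string{"Räv", "Räv💩v"} {
		if bad := unsupportedRunes(s); len(bad) > 0 {
			fmt.Printf("%q cannot be encoded, unsupported runes: %q\n", s, bad)
			continue
		}
		fmt.Printf("%q is safe to encode\n", s)
	}
}
```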

If you are willing to lose the unconvertible characters, a simple solution is to use EncodeRune, which returns a boolean indicating whether the rune is in the charmap's repertoire.

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

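func main() {
	str := "Räv💩v"
	out := make([]byte, 0, len(str))

	// Ranging over a string yields runes decoded from UTF-8.
	for _, r := range str {
		// ok is false for runes that ISO-8859-1 cannot represent; skip them.
		if e, ok := charmap.ISO8859_1.EncodeRune(r); ok {
			out = append(out, e)
		}
	}

	fmt.Printf("%x\n", out)
}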

The above prints:

52e47676

i.e. the emoji has been stripped.
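Silently dropping characters can hide data loss, so a variation you might prefer is to substitute a placeholder byte instead. The following is only a sketch using the standard library, under the assumption that ISO-8859-1 bytes equal the code points U+0000 to U+00FF; toLatin1Lossy and its replacement parameter are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// toLatin1Lossy is a hypothetical helper: runes above U+00FF are replaced
// with the given byte instead of being dropped. Assumption: ISO-8859-1
// byte values equal the Unicode code points U+0000..U+00FF.
func toLatin1Lossy(s string, replacement byte) []byte {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			out = append(out, replacement)
			continue
		}
		out = append(out, byte(r))
	}
	return out
}

func main() {
	out := toLatin1Lossy("Räv💩v", '?')
	fmt.Printf("%x\n", out) // prints 52e4763f76
}
```

Here 0x3f is the question mark standing in for the emoji, so the position of the lost character stays visible in the output.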

huangapple
  • Published on 2022-10-22 18:12:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/74162649.html