2015年5月31日 18:56:15go评论95阅读模式

英文:

String to UCS-2

问题

我想将我的Python程序转换为Go，以将Unicode字符串转换为UCS-2 HEX字符串。

在Python中，这非常简单：

u"Bien joué".encode('utf-16-be').encode('hex')
-> 004200690065006E0020006A006F007500E9

我是Go的初学者，我找到的最简单的方法是：

package main

import (
	"fmt"
	"strings"
)

func main() {
	str := "Bien joué"
	fmt.Printf("str: %s\n", str)

	ucs2HexArray := []rune(str)
	s := fmt.Sprintf("%U", ucs2HexArray)
	a := strings.Replace(s, "U+", "", -1)
	b := strings.Replace(a, "[", "", -1)
	c := strings.Replace(b, "]", "", -1)
	d := strings.Replace(c, " ", "", -1)
	fmt.Printf("->: %s", d)
}

输出结果为：

str: Bien joué
->: 004200690065006E0020006A006F007500E9

我真的觉得这不是很高效。我该如何改进它？

谢谢。

英文:

I want to translate in Go my python program to convert an unicode string to a UCS-2 HEX string.

In python, it's quite simple:

u&quot;Bien jou&#233;&quot;.encode(&#39;utf-16-be&#39;).encode(&#39;hex&#39;)
-&gt; 004200690065006e0020006a006f007500e9

I am a beginner in Go and the simplest way I found is:

package main

import (
	&quot;fmt&quot;
	&quot;strings&quot;
)

func main() {
	str := &quot;Bien jou&#233;&quot; 
	fmt.Printf(&quot;str: %s\n&quot;, str)
	
	ucs2HexArray := []rune(str)
	s := fmt.Sprintf(&quot;%U&quot;, ucs2HexArray)
	a := strings.Replace(s, &quot;U+&quot;, &quot;&quot;, -1)
	b := strings.Replace(a, &quot;[&quot;, &quot;&quot;, -1)
	c := strings.Replace(b, &quot;]&quot;, &quot;&quot;, -1)
	d := strings.Replace(c, &quot; &quot;, &quot;&quot;, -1)
	fmt.Printf(&quot;-&gt;: %s&quot;, d)
}

str: Bien jou&#233;
-&gt;: 004200690065006E0020006A006F007500E9
Program exited.

I really think it's clearly not efficient. How can-I improve it?

Thank you

答案1

得分: 3

将此转换为一个函数，然后您可以在将来轻松改进转换算法。例如，

package main

import (
	"fmt"
	"strings"
	"unicode/utf16"
)

func hexUTF16FromString(s string) string {
	hex := fmt.Sprintf("%04x", utf16.Encode([]rune(s)))
	return strings.Replace(hex[1:len(hex)-1], " ", "", -1)
}

func main() {
	str := "Bien jou&#233;"
	fmt.Println(str)
	hex := hexUTF16FromString(str)
	fmt.Println(hex)
}

输出：

Bien jou&#233;
004200690065006e0020006a006f007500e9

注意：

您说“将Unicode字符串转换为UCS-2字符串”，但是您的Python示例使用了UTF-16：

u"Bien jou&#233;".encode('utf-16-be').encode('hex')

Unicode联盟

UTF-16常见问题

问：UCS-2和UTF-16有什么区别？

答：UCS-2是过时的术语，指的是Unicode实现，适用于Unicode 1.1之前，在版本2.0中添加了代理代码点和UTF-16之前。现在应避免使用此术语。

UCS-2不描述与UTF-16不同的数据格式，因为两者都使用完全相同的16位代码单元表示。但是，UCS-2不解释代理代码点，因此不能用于符合性地表示补充字符。

过去，有时会将实现标记为“UCS-2”，以指示它不支持补充字符，并且不将代理代码点对解释为字符。这样的实现将无法处理补充字符的字符属性、代码点边界、排序等处理。

英文:

Make this conversion a function then you can easily improve the conversion algorithm in the future. For example,

package main

import (
	&quot;fmt&quot;
	&quot;strings&quot;
	&quot;unicode/utf16&quot;
)

func hexUTF16FromString(s string) string {
	hex := fmt.Sprintf(&quot;%04x&quot;, utf16.Encode([]rune(s)))
	return strings.Replace(hex[1:len(hex)-1], &quot; &quot;, &quot;&quot;, -1)
}

func main() {
	str := &quot;Bien jou&#233;&quot;
	fmt.Println(str)
	hex := hexUTF16FromString(str)
	fmt.Println(hex)
}

Output:

Bien jou&#233;
004200690065006e0020006a006f007500e9

NOTE:

You say "convert an unicode string to a UCS-2 string" but your Python example uses UTF-16:

u&quot;Bien jou&#233;&quot;.encode(&#39;utf-16-be&#39;).encode(&#39;hex&#39;)

> The Unicode Consortium
>
> UTF-16 FAQ
>
> Q: What is the difference between UCS-2 and UTF-16?
>
> A: UCS-2 is obsolete terminology which refers to a Unicode
> implementation up to Unicode 1.1, before surrogate code points and
> UTF-16 were added to Version 2.0 of the standard. This term should now
> be avoided.
>
> UCS-2 does not describe a data format distinct from UTF-16, because
> both use exactly the same 16-bit code unit representations. However,
> UCS-2 does not interpret surrogate code points, and thus cannot be
> used to conformantly represent supplementary characters.
>
> Sometimes in the past an implementation has been labeled "UCS-2" to
> indicate that it does not support supplementary characters and doesn't
> interpret pairs of surrogate code points as characters. Such an
> implementation would not handle processing of character properties,
> code point boundaries, collation, etc. for supplementary characters.

答案2

得分: 3

对于除了非常短的输入（甚至可能包括短输入），我会使用golang.org/x/text/encoding/unicode包将其转换为UTF-16（正如@peterSo和@JimB指出的，与过时的UCS-2略有不同）。

使用这个包（以及golang.org/x/text/transform包）的优点是，你可以获得BOM支持、大端或小端字节序，并且可以对短字符串或字节进行编码/解码，还可以将其作为过滤器应用于io.Reader或io.Writer，在处理数据时进行转换，而不是一次性全部处理（即对于大量数据流，你不需要一次性将其全部加载到内存中）。

例如：

package main

import (
	"bytes"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"strings"

	"golang.org/x/text/encoding/unicode"
	"golang.org/x/text/transform"
)

const input = "Bien joué"

func main() {
	// 获取用于编码的`transform.Transformer`。
	e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
	t := e.NewEncoder()
	// 对于解码，允许在开头有字节顺序标记（Byte Order Mark），
	// 以切换到相应的Unicode解码（UTF-8、UTF-16BE或UTF-16LE），
	// 否则我们使用`e`（UTF-16BE无BOM）：
	t2 := unicode.BOMOverride(e.NewDecoder())
	_ = t2 // 我们不显示/使用这个

	// 如果你有一个字符串：
	str := input
	outstr, n, err := transform.String(t, str)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("string:   n=%d, bytes=%02x\n", n, []byte(outstr))

	// 如果你有一个[]byte：
	b := []byte(input)
	outbytes, n, err := transform.Bytes(t, b)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("bytes:    n=%d, bytes=%02x\n", n, outbytes)

	// 如果你有一个用于输入的io.Reader：
	ir := strings.NewReader(input)
	r := transform.NewReader(ir, t)
	// 现在只需像正常读取r一样读取，编码将在读取时进行，
	// 对于大型源，这样可以避免预先对所有内容进行编码。
	// 这里我们将一次性读取所有内容，这样就抵消了这个好处（通常避免使用ioutil.ReadAll）。
	outbytes, err = ioutil.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("reader: len=%d, bytes=%02x\n", len(outbytes), outbytes)

	// 如果你有一个用于输出的io.Writer：
	var buf bytes.Buffer
	w := transform.NewWriter(&buf, t)
	_, err = fmt.Fprint(w, input) // 或者从io.Reader复制，或者其他操作
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("writer: len=%d, bytes=%02x\n", buf.Len(), buf.Bytes())
}

// 你可以将下面的任何一个放在一个简单的函数中。
// NewUTF16BEWriter 返回一个新的写入器，它将w包装起来，
// 将写入的字节转换为UTF-16-BE。
func NewUTF16BEWriter(w io.Writer) io.Writer {
	e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
	return transform.NewWriter(w, e.NewEncoder())
}

// ToUTFBE 将UTF8的`b`转换为UTF-16-BE。
func ToUTF16BE(b []byte) ([]byte, error) {
	e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
	out, _, err := transform.Bytes(e.NewEncoder(), b)
	return out, err
}

输出：

string:   n=10, bytes=004200690065006e0020006a006f007500e9
bytes:    n=10, bytes=004200690065006e0020006a006f007500e9
reader: len=18, bytes=004200690065006e0020006a006f007500e9
writer: len=18, bytes=004200690065006e0020006a006f007500e9

英文:

For anything other than trivially short input (and possibly even then), I'd use the golang.org/x/text/encoding/unicode package to convert to UTF-16 (as @peterSo and @JimB point out, slightly different from obsolete UCS-2).

The advantage (over unicode/utf16) of using this (and the golang.org/x/text/transform package) is that you get BOM support, big or little endian, and that you can encode/decode short strings or bytes, but you can also apply this as a filter to an io.Reader or to an io.Writer to transform your data as you process it instead of all up front (i.e. for a large stream of data you don't need to have it all in memory at once).

E.g.:

package main
import (
&quot;bytes&quot;
&quot;fmt&quot;
&quot;io&quot;
&quot;io/ioutil&quot;
&quot;log&quot;
&quot;strings&quot;
&quot;golang.org/x/text/encoding/unicode&quot;
&quot;golang.org/x/text/transform&quot;
)
const input = &quot;Bien jou&#233;&quot;
func main() {
// Get a `transform.Transformer` for encoding.
e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
t := e.NewEncoder()
// For decoding, allows a Byte Order Mark at the start to
// switch to corresponding Unicode decoding (UTF-8, UTF-16BE, or UTF-16LE)
// otherwise we use `e` (UTF-16BE without BOM):
t2 := unicode.BOMOverride(e.NewDecoder())
_ = t2 // we don&#39;t show/use this
// If you have a string:
str := input
outstr, n, err := transform.String(t, str)
if err != nil {
log.Fatal(err)
}
fmt.Printf(&quot;string:   n=%d, bytes=%02x\n&quot;, n, []byte(outstr))
// If you have a []byte:
b := []byte(input)
outbytes, n, err := transform.Bytes(t, b)
if err != nil {
log.Fatal(err)
}
fmt.Printf(&quot;bytes:    n=%d, bytes=%02x\n&quot;, n, outbytes)
// If you have an io.Reader for the input:
ir := strings.NewReader(input)
r := transform.NewReader(ir, t)
// Now just read from r as you normal would and the encoding will
// happen as you read, good for large sources to avoid pre-encoding
// everything. Here we&#39;ll just read it all in one go though which negates
// that benefit (normally avoid ioutil.ReadAll).
outbytes, err = ioutil.ReadAll(r)
if err != nil {
log.Fatal(err)
}
fmt.Printf(&quot;reader: len=%d, bytes=%02x\n&quot;, len(outbytes), outbytes)
// If you have an io.Writer for the output:
var buf bytes.Buffer
w := transform.NewWriter(&amp;buf, t)
_, err = fmt.Fprint(w, input) // or io.Copy from an io.Reader, or whatever
if err != nil {
log.Fatal(err)
}
fmt.Printf(&quot;writer: len=%d, bytes=%02x\n&quot;, buf.Len(), buf.Bytes())
}
// Whichever of these you need you could of
// course put in a single simple function. E.g.:
// NewUTF16BEWriter returns a new writer that wraps w
// by transforming the bytes written into UTF-16-BE.
func NewUTF16BEWriter(w io.Writer) io.Writer {
e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
return transform.NewWriter(w, e.NewEncoder())
}
// ToUTFBE converts UTF8 `b` into UTF-16-BE.
func ToUTF16BE(b []byte) ([]byte, error) {
e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
out, _, err := transform.Bytes(e.NewEncoder(), b)
return out, err
}

Gives:

string:   n=10, bytes=004200690065006e0020006a006f007500e9
bytes:    n=10, bytes=004200690065006e0020006a006f007500e9
reader: len=18, bytes=004200690065006e0020006a006f007500e9
writer: len=18, bytes=004200690065006e0020006a006f007500e9

答案3

得分: -1

标准库中有内置的utf16.Encode()函数（https://golang.org/pkg/unicode/utf16/#Encode），用于此目的。

英文:

The standard library has the built-in utf16.Encode() (https://golang.org/pkg/unicode/utf16/#Encode) function for this purpose.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

将字符串转换为UCS-2编码。

问题

答案1

答案2

答案3

函数表与switch在golang中的比较

Go语言错误/接口机制

谷歌应用程序API 403错误与服务帐号

当捕获到SIGINT和SIGTERM信号时，Go进程会过早终止。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论