How to create UTF16 Strings

Question

Is there a way to create a UTF16 string from scratch, or from an actual UTF8 string, that doesn't involve some weird "hack" like looping through each char and appending a 00 byte to make it a UTF16 char?

Ideally I would like to be able to do something like this:

String s = new String("TestData".getBytes(), StandardCharsets.UTF_16);

But that doesn't work as the string literal is interpreted as UTF8.

Answer 1

Score: 4

In Java, a String instance doesn't have an encoding. It just is - it represents the characters as characters, and therefore, there is no encoding.

Encoding just isn't a thing except in transition: When you 'transition' a bunch of characters into a bunch of bytes, or vice versa - that operation cannot be performed unless a charset is provided.
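
A minimal sketch of that idea (the class name here is arbitrary): the String itself only has a character count; byte counts appear, and differ, only once a charset is picked for the transition.

    import java.nio.charset.StandardCharsets;

    public class NoEncodingDemo {
        public static void main(String[] args) {
            String s = "TestData";          // just characters; no encoding attached to the object
            System.out.println(s.length()); // 8 characters, regardless of any charset

            // Bytes only come into existence once a charset is chosen for the transition:
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 8 bytes
            System.out.println(s.getBytes(StandardCharsets.UTF_16).length); // 18 bytes (2-byte BOM + 2 per char)
        }
    }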

Take, for example, your snippet. It is broken. You write:

"TestData".getBytes().

This compiles. That is unfortunate; this is an API design error in Java; you should never use these methods (that'd be: methods that silently paper over the fact that a charset IS involved). This IS a transition from characters (a String) to bytes. If you read the Javadoc on the getBytes() method, it'll tell you that the 'platform default encoding' will be used. This means it's a fine formula for writing code that passes all tests on your machine and will then fail at runtime.

There are valid reasons to want platform default encoding, but I -strongly- encourage you to never use getBytes() regardless. If you run into one of these rare scenarios, write "TestData".getBytes(Charset.defaultCharset()) so that your code makes explicit that a charset-using conversion is occurring here, and that you intended it to be the platform default.
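
A short sketch of the difference (class name arbitrary): the first call hides the charset decision, the other two spell it out.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class ExplicitCharsetDemo {
        public static void main(String[] args) {
            String s = "TestData";

            // Implicit: silently uses whatever the platform default charset happens to be.
            byte[] implicitDefault = s.getBytes();

            // If you genuinely want the platform default, say so explicitly:
            byte[] explicitDefault = s.getBytes(Charset.defaultCharset());

            // Usually what you actually want: a fixed, documented charset.
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

            System.out.println(implicitDefault.length + " " + explicitDefault.length + " " + utf8.length);
        }
    }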

So, going back to your question: There is no such thing as a UTF-16 string. (If 'string' here is to be taken as meaning java.lang.String, and not the slang English term meaning 'sequence of bytes'.)

There IS such a thing as a sequence of bytes representing Unicode characters encoded in UTF-16 format. In other words, 'a UTF-16 string', in Java, would look like byte[]. Not String.

Thus, all you really need is:

byte[] utf16 = "TestData".getBytes(StandardCharsets.UTF_16);
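
A small round-trip sketch built on that line (class name arbitrary): characters become UTF-16 encoded bytes, and the same charset must be named again to turn those bytes back into characters. Note that Java's UTF_16 charset writes a two-byte byte order mark when encoding.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Utf16BytesDemo {
        public static void main(String[] args) {
            // Characters -> bytes, with UTF-16 named for the transition.
            byte[] utf16 = "TestData".getBytes(StandardCharsets.UTF_16);
            System.out.println(utf16.length);           // 18: 2-byte BOM + 2 bytes per character
            System.out.println(Arrays.toString(utf16)); // [-2, -1, 0, 84, 0, 101, ...]

            // Bytes -> characters again; the charset has to be named for this transition too.
            String back = new String(utf16, StandardCharsets.UTF_16);
            System.out.println(back);                   // TestData
        }
    }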

You write:

> But that doesn't work as the string literal is interpreted as UTF8.

That's a property of the code then, not of the string. If you have some code you can't change that will turn a string into bytes using the UTF8 charset, and you don't want that to happen, then find the source and fix it. There is no other solution.

In particular, trying to hack things such that you have a string with gobbledygook in it that has the crazy property that if you take this gobbledygook, turn it into bytes using the UTF8 charset, and then take those bytes and turn them back into a string using the UTF16 charset, you end up with what you actually wanted - that cannot work. This is theoretically possible (but a truly bad idea) for charsets that have the property that every sequence of bytes is representable, such as ISO_8859_1, but UTF-8 does not adhere to that property. There are sequences of bytes that are simply invalid in UTF-8; depending on how the decoder is configured, they either cause an exception or are replaced with U+FFFD, destroying the data. On the flipside, it is not possible to craft a string such that encoding it with UTF-8 into a byte array produces an arbitrary desired sequence of bytes.
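
To make that concrete, a small sketch (class name arbitrary; the byte values are just one example of input that is malformed in UTF-8): ISO_8859_1 round-trips arbitrary bytes, while decoding the same bytes as UTF-8 via the String constructor silently replaces them with U+FFFD, so the original bytes cannot be recovered.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class RoundTripDemo {
        public static void main(String[] args) {
            // These byte values are not a valid UTF-8 sequence.
            byte[] arbitrary = { (byte) 0xC0, (byte) 0x80, (byte) 0xFF };

            // ISO-8859-1 maps every byte value to a character, so this round-trips losslessly:
            String latin1 = new String(arbitrary, StandardCharsets.ISO_8859_1);
            System.out.println(Arrays.equals(arbitrary, latin1.getBytes(StandardCharsets.ISO_8859_1))); // true

            // UTF-8 does not: the String constructor replaces the malformed input with U+FFFD,
            // so the original bytes are gone for good.
            String utf8 = new String(arbitrary, StandardCharsets.UTF_8);
            System.out.println(Arrays.equals(arbitrary, utf8.getBytes(StandardCharsets.UTF_8))); // false
        }
    }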
