

How to create UTF16 Strings







Is there a way to create a UTF16 string from scratch or from an actual UTF8 string that doesn't involve some weird "hack" like looping through each char and appending a 00 byte to make it a UTF16 char?

Ideally I would like to be able to do something like this:

String s = new String("TestData".getBytes(), StandardCharsets.UTF_16);

But that doesn't work as the string literal is interpreted as UTF8.


Score: 4



In Java, a String instance doesn't have an encoding. It just is: it represents the characters as characters, and therefore there is no encoding.

Encoding just isn't a thing except in transition: When you 'transition' a bunch of characters into a bunch of bytes, or vice versa - that operation cannot be performed unless a charset is provided.
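To make that concrete, here is a minimal sketch (the class name CharsetTransitions and its two helper methods are made up for illustration) showing that one and the same String yields different byte sequences depending on the charset named at the moment of conversion, and that decoding with the matching charset gives the same characters back:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
final class CharsetTransitions {

    // Characters -> bytes: a charset must be named for this transition.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Bytes -> characters: again, a charset must be named.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "TestData";
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        byte[] utf16le = encode(text, StandardCharsets.UTF_16LE);

        System.out.println(utf8.length);    // 8: one byte per ASCII character
        System.out.println(utf16le.length); // 16: two bytes per character
        // Decoding with the same charset restores the original characters.
        System.out.println(decode(utf16le, StandardCharsets.UTF_16LE).equals(text)); // true
    }
}
```

The String itself never changes here; only the byte arrays differ.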

Take, for example, your snippet. It is broken. You write:

"TestData".getBytes()


This compiles. That is unfortunate; it is an API design error in Java. You should never use these methods (that is, methods that silently paper over the fact that a charset IS involved). This IS a transition from characters (a String) to bytes. If you read the javadoc on the getBytes() method, it'll tell you that the 'platform default encoding' will be used. That makes it a fine formula for writing code that passes all tests on your machine and then fails at runtime somewhere else.

There are valid reasons to want platform default encoding, but I -strongly- encourage you to never use getBytes() regardless. If you run into one of these rare scenarios, write "TestData".getBytes(Charset.defaultCharset()) so that your code makes explicit that a charset-using conversion is occurring here, and that you intended it to be the platform default.
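As a rough illustration of that advice, the sketch below (ExplicitDefaultCharset and toPlatformBytes are hypothetical names) produces the same bytes as the no-argument getBytes(), but the dependency on the platform default is visible to the reader. Note that which charset is the default depends on the JDK version and the environment; from JDK 18 onward it is UTF-8 unless overridden.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class; the names here are illustrative only.
final class ExplicitDefaultCharset {

    // Same result as text.getBytes(), but the charset dependency is explicit.
    static byte[] toPlatformBytes(String text) {
        return text.getBytes(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        byte[] implicit = "TestData".getBytes();       // hides the charset
        byte[] explicit = toPlatformBytes("TestData"); // states the intent
        System.out.println(Arrays.equals(implicit, explicit));
    }
}
```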

So, going back to your question: There is no such thing as a UTF-16 string. (If 'string' here is to be taken as meaning: java.lang.String, and not a slang English term meaning 'sequence of bytes').

There IS such a thing as a sequence of bytes, representing Unicode characters encoded in UTF-16 format. In other words, 'a UTF-16 string', in Java, would look like byte[]. Not String.

Thus, all you really need is:

byte[] utf16 = "TestData".getBytes(StandardCharsets.UTF_16);
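One detail worth knowing when you do this: StandardCharsets.UTF_16 encodes as big-endian and prepends a byte-order mark, while UTF_16BE and UTF_16LE write no mark. The sketch below (Utf16Bytes and its hex helper are illustrative names, not part of any library) prints the bytes for the two-character string "Te" so the difference is visible:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
final class Utf16Bytes {

    // Renders a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF_16: big-endian with a leading byte-order mark (FE FF)
        System.out.println(hex("Te".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 54 00 65
        // UTF_16BE: big-endian, no byte-order mark
        System.out.println(hex("Te".getBytes(StandardCharsets.UTF_16BE))); // 00 54 00 65
        // UTF_16LE: little-endian, no byte-order mark
        System.out.println(hex("Te".getBytes(StandardCharsets.UTF_16LE))); // 54 00 65 00
    }
}
```

Which variant you want depends on what the consumer of the bytes expects.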

You write:

> But that doesn't work as the string literal is interpreted as UTF8.

That's a property of the code then, not of the string. If you have some code you can't change that will turn a string into bytes using the UTF8 charset, and you don't want that to happen, then find the source and fix it. There is no other solution.

In particular, trying to hack things such that you have a string with gobbledygook that has the crazy property that if you take this gobbledygook, turn it into bytes using the UTF8 charset, and then take those bytes and turn that back into a string using the UTF16 charset, that you get what you actually wanted - cannot work. This is theoretically possible (but a truly bad idea) for charsets that have the property that every sequence of bytes is representable, such as ISO_8859_1, but UTF-8 does not adhere to that property. There are sequences of bytes that are just an error in UTF-8; depending on how you decode, they either cause an exception or are silently replaced with the replacement character U+FFFD. On the flipside, it is not possible to craft a string such that encoding it with UTF-8 into a byte array produces any arbitrary desired sequence of bytes.
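A small sketch of that difference, under the assumption that you smuggle raw bytes through a String and back (RoundTripCheck and its methods are hypothetical names): the UTF-16 bytes for "T" survive a trip through ISO_8859_1 but not through UTF-8, because FE and FF are never valid in UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names here are illustrative only.
final class RoundTripCheck {

    // Decodes bytes to a String and encodes them back with the same charset;
    // returns true only if the original bytes come back unchanged.
    static boolean survivesRoundTrip(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        return Arrays.equals(bytes, decoded.getBytes(charset));
    }

    // Strict decoding: malformed input is reported instead of being replaced.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // UTF-16 (big-endian, with byte-order mark) bytes for "T": FE FF 00 54
        byte[] utf16 = "T".getBytes(StandardCharsets.UTF_16);

        System.out.println(survivesRoundTrip(utf16, StandardCharsets.ISO_8859_1)); // true
        System.out.println(survivesRoundTrip(utf16, StandardCharsets.UTF_8));      // false
        System.out.println(isValid(utf16, StandardCharsets.UTF_8));                // false
    }
}
```

Even where the ISO_8859_1 trick happens to work, the String in the middle is meaningless as text, which is why fixing the code that picks the wrong charset remains the real solution.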

  • Posted on August 10, 2020, 23:44:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63343486.html



