Base64 UTF-16 encoding between Java, Python and JavaScript applications

Question

As an example, I have the following string, which I presume to be UTF-16 encoded: "hühühüh".

In Python, I get the following result when encoding it:
```python
>>> import base64
>>> base64.b64encode("hühühüh".encode("utf-16"))
b'//5oAPwAaAD8AGgA/ABoAA=='
```

In Java:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

String test = "hühühüh";
byte[] encodedBytes = Base64.getEncoder().encode(test.getBytes(StandardCharsets.UTF_16));
String testBase64Encoded = new String(encodedBytes, StandardCharsets.US_ASCII);
System.out.println(testBase64Encoded);
// prints: /v8AaAD8AGgA/ABoAPwAaA==
```

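In JavaScript, I define a binary encoding function as per the [Mozilla dev guideline](https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/btoa) and then encode the same string: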

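```javascript
// Pack the string's UTF-16 code units into a "binary string", one character per byte.
// The bytes come out in the platform's byte order, with no BOM.
function toBinary(string) {
    const codeUnits = new Uint16Array(string.length);
    for (let i = 0; i < codeUnits.length; i++) {
        codeUnits[i] = string.charCodeAt(i);
    }
    return String.fromCharCode(...new Uint8Array(codeUnits.buffer));
}

btoa(toBinary("hühühüh"))
// "aAD8AGgA/ABoAPwAaAA="
```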

As you can see, each encoder produced a different base64 string. So let's reverse the encoding.

In Python, all the generated strings decode fine:

```python
>>> base64.b64decode("//5oAPwAaAD8AGgA/ABoAA==").decode("utf-16")
'hühühüh'
>>> base64.b64decode("/v8AaAD8AGgA/ABoAPwAaA==").decode("utf-16")
'hühühüh'
>>> base64.b64decode("aAD8AGgA/ABoAPwAaAA=").decode("utf-16")
'hühühüh'
```

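In JavaScript, using the `fromBinary` function, again as per the [Mozilla dev guideline](https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/btoa):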

```javascript
// Reverse of toBinary: reinterpret each pair of bytes as one UTF-16 code unit.
function fromBinary(binary) {
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < bytes.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    console.log(...bytes); // debug output: the raw byte values
    return String.fromCharCode(...new Uint16Array(bytes.buffer));
}

fromBinary(window.atob("//5oAPwAaAD8AGgA/ABoAA=="))
// "\ufeffhühühüh"
fromBinary(window.atob("/v8AaAD8AGgA/ABoAPwAaA=="))
// "\ufffe栀ﰀ栀ﰀ栀ﰀ栀"
fromBinary(window.atob("aAD8AGgA/ABoAPwAaAA="))
// "hühühüh"
```

And finally in Java:

```java
// Three separate snippets, one per encoder output.
String base64Encoded = "//5oAPwAaAD8AGgA/ABoAA==";
byte[] asBytes = Base64.getDecoder().decode(base64Encoded);
String base64Decoded = new String(asBytes, StandardCharsets.UTF_16);
System.out.println(base64Decoded);
// prints: hühühüh

String base64Encoded = "/v8AaAD8AGgA/ABoAPwAaA==";
byte[] asBytes = Base64.getDecoder().decode(base64Encoded);
String base64Decoded = new String(asBytes, StandardCharsets.UTF_16);
System.out.println(base64Decoded);
// prints: hühühüh

String base64Encoded = "aAD8AGgA/ABoAPwAaAA=";
byte[] asBytes = Base64.getDecoder().decode(base64Encoded);
String base64Decoded = new String(asBytes, StandardCharsets.UTF_16);
System.out.println(base64Decoded);
// prints: 栀ﰀ栀ﰀ栀ﰀ栀 (no BOM, so the UTF_16 charset assumes big-endian)
```

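We can see that Python is able to decode the messages produced by the other two encoders. The Java and JavaScript implementations, however, do not seem to be compatible with each other, and I do not understand why.

Is this a problem with the base64 libraries in Java and JavaScript? If so, are there other tools or approaches for passing base64-encoded UTF-16 strings between a Java and a JavaScript application? How can I ensure safe base64 string transport between Java and JavaScript applications, using tools as close to core language functionality as possible?

Edit:
As stated in the accepted answer, the problem is the differing UTF-16 encodings. The compatibility problem between Java and JavaScript can be solved either by generating the UTF-16 bytes in JavaScript in reverse (big-endian) order, or by decoding the received string as `StandardCharsets.UTF_16LE`.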
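
The idea in this edit, naming the byte order explicitly on both sides instead of relying on each language's default, can be sketched in Python as a language-neutral illustration. The helper names `encode_utf16le_b64` and `decode_utf16le_b64` are hypothetical; in Java the matching charset would be `StandardCharsets.UTF_16LE`, as mentioned above.

```python
import base64

TEXT = "h\u00fch\u00fch\u00fch"  # the sample string "hühühüh"


def encode_utf16le_b64(text: str) -> str:
    """Encode text as UTF-16LE without a BOM, then as base64."""
    return base64.b64encode(text.encode("utf-16-le")).decode("ascii")


def decode_utf16le_b64(payload: str) -> str:
    """Decode base64, then interpret the bytes as UTF-16LE without a BOM."""
    return base64.b64decode(payload).decode("utf-16-le")


encoded = encode_utf16le_b64(TEXT)
print(encoded)                      # aAD8AGgA/ABoAPwAaAA=
print(decode_utf16le_b64(encoded))  # hühühüh
```

The encoded value is the same string the JavaScript `toBinary` helper produced in the question, which is why fixing one explicit byte order on both ends removes the ambiguity.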




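# Answer 1
**Score**: 4

The problem is that there are 4 variants of `UTF-16`.

This character encoding uses two bytes per code unit. Which of the two bytes should come first? This creates two variants:

- UTF-16BE stores the most significant byte first.
- UTF-16LE stores the least significant byte first.

To tell these two apart, there is an optional "byte order mark" (BOM) character, U+FEFF, at the start of the text. So UTF-16BE with BOM starts with the bytes `fe ff`, while UTF-16LE with BOM starts with `ff fe`. Since the BOM is optional, its presence doubles the number of possible encodings.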
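
As a small illustration of the four variants, the following Python sketch prints the bytes each one produces for the two-character string "hü". The dictionary `variants` and its labels are made up for this example; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are the standard library's BOM constants.

```python
import codecs

text = "h\u00fc"  # "hü": code units U+0068 and U+00FC

# Build each variant explicitly so nothing depends on platform defaults.
variants = {
    "UTF-16BE, no BOM": text.encode("utf-16-be"),
    "UTF-16LE, no BOM": text.encode("utf-16-le"),
    "UTF-16BE with BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-16LE with BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
}

for name, data in variants.items():
    # bytes.hex(" ") needs Python 3.8 or newer
    print(f"{name:18} {data.hex(' ')}")
```

The same text yields four different byte sequences, so base64 (which only encodes bytes) yields four different strings.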

It looks like you are using 3 of the 4 possible encodings:

- Python used UTF-16LE with BOM
- Java used UTF-16BE with BOM
- JavaScript used UTF-16LE without BOM
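
One way to check this against the three base64 strings from the question is to decode them and look at the first two bytes. This is a minimal sketch; the function name `describe_utf16` and the `samples` dictionary are hypothetical, and the check assumes the payload is UTF-16 to begin with.

```python
import base64
import codecs


def describe_utf16(payload: str) -> str:
    """Report which BOM, if any, a base64-encoded UTF-16 payload starts with."""
    raw = base64.b64decode(payload)
    if raw.startswith(codecs.BOM_UTF16_LE):  # bytes ff fe
        return "UTF-16LE with BOM"
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes fe ff
        return "UTF-16BE with BOM"
    return "no BOM"


samples = {
    "Python": "//5oAPwAaAD8AGgA/ABoAA==",
    "Java": "/v8AaAD8AGgA/ABoAPwAaA==",
    "JavaScript": "aAD8AGgA/ABoAPwAaAA=",
}

for source, payload in samples.items():
    print(source, "->", describe_utf16(payload))
```

For a payload without a BOM, the byte order cannot be read from the data itself and has to be agreed on by sender and receiver.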

One of the reasons why people prefer UTF-8 to UTF-16 is to avoid this confusion.
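
For comparison, here is a short Python sketch of the same round trip through UTF-8, which has no byte-order variants. It assumes both sides can switch encodings; on the Java side that would be `StandardCharsets.UTF_8`, and in JavaScript `TextEncoder` always emits UTF-8.

```python
import base64

text = "h\u00fch\u00fch\u00fch"  # "hühühüh"

# UTF-8 is a byte-oriented encoding, so there is only one byte sequence per string.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(encoded)  # aMO8aMO8aMO8aA==

decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded == text)  # True
```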



