Why is it said: CharacterStream classes are used to perform the input/output for the 16-bit Unicode characters?

Question

> When an I/O stream manages 8-bit bytes of raw binary data, it is
> called a byte stream. And, when the I/O stream manages 16-bit Unicode
> characters, it is called a character stream.

Byte streams are clear: they use 8-bit bytes. So if I were to write a character that needs 3 bytes, a byte stream would write only its last 8 bits, producing incorrect output.

That is why we use character streams. Say I want to write Latin Capital Letter Ạ. I would need 3 bytes to store it in UTF-8. But say I also want to store a 'normal' A. Now it would take 1 byte to store.

Do you see the pattern? We can't know how many bytes it will take to write any of these characters until we convert them. So my question is: why is it said that character streams manage 16-bit Unicode characters? When I wrote Ạ, which takes 3 bytes, the stream didn't cut it down to its last 16 bits the way byte streams cut down to the last 8 bits. What does that quote even mean, then?

Answer 1

Score: 3

In Java, a String is composed of a sequence of 16-bit chars, representing text stored in the UTF-16 encoding.
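As a minimal sketch of that point, using only `java.lang` (the class and method names here are made up for illustration), the following checks that Ạ occupies exactly one 16-bit `char` inside a `String`:

```java
class CharWidthDemo {
    // Number of 16-bit char units in the string (this is what String.length() counts).
    static int charUnits(String s) {
        return s.length();
    }

    // Numeric value of the first 16-bit unit.
    static int firstUnit(String s) {
        return s.charAt(0);
    }

    public static void main(String[] args) {
        String s = "\u1EA0"; // Ạ, LATIN CAPITAL LETTER A WITH DOT BELOW
        System.out.println(Character.SIZE);                    // 16
        System.out.println(charUnits(s));                      // 1
        System.out.println(Integer.toHexString(firstUnit(s))); // 1ea0
    }
}
```

The `\u1EA0` escape is used instead of a literal Ạ so the result does not depend on the encoding of the source file.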

A Charset is an object that describes how to convert Unicode characters to a sequence of bytes. UTF-8 is an example of a charset.
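To make the role of the charset concrete, here is a small sketch (the class and helper names are hypothetical) that encodes the same text with different charsets and compares the byte counts:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSizeDemo {
    // Number of bytes the given charset needs to encode the text.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String plainA = "A";
        String dottedA = "\u1EA0"; // Ạ
        System.out.println(encodedLength(plainA, StandardCharsets.UTF_8));     // 1
        System.out.println(encodedLength(dottedA, StandardCharsets.UTF_8));    // 3
        System.out.println(encodedLength(dottedA, StandardCharsets.UTF_16BE)); // 2
    }
}
```

`UTF_16BE` is chosen here rather than `UTF_16` because the latter prepends a byte order mark when encoding, which would add two bytes to the count.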

A character stream like Writer, when it outputs to a thing that contains bytes -- a file, or a byte output stream like OutputStream -- uses a Charset to convert Strings to simple byte sequences for output. (Technically, it converts the UTF-16 chars to Unicode characters and then converts those to byte sequences with the Charset.) A Reader, when reading from a byte source, does the reverse conversion.
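The reverse direction can be sketched the same way. Assuming the input bytes are UTF-8 (the class and method names are illustrative only), an `InputStreamReader` turns three bytes back into one 16-bit `char`:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderDecodeDemo {
    // Decodes UTF-8 bytes through a Reader and returns the first char read, or -1 if there is none.
    static int firstCharFromUtf8(byte[] utf8) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xE1, (byte) 0xBA, (byte) 0xA0}; // UTF-8 encoding of U+1EA0
        System.out.println(Integer.toHexString(firstCharFromUtf8(utf8))); // 1ea0
    }
}
```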

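In UTF-16, Ạ is represented as the 16-bit char 0x1EA0. It takes only 16 bits in UTF-16, not 24 bits as in UTF-8.

If you converted it to bytes with the UTF-8 encoding, as here:

// Needs java.io.ByteArrayOutputStream, java.io.OutputStreamWriter, java.io.Writer
// and java.nio.charset.StandardCharsets; the enclosing method must handle IOException.
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
    writer.write("Ạ"); // a single 16-bit char, U+1EA0
} // closing the writer flushes the encoded bytes into baos
return baos.toByteArray();

Then you would get the 3-byte sequence 0xE1 0xBA 0xA0, as expected.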

Answer 2

Score: 2

In Java, a character (char) is always 16 bits, as can be seen from its maximum value, 65535. This is why the quote is not wrong: a char is indeed 16 bits.

"How can all the Unicode characters be stored in just 16 bits?" you might ask. This is done in Java using the UTF-16 encoding. Here's how it works (in very simplified terms):

Every Unicode code point in the Basic Multilingual Plane (BMP) is encoded in 16 bits. (Yes, 16 bits is enough for that.) Every code point outside the BMP is encoded with a pair of 16-bit chars, called a surrogate pair.
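A short sketch of the difference, with U+1F600 picked arbitrarily as a code point outside the BMP (the class and method names are made up for illustration):

```java
class SurrogateDemo {
    // Number of 16-bit char units in the string.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u1EA0";                                  // Ạ, inside the BMP
        String astral = new String(Character.toChars(0x1F600)); // U+1F600, outside the BMP
        System.out.println(charUnits(bmp) + " " + codePoints(bmp));       // 1 1
        System.out.println(charUnits(astral) + " " + codePoints(astral)); // 2 1
        System.out.println(Character.isHighSurrogate(astral.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(astral.charAt(1)));   // true
    }
}
```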

"Ạ" (U+1EA0) is inside the BMP, so can be encoded with 16 bits.

You said:

> Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store!

That does not make the quote incorrect. The stream still "manages 16-bit characters", because that's what you will give it with Java code. When you call println on a PrintStream, you are giving it a String, which is a bunch of chars under the hood, which is a bunch of 16-bits. So it is really managing a stream of 16-bit characters. It's just that it outputs them in a different encoding.
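As a rough illustration, assuming the PrintStream is created with an explicit UTF-8 encoding (the class and method names are hypothetical), the same number of chars going in can produce a different number of bytes coming out:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintStreamDemo {
    // Prints the text through a UTF-8 PrintStream and returns the bytes it produced.
    static byte[] printAsUtf8(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, "UTF-8")) {
            out.print(text); // print, not println, so no line separator is added
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // every Java platform supports UTF-8
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(printAsUtf8("A").length);      // 1: one char in, one byte out
        System.out.println(printAsUtf8("\u1EA0").length); // 3: one char in, three bytes out
    }
}
```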

It's probably worth mentioning what happens when you try to print a character that is not in the BMP. This would still not make the quote incorrect. The quote does not say "code point". It says "character", which would refer to the high and low surrogates of the surrogate pair that you are printing.
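A sketch of that case, again using U+1F600 as an arbitrary example outside the BMP and a hypothetical helper name: the Writer receives two 16-bit chars and, with UTF-8 as the charset, emits four bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class SupplementaryWriteDemo {
    // Writes the text through a UTF-8 Writer and returns the encoded bytes.
    static byte[] writeAsUtf8(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String astral = new String(Character.toChars(0x1F600)); // one code point, two chars
        System.out.println(astral.length());            // 2
        System.out.println(writeAsUtf8(astral).length); // 4
    }
}
```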

huangapple
  • Published on September 7, 2020
  • Please keep this link when reposting: https://go.coder-hub.com/63770350.html