Why does Java StandardCharsets provide three "UTF-16" encoding types but Notepad only provides 2 options (BE & LE)?




StandardCharsets provides three entries for UTF-16:

    /** Sixteen-bit UCS Transformation Format, big-endian byte order */
    public static final Charset UTF_16BE = new sun.nio.cs.UTF_16BE();

    /** Sixteen-bit UCS Transformation Format, little-endian byte order */
    public static final Charset UTF_16LE = new sun.nio.cs.UTF_16LE();

    /**
     * Sixteen-bit UCS Transformation Format, byte order identified by an
     * optional byte-order mark
     */
    public static final Charset UTF_16 = new sun.nio.cs.UTF_16();

Notepad (and Notepad++) provides the following options:

[Screenshot of the encoding options, not preserved]

Why is plain UTF-16 missing in Notepad? Are UTF-16 and UTF-16 BE the same thing?


Score: 1



These are documented in the Javadoc for java.nio.charset.Charset:

> - When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.
> - When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.

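So in short, UTF_16BE and UTF_16LE do not care about the BOM: they never write one, and when decoding they treat a leading one as an ordinary character. They therefore do not correspond to the "UTF-16 BE BOM" or "UTF-16 LE BOM" options in Notepad++, as you seem to imply.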
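To make the decoding rules concrete, here is a small sketch that decodes a few hand-written byte sequences and prints the resulting UTF-16 code units. The class name `Utf16DecodeDemo` and the helper `decodeToCodeUnits` are made up for this illustration; only `StandardCharsets` and `String` come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16DecodeDemo {
    // Decode the bytes with the given charset and list the resulting
    // UTF-16 code units as hex, e.g. "FEFF 0041".
    static String decodeToCodeUnits(byte[] bytes, Charset cs) {
        String s = new String(bytes, cs);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] beWithBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41}; // BE BOM + 'A'
        byte[] leWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // LE BOM + 'A'
        byte[] beNoBom   = {0x00, 0x41};                           // 'A', no BOM

        // UTF_16BE / UTF_16LE keep the BOM as the character U+FEFF.
        System.out.println(decodeToCodeUnits(beWithBom, StandardCharsets.UTF_16BE)); // FEFF 0041
        System.out.println(decodeToCodeUnits(leWithBom, StandardCharsets.UTF_16LE)); // FEFF 0041

        // UTF_16 consumes the BOM and uses it to pick the byte order.
        System.out.println(decodeToCodeUnits(beWithBom, StandardCharsets.UTF_16));   // 0041
        System.out.println(decodeToCodeUnits(leWithBom, StandardCharsets.UTF_16));   // 0041

        // Without a BOM, UTF_16 falls back to big-endian.
        System.out.println(decodeToCodeUnits(beNoBom, StandardCharsets.UTF_16));     // 0041
    }
}
```

With UTF_16BE or UTF_16LE, the decoded string starts with U+FEFF, which the caller would have to strip.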

On the other hand, UTF_16 does write a big-endian BOM when encoding, so it corresponds to choosing the "(Convert to) UTF-16 BE BOM" option in Notepad++. Note that for decoding, the BOM is "optional".
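The encoding side can be checked the same way. The sketch below encodes the one-character string "A" with each of the three charsets and prints the bytes; the class name `Utf16EncodeDemo` and the `hex` helper are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class Utf16EncodeDemo {
    // Render bytes as space-separated uppercase hex, e.g. "FE FF 00 41".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A";
        // UTF_16 writes a big-endian BOM (FE FF) followed by big-endian data.
        System.out.println("UTF_16   : " + hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // UTF_16BE and UTF_16LE write the data only, with no BOM.
        System.out.println("UTF_16BE : " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println("UTF_16LE : " + hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```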

As for the Notepad options, they do not say whether they include a BOM, so I'm not sure if they do. If they do not, then they would be equivalent to the encoding behaviour of UTF_16BE and UTF_16LE.

As for why Notepad++ does not have an equivalent of the UTF_16BE and UTF_16LE options, or why Java doesn't have a "UTF-16 LE BOM" option, that is not really a useful question to ask. As Eric Lippert said,

> features are not magically implemented by default and then the implementations have to get removed by the development team for a good reason. Rather, all features are unimplemented by default and have to be thought of, designed, specified, implemented, tested, approved and shipped to customers. All that costs time and effort.
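That said, if you need the bytes that a "UTF-16 LE BOM" file would contain, one possible workaround is to put the BOM character U+FEFF at the front of the text yourself and encode with UTF_16LE, which serialises it as FF FE. This is only a sketch: the class `Utf16LeWithBom` and the method `encodeUtf16LeWithBom` are hypothetical names, not JDK API.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeWithBom {
    // Hypothetical helper: prepend U+FEFF so that UTF_16LE emits the
    // little-endian BOM bytes FF FE ahead of the encoded text.
    static byte[] encodeUtf16LeWithBom(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeUtf16LeWithBom("A");
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF); // FF FE 41 00
        }
        System.out.println();

        // The UTF_16 decoder reads the BOM and picks little-endian.
        String roundTrip = new String(bytes, StandardCharsets.UTF_16);
        System.out.println(roundTrip.equals("A")); // true
    }
}
```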

  • Published on February 6, 2023, 15:05:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75358262.html



