Why does Java StandardCharsets provide three "UTF-16" encoding types but Notepad only provides 2 options (BE & LE)?


Question


StandardCharsets provides three entries for UTF-16:

    /**
     * Sixteen-bit UCS Transformation Format, big-endian byte order
     */
    public static final Charset UTF_16BE = new sun.nio.cs.UTF_16BE();
    /**
     * Sixteen-bit UCS Transformation Format, little-endian byte order
     */
    public static final Charset UTF_16LE = new sun.nio.cs.UTF_16LE();
    /**
     * Sixteen-bit UCS Transformation Format, byte order identified by an
     * optional byte-order mark
     */
    public static final Charset UTF_16 = new sun.nio.cs.UTF_16();

Notepad (and Notepad++) provide the following:

[Screenshot: editor encoding options, where UTF-16 appears only as BE and LE variants]

Why is plain UTF-16 missing in Notepad? (Are UTF-16 and UTF-16 BE the same thing?)

Answer 1

Score: 1


This behaviour is documented in the Javadoc for java.nio.charset.Charset:

> - When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.
>
> - When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
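The encoding half of the quoted rules can be checked directly. The following is a minimal sketch (the class name `Utf16EncodeDemo` and the `hex` helper are made up for illustration) that encodes the single character "A" with each of the three charsets and prints the resulting bytes:

```java
import java.nio.charset.StandardCharsets;

class Utf16EncodeDemo {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A"; // U+0041

        // No BOM, high byte first.
        System.out.println("UTF_16BE: " + hex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        // No BOM, low byte first.
        System.out.println("UTF_16LE: " + hex(text.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Big-endian BOM (FE FF) followed by big-endian data.
        System.out.println("UTF_16:   " + hex(text.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```

Only `UTF_16` produces the two extra bytes `FE FF` at the start; the other two charsets emit the code units alone.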

In short, UTF_16BE and UTF_16LE never write a BOM and do not use one to pick the byte order, so they do not correspond to the "UTF-16 BE BOM" or "UTF-16 LE BOM" options in Notepad++, as you seem to imply.

UTF_16, on the other hand, does write a big-endian BOM when encoding, so it corresponds to choosing the "(Convert to) UTF-16 BE BOM" option in Notepad++. Note that the BOM is only "optional" when decoding.
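To illustrate the decoding side, here is a small sketch (class and field names are hypothetical) that feeds hand-written byte arrays to the decoders. It assumes the behaviour described in the Javadoc: `UTF_16` honours either BOM and falls back to big-endian when none is present, while `UTF_16LE` treats a leading BOM as an ordinary character.

```java
import java.nio.charset.StandardCharsets;

class Utf16DecodeDemo {
    // "A" preceded by a big-endian BOM, by a little-endian BOM, and with no BOM.
    static final byte[] BE_BOM = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
    static final byte[] LE_BOM = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
    static final byte[] NO_BOM = {0x00, 0x41};

    public static void main(String[] args) {
        // UTF_16 consumes the BOM and uses it to choose the byte order.
        System.out.println(new String(BE_BOM, StandardCharsets.UTF_16).equals("A"));
        System.out.println(new String(LE_BOM, StandardCharsets.UTF_16).equals("A"));
        // Without a BOM, UTF_16 assumes big-endian.
        System.out.println(new String(NO_BOM, StandardCharsets.UTF_16).equals("A"));

        // UTF_16LE does not strip the BOM: it is decoded as the character U+FEFF.
        String kept = new String(LE_BOM, StandardCharsets.UTF_16LE);
        System.out.println(kept.length() == 2 && kept.charAt(0) == '\uFEFF');
    }
}
```

The last case is the practical pitfall: decoding a BOM-prefixed file with `UTF_16LE` or `UTF_16BE` leaves an invisible U+FEFF at the start of the string.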

As for the Notepad options, they do not say whether they include a BOM, so I'm not sure whether they do. If they do not, they are equivalent to the encoding behaviour of UTF_16BE and UTF_16LE.
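One way to settle that uncertainty for a particular file is to look at its first two bytes. The sketch below is only an illustration: `BomSniffer` and `describeBom` are hypothetical names, and the file path is whatever you pass on the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class BomSniffer {
    // Reports which UTF-16 BOM, if any, the data starts with.
    // Caveat: FF FE followed by 00 00 is also the UTF-32 LE BOM; this sketch ignores that case.
    static String describeBom(byte[] head) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16 BE BOM";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16 LE BOM";
            }
        }
        return "no UTF-16 BOM";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BomSniffer <file saved by the editor>");
            return;
        }
        // Reads the whole file for simplicity; only the first two bytes matter.
        System.out.println(describeBom(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

Saving a short file from the editor with each encoding option and running this on it shows whether that option writes a BOM.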

As for why Notepad++ has no equivalent of the UTF_16BE and UTF_16LE options, or why Java has no "UTF-16 LE BOM" option, that is not really a useful question to ask. As Eric Lippert said,

> features are not magically implemented by default and then the implementations have to get removed by the development team for a good reason. Rather, all features are unimplemented by default and have to be thought of, designed, specified, implemented, tested, approved and shipped to customers. All that costs time and effort.
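That said, little-endian output with a BOM is easy to produce by hand if you need it. A minimal sketch, assuming it is acceptable to prepend U+FEFF to the text yourself (the class name `Utf16LeWithBom` and its `encode` method are made up for this example):

```java
import java.nio.charset.StandardCharsets;

class Utf16LeWithBom {
    // Prepends U+FEFF and encodes as little-endian, so the output starts with FF FE.
    static byte[] encode(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("A");
        // First two bytes are the little-endian BOM.
        System.out.println((bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE);
        // UTF_16 reads the BOM, switches to little-endian, and returns the text without it.
        System.out.println(new String(bytes, StandardCharsets.UTF_16).equals("A"));
    }
}
```

Reading the result back with `UTF_16` works because that decoder uses the BOM to select the byte order.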

huangapple
  • Posted on 2023-02-06 15:05:28
  • Please keep this link when reposting: https://go.coder-hub.com/75358262.html