String Encoding with Emoji in Java?

Question

I have a small test example like this:

    import java.nio.charset.StandardCharsets;

    public class Main {
        public static void main(String[] args) {
            String s = "🇻🇺";
            System.out.println(s);
            System.out.println(s.length());                                  // number of UTF-16 code units (chars)
            System.out.println(s.toCharArray().length);                      // same count, as a char[]
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);   // bytes in UTF-8
            System.out.println(s.getBytes(StandardCharsets.UTF_16).length);  // bytes in UTF-16
            System.out.println(s.codePointCount(0, s.length()));             // number of code points
            System.out.println(Character.codePointCount(s, 0, s.length()));  // same, via Character
        }
    }

And result is:

🇻🇺
4
4
8
10
2
2

I can't understand why a single Unicode character, the Vanuatu flag, has a length of 4, takes 8 bytes in UTF-8, and takes 10 bytes in UTF-16. I know Java uses UTF-16 and needs 1 char (2 bytes) per code point, so 4 chars for one Unicode character confuses me: I expected it to need only 2 chars, but the result is 4. Can someone explain this fully to help me understand? Many thanks.

Answer 1

Score: 5

Unicode flag emojis are encoded as two code points.

There are 26 Regional Indicator Symbols representing A-Z, and a flag is encoded by spelling out the two-letter ISO 3166-1 country code. For example, the Vanuatu flag is encoded as "VU", and the American flag as "US".
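To make the "spelling out" concrete, here is a minimal sketch that builds a flag string from a two-letter code. The class and method names (`RegionalIndicators`, `flagFor`) are made up for illustration; the only fact relied on is that REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6 and the remaining letters follow in alphabetical order.

```java
class RegionalIndicators {
    // U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; B..Z follow consecutively.
    private static final int REGIONAL_INDICATOR_A = 0x1F1E6;

    // Hypothetical helper: maps an uppercase two-letter code such as "VU"
    // to the two regional indicator code points that render as a flag.
    public static String flagFor(String countryCode) {
        if (countryCode.length() != 2) {
            throw new IllegalArgumentException("expected a two-letter code: " + countryCode);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < countryCode.length(); i++) {
            char letter = countryCode.charAt(i);
            if (letter < 'A' || letter > 'Z') {
                throw new IllegalArgumentException("expected A-Z but got: " + letter);
            }
            // Shift the ASCII letter into the regional indicator block.
            sb.appendCodePoint(REGIONAL_INDICATOR_A + (letter - 'A'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String flag = flagFor("VU");
        System.out.println(flag);                                  // renders as a flag if the font supports it
        System.out.println(flag.codePointCount(0, flag.length())); // 2 code points
        System.out.println(flag.length());                         // 4 chars
    }
}
```

Whether the pair is actually drawn as a flag is up to the font and renderer; the string itself is just two code points.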

The indicators are all in a supplementary plane (U+1F1E6 to U+1F1FF), so each one requires two UTF-16 code units (a surrogate pair). This brings the total to 4 Java chars per flag.
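A quick way to check this yourself is to walk the string by code point and ask how many chars each one occupies. This is only a sketch; `SurrogateDemo` and `describe` are illustrative names, and the flag is built from its code points (U+1F1FB, U+1F1FA) so the example does not depend on source-file encoding.

```java
class SurrogateDemo {
    // Hypothetical helper: one line per code point with its value, plane,
    // and the number of Java chars (UTF-16 code units) it occupies.
    public static String describe(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.append(
                String.format("U+%04X plane=%d chars=%d\n", cp, cp >>> 16, Character.charCount(cp))));
        return out.toString();
    }

    public static void main(String[] args) {
        // Vanuatu flag: regional indicators V (U+1F1FB) and U (U+1F1FA).
        String flag = new String(Character.toChars(0x1F1FB))
                    + new String(Character.toChars(0x1F1FA));
        System.out.print(describe(flag)); // two lines, each plane=1 chars=2
        System.out.print(describe("A"));  // one line, plane=0 chars=1
    }
}
```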

The purpose of this is to avoid having to update the standard whenever a country gains or loses independence, and it helps the Unicode Consortium stay neutral, since it doesn't have to be an arbiter of geopolitical claims.

Answer 2

Score: 1

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per Unicode code point. The first byte carries from 3 to 7 bits of the code point, and each subsequent byte carries 6 bits. Thus there are from 7 to 21 bits of payload.

The number of bytes needed depends on the particular character.
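As a rough illustration, the byte count can be predicted from the code point's range and compared with what Java's encoder produces. The helper names below are invented for this sketch, and it assumes valid code points (not lone surrogates). Each regional indicator lands in the 4-byte range, which is where the 8 bytes for the flag in the question come from (2 code points × 4 bytes).

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Hypothetical helper: expected UTF-8 length of one code point, by range.
    // Assumes a valid scalar value (not a lone surrogate).
    public static int expectedUtf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // 7 payload bits
        if (codePoint < 0x800) return 2;    // 5 + 6 = 11 bits
        if (codePoint < 0x10000) return 3;  // 4 + 6 + 6 = 16 bits
        return 4;                           // 3 + 6 + 6 + 6 = 21 bits
    }

    // Length actually produced by Java's UTF-8 encoder for one code point.
    public static int actualUtf8Length(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Samples: 'A', 'é', '€', and regional indicator V.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F1FB};
        for (int cp : samples) {
            System.out.printf("U+%04X expected=%d actual=%d%n",
                    cp, expectedUtf8Length(cp), actualUtf8Length(cp));
        }
    }
}
```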

See this Wikipedia page for the encoding.

UTF-16 uses either one 16-bit unit or two 16-bit units for a Unicode code point. Approximately speaking, code points in the first 64K (the Basic Multilingual Plane) are encoded as one unit; code points outside that range need two units.

"Approximately" because, actually, the code points that fit in one 16-bit unit are either in U+0000 to U+D7FF, or U+E000 to U+FFFF. The values in between those two ranges (the surrogates, U+D800 to U+DFFF) are used for the two-unit format.
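For the two-unit format, the arithmetic can be written out by hand and checked against `Character.toChars`. This is a sketch of the standard surrogate-pair calculation; `SurrogatePair` and `toSurrogates` are illustrative names, not part of any library.

```java
import java.util.Arrays;

class SurrogatePair {
    // Hypothetical helper: splits a supplementary code point (above U+FFFF)
    // into its high and low surrogate code units.
    public static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F1FB); // regional indicator V
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83C DDFB
        // Should agree with the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F1FB))); // true
    }
}
```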

The number of 16-bit units needed depends on the particular character.

See this other Wikipedia page.
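That leaves the 10 bytes reported in the question. Four code units account for 8 bytes; the other 2 are the byte-order mark that Java writes when encoding with `StandardCharsets.UTF_16` (big-endian, preceded by FE FF). A small sketch to see this, with `Utf16Bom` and `hex` as illustrative names:

```java
import java.nio.charset.StandardCharsets;

class Utf16Bom {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Vanuatu flag: U+1F1FB U+1F1FA.
        String flag = new String(Character.toChars(0x1F1FB))
                    + new String(Character.toChars(0x1F1FA));
        byte[] withBom = flag.getBytes(StandardCharsets.UTF_16);   // BOM + big-endian units
        byte[] noBom = flag.getBytes(StandardCharsets.UTF_16BE);   // big-endian units only
        System.out.println(withBom.length + ": " + hex(withBom));  // 10: FE FF D8 3C DD FB D8 3C DD FA
        System.out.println(noBom.length + ": " + hex(noBom));      // 8: D8 3C DD FB D8 3C DD FA
    }
}
```

If you want the byte count without the mark, encoding with `UTF_16BE` or `UTF_16LE` avoids it.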

huangapple
  • Posted on October 6, 2020, 01:23:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64213394.html