Java - What is the proper way to convert a UTF-8 String to binary?

Question

I'm using this code to convert a UTF-8 String to binary:

public String toBinary(String str) {
    byte[] buf = str.getBytes(StandardCharsets.UTF_8);
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < buf.length; i++) {
        int ch = (int) buf[i];
        String binary = Integer.toBinaryString(ch);
        result.append(("00000000" + binary).substring(binary.length()));
        result.append(' ');
    }
    return result.toString().trim();
}

Before I was using this code:

private String toBinary2(String str) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int ch = (int) str.charAt(i);
        String binary = Integer.toBinaryString(ch);
        if (ch < 256)
            result.append(("00000000" + binary).substring(binary.length()));
        else {
            binary = ("0000000000000000" + binary).substring(binary.length());
            result.append(binary.substring(0, 8));
            result.append(' ');
            result.append(binary.substring(8));
        }
        result.append(' ');
    }
    return result.toString().trim();
}

These two methods can return different results; for example:

toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"

I think that is because the bytes of è are negative while the corresponding char is not (char is a 2-byte unsigned integer).
What I want to know is: which of the two approaches is the correct one, and why?
Thanks in advance.
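
A minimal sketch (not part of the original question) that makes the two views of è concrete; it assumes java.util.Arrays and java.nio.charset.StandardCharsets are imported:

byte[] utf8 = "è".getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(utf8)); // [-61, -88]: the two signed bytes 0xC3 0xA8 of the UTF-8 encoding
System.out.println((int) "è".charAt(0));   // 232: the single unsigned UTF-16 value 0x00E8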

Answer 1

Score: 1

Whenever you want to convert text into binary data (or into text representing binary data, as you do here) you have to use some encoding.

Your toBinary uses UTF-8 for that encoding.

Your toBinary2 uses something that's not a standard encoding: it encodes every UTF-16 codepoint <sup>*</sup> below 256 in a single byte and all others in 2 bytes. Unfortunately that is not a useful encoding, since for decoding you would have to know whether a single byte stands alone or is part of a 2-byte sequence (UTF-8 and UTF-16 indicate which one it is with their highest-order bits).
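
To make the ambiguity concrete, here is a small sketch (an illustration added here, not from the original answer) of two different inputs that toBinary2 maps to the same output:

// 'Ш' (U+0428 = 1064) is one char >= 256, so it is split into two 8-bit groups;
// "\u0004(" is two chars (4 and 40), each < 256, so each becomes one 8-bit group.
System.out.println(toBinary2("\u0428"));  // 00000100 00101000
System.out.println(toBinary2("\u0004(")); // 00000100 00101000 -- identical output, different input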

tl;dr toBinary seems correct, toBinary2 will produce output that can't uniquely be decoded back to the original string.

<sup>* You might be wondering where the mention of UTF-16 comes from: That's because all String objects in Java are implicitly encoded in UTF-16. So if you use charAt you get UTF-16 codepoints (which just so happen to be equal to the Unicode code number for all characters that fit into the Basic Multilingual Plane).</sup>
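
As a round-trip check on the tl;dr, a minimal sketch (the fromBinary helper is hypothetical, not from the answer) that decodes toBinary's output back to the original string; no equivalent decoder can exist for toBinary2:

// Hypothetical helper: parses the space-separated 8-bit groups produced by
// toBinary back into bytes and decodes them as UTF-8.
static String fromBinary(String bits) {
    String[] groups = bits.split(" ");
    byte[] buf = new byte[groups.length];
    for (int i = 0; i < groups.length; i++) {
        // Parse each group as an unsigned value, then narrow to a (possibly negative) byte.
        buf[i] = (byte) Integer.parseInt(groups[i], 2);
    }
    return new String(buf, java.nio.charset.StandardCharsets.UTF_8);
}

// fromBinary("11000011 10101000") returns "è"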

Answer 2

Score: 0

This code snippet might help.

String s = "Some String";
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // specify the charset; the no-arg getBytes() uses the platform default
StringBuilder binary = new StringBuilder();
for (byte b : bytes) {
    int val = b;
    for (int i = 0; i < 8; i++) { // emit the 8 bits of each byte, most significant first
        binary.append((val & 128) == 0 ? 0 : 1);
        val <<= 1;
    }
}
System.out.println(" " + s + " to binary: " + binary);
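
For comparison, the same per-byte padding can be written with Integer.toBinaryString and a mask; a minimal sketch added for illustration, not part of the original answer:

StringBuilder sb = new StringBuilder();
for (byte b : "è".getBytes(StandardCharsets.UTF_8)) {
    // (b & 0xFF) zero-extends the signed byte; OR-ing in 0x100 forces a 9-bit
    // string whose last 8 characters are exactly the bits of the byte.
    sb.append(Integer.toBinaryString((b & 0xFF) | 0x100).substring(1)).append(' ');
}
System.out.println(sb.toString().trim()); // 11000011 10101000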
