April 6, 2020 19:37:04

Java - What is the proper way to convert a UTF-8 String to binary?

Question

I'm using this code to convert a UTF-8 String to binary:

public String toBinary(String str) {
    byte[] buf = str.getBytes(StandardCharsets.UTF_8);
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < buf.length; i++) {
        int ch = (int) buf[i];
        String binary = Integer.toBinaryString(ch);
        result.append(("00000000" + binary).substring(binary.length()));
        result.append(' ');
    }
    return result.toString().trim();
}

Previously I was using this code:

private String toBinary2(String str) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int ch = (int) str.charAt(i);
        String binary = Integer.toBinaryString(ch);
        if (ch < 256)
            result.append(("00000000" + binary).substring(binary.length()));
        else {
            binary = ("0000000000000000" + binary).substring(binary.length());
            result.append(binary.substring(0, 8));
            result.append(' ');
            result.append(binary.substring(8));
        }
        result.append(' ');
    }
    return result.toString().trim();
}

These two methods can return different results; for example:

toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"

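I think this is because the bytes of è are negative, while the corresponding char is not (because char is a 2-byte unsigned integer). What I want to know is: which of the two approaches is correct, and why? Thanks in advance.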

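Answer 1

Score: 1

Whenever you want to convert text into binary data (or into text representing binary data, as you do here), you have to use some encoding.

Your toBinary uses UTF-8 for that encoding.

Your toBinary2 uses something that is not a standard encoding: it encodes every UTF-16 code unit * below 256 in a single byte and all others in 2 bytes. Unfortunately, that is not a useful encoding, because when decoding you have to know whether a byte stands alone or is part of a 2-byte sequence (UTF-8 and UTF-16 signal this in the high-order bits of each unit).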
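The ambiguity described above can be shown with a small sketch. It reuses the logic of the question's toBinary2; the class name AmbiguityDemo and the two sample strings are chosen here purely for illustration. A single character U+0100 and the two-character sequence U+0001, U+0000 produce the same output, so a decoder cannot tell them apart.

```java
class AmbiguityDemo {
    // Same logic as the question's toBinary2: one 8-bit group for chars below 256,
    // two 8-bit groups for everything else.
    static String toBinary2(String str) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            int ch = str.charAt(i);
            String binary = Integer.toBinaryString(ch);
            if (ch < 256) {
                result.append(("00000000" + binary).substring(binary.length()));
            } else {
                binary = ("0000000000000000" + binary).substring(binary.length());
                result.append(binary, 0, 8).append(' ').append(binary.substring(8));
            }
            result.append(' ');
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        String oneChar = "\u0100";        // a single character, U+0100
        String twoChars = "\u0001\u0000"; // two characters, U+0001 followed by U+0000
        System.out.println(toBinary2(oneChar));   // 00000001 00000000
        System.out.println(toBinary2(twoChars));  // 00000001 00000000
        System.out.println(toBinary2(oneChar).equals(toBinary2(twoChars))); // true
    }
}
```

Two different inputs map to the same bit string, which is why no decoder can reliably recover the original text from this format.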

tl;dr: toBinary seems correct; toBinary2 produces output that can't be uniquely decoded back to the original string.
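As a rough check that the UTF-8 approach is reversible, here is a minimal round-trip sketch. The class name Utf8BinaryRoundTrip and the fromBinary helper are not from the question; they are assumptions added for illustration. This toBinary variant masks each byte with 0xFF instead of trimming a 32-bit string, which should give the same output as the question's version.

```java
import java.nio.charset.StandardCharsets;

class Utf8BinaryRoundTrip {
    // Encode: UTF-8 bytes, each rendered as 8 binary digits, separated by spaces.
    static String toBinary(String str) {
        StringBuilder result = new StringBuilder();
        for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
            // b & 0xFF maps the signed byte to 0..255, so toBinaryString yields at most 8 digits.
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to exactly 8 digits.
            result.append("00000000", bits.length(), 8).append(bits).append(' ');
        }
        return result.toString().trim();
    }

    // Decode (hypothetical helper): parse each 8-digit group into a byte, then decode as UTF-8.
    static String fromBinary(String binary) {
        if (binary.isEmpty()) {
            return "";
        }
        String[] groups = binary.split(" ");
        byte[] buf = new byte[groups.length];
        for (int i = 0; i < groups.length; i++) {
            buf[i] = (byte) Integer.parseInt(groups[i], 2);
        }
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e8";
        String binary = toBinary(original);
        System.out.println(binary);
        System.out.println(fromBinary(binary).equals(original)); // true
    }
}
```

Because every group is exactly one byte and UTF-8 itself marks where each character starts, the decoder needs no extra information to rebuild the string.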

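* You might be wondering where the mention of UTF-16 comes from: all String objects in Java are implicitly encoded in UTF-16. So charAt returns UTF-16 code units, which happen to equal the Unicode code point for every character in the Basic Multilingual Plane.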
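To illustrate the footnote, the sketch below compares a character inside the Basic Multilingual Plane with one outside it. The class name CodeUnitDemo, the describe helper, and the choice of U+1F600 as the sample are assumptions made for this example only.

```java
class CodeUnitDemo {
    // Hypothetical helper: reports how many UTF-16 code units and Unicode code points a string has.
    static String describe(String s) {
        return s.length() + " code unit(s), " + s.codePointCount(0, s.length()) + " code point(s)";
    }

    public static void main(String[] args) {
        String bmp = "\u00e8";                 // e with grave accent, U+00E8, inside the BMP
        String supplementary = "\uD83D\uDE00"; // U+1F600, outside the BMP, stored as a surrogate pair

        System.out.println(describe(bmp));            // 1 code unit(s), 1 code point(s)
        System.out.println(describe(supplementary));  // 2 code unit(s), 1 code point(s)

        // charAt returns a single code unit; codePointAt combines a surrogate pair.
        System.out.println(Integer.toHexString(supplementary.charAt(0)));      // d83d
        System.out.println(Integer.toHexString(supplementary.codePointAt(0))); // 1f600
    }
}
```

For characters outside the Basic Multilingual Plane, charAt returns only half of a surrogate pair, so a per-char loop like the one in toBinary2 never sees the real code point.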

Answer 2

Score: 0

This code snippet might help.

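import java.nio.charset.StandardCharsets;

String s = "Some String";
// Name the charset explicitly; the no-argument getBytes() relies on the JVM's default charset.
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
StringBuilder binary = new StringBuilder();
for (byte b : bytes) {
    int val = b;
    // Emit the 8 bits of each byte, most significant bit first.
    for (int i = 0; i < 8; i++) {
        binary.append((val & 128) == 0 ? 0 : 1);
        val <<= 1;
    }
}
System.out.println(s + " to binary: " + binary);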
