Java - What is the proper way to convert a UTF-8 String to binary?
Question
I'm using this code to convert a UTF-8 String to binary:
    public String toBinary(String str) {
        byte[] buf = str.getBytes(StandardCharsets.UTF_8);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < buf.length; i++) {
            int ch = (int) buf[i];
            String binary = Integer.toBinaryString(ch);
            result.append(("00000000" + binary).substring(binary.length()));
            result.append(' ');
        }
        return result.toString().trim();
    }
Previously I was using this code:
    private String toBinary2(String str) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            int ch = (int) str.charAt(i);
            String binary = Integer.toBinaryString(ch);
            if (ch < 256)
                result.append(("00000000" + binary).substring(binary.length()));
            else {
                binary = ("0000000000000000" + binary).substring(binary.length());
                result.append(binary.substring(0, 8));
                result.append(' ');
                result.append(binary.substring(8));
            }
            result.append(' ');
        }
        return result.toString().trim();
    }
These two methods can return different results; for example:
toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"
I think that is because the bytes of è are negative while the corresponding char is not (because char is a 2-byte unsigned integer).
What I want to know is: which of the two approaches is the correct one, and why?
Thanks in advance.
Answer 1
Score: 1
Whenever you want to convert text into binary data (or into text representing binary data, as you do here) you have to use some encoding.
Your toBinary uses UTF-8 for that encoding.
Your toBinary2 uses something that's not a standard encoding: it encodes every UTF-16 code unit* below 256 in a single byte and all others in 2 bytes. Unfortunately that is not a useful encoding, since for decoding you would have to know whether a single byte stands alone or is part of a 2-byte sequence, and the output gives no way to tell. (UTF-8 signals this in the leading bits of each byte; UTF-16 reserves the surrogate ranges for the same purpose.)
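To make the decoding problem concrete, here is a minimal sketch that reuses the toBinary2 logic from the question and feeds it two different strings. The class name AmbiguityDemo and the choice of inputs are only for illustration; any character whose two bytes both look like single-byte characters would do.

```java
public class AmbiguityDemo {
    // Same logic as toBinary2 in the question, made static for the demo.
    static String toBinary2(String str) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            int ch = str.charAt(i);
            String binary = Integer.toBinaryString(ch);
            if (ch < 256) {
                result.append(("00000000" + binary).substring(binary.length()));
            } else {
                binary = ("0000000000000000" + binary).substring(binary.length());
                result.append(binary, 0, 8);
                result.append(' ');
                result.append(binary.substring(8));
            }
            result.append(' ');
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        String oneChar = "\u4142"; // a single char, U+4142
        String twoChars = "AB";    // two chars, U+0041 and U+0042
        System.out.println(toBinary2(oneChar));  // 01000001 01000010
        System.out.println(toBinary2(twoChars)); // 01000001 01000010
        // Different inputs, identical output: a decoder cannot tell them apart.
        System.out.println(toBinary2(oneChar).equals(toBinary2(twoChars))); // true
    }
}
```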
tl;dr: toBinary seems correct; toBinary2 will produce output that can't be uniquely decoded back to the original string.
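One way to check that the UTF-8 output really can be decoded is a round trip. The sketch below is not from the question: fromBinary is a hypothetical helper, and toBinary is a slightly restated version of the asker's method that masks each byte with 0xFF instead of trimming the sign-extended string. It assumes the input is space-separated groups of 8 bits, as toBinary produces.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BinaryRoundTrip {
    // Encode: UTF-8 bytes, each rendered as 8 binary digits.
    static String toBinary(String str) {
        byte[] buf = str.getBytes(StandardCharsets.UTF_8);
        StringBuilder result = new StringBuilder();
        for (byte b : buf) {
            // b & 0xFF maps the signed byte to 0..255, so no sign extension.
            String binary = Integer.toBinaryString(b & 0xFF);
            result.append(String.format("%8s", binary).replace(' ', '0'));
            result.append(' ');
        }
        return result.toString().trim();
    }

    // Decode (hypothetical helper): parse each 8-bit group back into a byte.
    static String fromBinary(String bits) {
        if (bits.isEmpty()) {
            return "";
        }
        String[] groups = bits.split(" ");
        byte[] buf = new byte[groups.length];
        for (int i = 0; i < groups.length; i++) {
            buf[i] = (byte) Integer.parseInt(groups[i], 2);
        }
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e8"; // è
        String bits = toBinary(original);
        System.out.println(bits);                                // 11000011 10101000
        System.out.println(fromBinary(bits).equals(original));   // true
    }
}
```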
* You might be wondering where the mention of UTF-16 comes from: that's because all String objects in Java expose their contents as UTF-16. So if you use charAt you get UTF-16 code units (which just so happen to be equal to the Unicode code point for all characters that fit into the Basic Multilingual Plane).
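For a character outside the Basic Multilingual Plane the difference becomes visible. This is a small sketch, with the class name CodeUnitsDemo and the helper counts made up for the example; it uses U+1F600, written as its surrogate pair.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitsDemo {
    // Returns {UTF-16 code units, Unicode code points, UTF-8 bytes} for s.
    static int[] counts(String s) {
        return new int[] {
            s.length(),
            s.codePointCount(0, s.length()),
            s.getBytes(StandardCharsets.UTF_8).length
        };
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        int[] c = counts(s);
        System.out.println(c[0]); // 2 code units
        System.out.println(c[1]); // 1 code point
        System.out.println(c[2]); // 4 UTF-8 bytes
        // charAt sees only the first surrogate; codePointAt sees the whole character.
        System.out.println(Integer.toHexString(s.charAt(0)));      // d83d
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```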
Answer 2
Score: 0
This code snippet might help.
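    import java.nio.charset.StandardCharsets;

    String s = "Some String";
    // Name the charset explicitly; a bare getBytes() depends on the platform default.
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    StringBuilder binary = new StringBuilder();
    for (byte b : bytes) {
        int val = b;
        // Emit the 8 bits of this byte, most significant bit first.
        for (int i = 0; i < 8; i++) {
            binary.append((val & 128) == 0 ? 0 : 1);
            val <<= 1;
        }
    }
    System.out.println("'" + s + "' to binary: " + binary);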