String comparison in UTF-8
Question
I have a PHP script that is supposed to return a UTF-8 encoded string. However, in Java I can't seem to compare it with a Java string literal in any way.
If I print "OK" and response, they appear the same in the console. However, if I check equality
if ( "OK".equals(response) ) {
the result is false. I printed both out in binary: response is 11101111 10111011 10111111 01001111 01001011, whereas the Java String "OK" is 01001111 01001011, which is clearly ASCII. I tried to convert it to UTF-8 in a few ways, but to no avail:
String result2 = new String("OK".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
and
String result2 = new String("OK".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
Neither works; both still return ASCII codes for some reason.
byte[] result2 = "OK".getBytes(StandardCharsets.UTF_8);
System.out.print(new String(result2));
While this also prints the correct "OK" result, in binary it still returns ASCII.
I've tried to change the communication to numbers instead, but 1 still does not equal 1: Integer.parseInt(response) fails with an error message about "1", although in every other respect it is recognised as a normal String.
I'm looking for a solution, preferably one where "OK" is converted to UTF-8 rather than response being converted to ASCII, since I need to communicate with a PHP script along with 2 databases, all set to UTF-8. Java is started with the switch -Dfile.encoding=UTF8 to ensure national characters are not broken.
Answer 1
Score: 4
In UTF-8, all characters with code points of 127 or less are encoded as a single byte, identical to their ASCII encoding. Therefore "OK" is the same two bytes in UTF-8 and in ASCII.
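As a quick way to confirm this, the following minimal sketch compares the two encodings of a string byte by byte. The class name AsciiUtf8Demo and the helper sameBytesInAsciiAndUtf8 are made up for this illustration; only the standard getBytes and Arrays.equals calls come from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiUtf8Demo {
    // Hypothetical helper: true when the text encodes to the same bytes
    // in US-ASCII and in UTF-8.
    static boolean sameBytesInAsciiAndUtf8(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // "OK" contains only code points below 128.
        System.out.println(sameBytesInAsciiAndUtf8("OK")); // true
        System.out.println(Arrays.toString("OK".getBytes(StandardCharsets.UTF_8))); // [79, 75]
    }
}
```

This is why re-encoding "OK" as UTF-8 never changes its binary form: there is nothing to convert.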
11101111 10111011 10111111 01001111 01001011 is not just a plain "OK"; it is
0xEF, 0xBB, 0xBF, "OK"
where 0xEF, 0xBB, 0xBF is the UTF-8 encoding of a BOM (byte order mark, U+FEFF).
A BOM is not displayed by editors, but it is used to signal the encoding.
Those bytes were probably introduced in your PHP script before the opening <?php tag.
You need to configure your editor to save the file without a BOM.
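To see the effect on the Java side, here is a small sketch that decodes the exact five bytes from the question. The class name BomDemo and its members are hypothetical; the sketch relies on the fact that Java's UTF-8 decoder keeps a leading BOM as the character U+FEFF instead of dropping it.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // The five bytes printed in the question: EF BB BF, then 'O' and 'K'.
    static final byte[] RESPONSE_BYTES = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x4F, 0x4B
    };

    // Decodes bytes as UTF-8; a leading BOM survives as a character.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String response = decode(RESPONSE_BYTES);
        System.out.println(response.length());           // 3, not 2
        System.out.println(response.charAt(0) == 0xFEFF); // true
        System.out.println("OK".equals(response));        // false
    }
}
```

The decoded string has three characters, and the invisible first one is what makes the comparison fail.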
Update
If it is not possible to alter the PHP script, you can use a workaround:
// check whether the first character of the response is a BOM
if (!response.isEmpty() && (response.charAt(0) == 0xFEFF)) {
    // remove the first character
    response = response.substring(1);
}
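The same check can be wrapped in a reusable helper so that every response passes through it before being compared or parsed. This is only a sketch: BomStripper and stripBom are names invented here, and the null check is an added assumption about how the response might arrive.

```java
class BomStripper {
    // U+FEFF, the character a UTF-8 BOM decodes to.
    private static final char BOM = '\uFEFF';

    // Hypothetical helper: returns the text without a leading BOM,
    // or the text unchanged when there is none.
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Simulated responses that start with a BOM.
        System.out.println("OK".equals(stripBom("\uFEFFOK")));     // true
        System.out.println(Integer.parseInt(stripBom("\uFEFF1"))); // 1
    }
}
```

Without the stripping step, Integer.parseInt would throw a NumberFormatException on the second input, because U+FEFF is not a digit; that would also account for the numeric comparison failing in the question.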