String comparison in UTF-8

Question

I have a PHP script that is supposed to return a UTF-8 encoded string. However, in Java I can't seem to compare it with one of Java's own strings in any way.

If I print "OK" and response, they look the same in the console. However, if I check for equality

if ( "OK".equals(response) ) {

the result is false. I printed both in binary: response is 11101111 10111011 10111111 01001111 01001011, whereas the Java string "OK" is 01001111 01001011, which is clearly ASCII. I tried to convert it to UTF-8 in a few ways, but to no avail:

String result2 = new String("OK".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);

and

String result2 = new String("OK".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);

Neither works; both still produce the ASCII codes for some reason.

byte[] result2 = "OK".getBytes(StandardCharsets.UTF_8);
System.out.print(new String(result2));

While this also prints the correct "OK", in binary it is still ASCII.

I also tried switching the communication to numbers, but 1 still does not equal 1: Integer.parseInt(response) fails with an error message about the input "1", although in every other respect response is recognised as a normal String.

I'm looking for a solution, preferably one where "OK" is converted to UTF-8 rather than response to ASCII, since I need to communicate with a PHP script and two databases, all set to UTF-8. Java is started with the switch -Dfile.encoding=UTF8 to ensure national characters are not broken.

Answer 1

Score: 4

In UTF-8, all characters with code points of 127 or less are encoded as a single byte. Therefore "OK" is the same two bytes in UTF-8 and in ASCII.
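
To check this claim directly, here is a minimal sketch that encodes "OK" with both charsets and prints the bits. The class name AsciiVsUtf8 and the helper toBits are made up for this illustration; only the standard String.getBytes and Arrays.equals calls are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiVsUtf8 {
    // Hypothetical helper: renders each byte as eight binary digits, space separated.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a full eight digits.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "OK".getBytes(StandardCharsets.UTF_8);
        byte[] ascii = "OK".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toBits(utf8));               // 01001111 01001011
        System.out.println(Arrays.equals(utf8, ascii)); // true
    }
}
```

So re-encoding "OK" as UTF-8 cannot change anything: the string is already what UTF-8 would produce.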

11101111 10111011 10111111 01001111 01001011 is not just a plain "OK"; it is

0xEF, 0xBB, 0xBF, "OK"

where 0xEF, 0xBB, 0xBF is a BOM (byte order mark).

A BOM is not displayed by editors, but it can be used to detect the encoding.
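
As a rough illustration of why the comparison fails, the sketch below decodes the five bytes from the question as UTF-8. The class name BomDecodeDemo is an assumption made for this example. Java's UTF-8 decoder keeps the BOM, so the decoded string starts with the invisible character U+FEFF.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // The five bytes from the question: EF BB BF 4F 4B.
    static final byte[] RESPONSE_BYTES = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x4F, 0x4B
    };

    // Decodes the bytes the same way a UTF-8 response would be decoded.
    static String decode() {
        return new String(RESPONSE_BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String response = decode();
        System.out.println(response.length());                        // 3
        System.out.println(Integer.toHexString(response.charAt(0)));  // feff
        System.out.println("OK".equals(response));                    // false
    }
}
```

The string has three characters, not two, which is why equals returns false even though the console output looks identical.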

Those bytes probably ended up in your PHP script before <?php, so they are sent as part of the output.

You need to configure your editor to save the file without a BOM.
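
If it is unclear whether a given file starts with a BOM, a byte-level check along these lines can help. This is only a sketch: the class name BomFileCheck and the method hasUtf8Bom are hypothetical, and the file to inspect is passed as a command-line argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomFileCheck {
    // Returns true if the data starts with the UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomFileCheck <file>");
            return;
        }
        // Reads the whole file; fine for scripts, wasteful for very large files.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasUtf8Bom(data) ? "UTF-8 BOM found" : "no UTF-8 BOM");
    }
}
```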

Update

If it is not possible to alter the PHP script, you can use a workaround:

  // check if the first symbol of the response is BOM
  if (!response.isEmpty() && (response.charAt(0) == 0xFEFF)) {
    // removing the first symbol
    response = response.substring(1);
  }
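
The same check can be wrapped in a small helper so it is applied consistently before every comparison or parse. The following is a sketch, not a definitive implementation; the class name BomStripper and the method stripBom are names invented for this example.

```java
public class BomStripper {
    // U+FEFF is what the UTF-8 bytes EF BB BF decode to.
    static final char BOM = '\uFEFF';

    // Returns the input without a leading BOM; other input is returned unchanged.
    static String stripBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == BOM) {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        String textResponse = BOM + "OK";
        String numericResponse = BOM + "1";
        System.out.println("OK".equals(stripBom(textResponse)));         // true
        // Without stripBom, Integer.parseInt would throw NumberFormatException here.
        System.out.println(Integer.parseInt(stripBom(numericResponse))); // 1
    }
}
```

This also accounts for the Integer.parseInt failure described in the question: the invisible leading character is not a digit, so the input is rejected even though it prints as 1.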

huangapple
  • Posted on October 9, 2020, 21:32:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64281128.html