July 28, 2020, 20:02:05

Java - read UTF-8 file with a single emoji symbol

Question

I have a file with a single Unicode symbol.
The file is encoded in UTF-8.
It contains a single symbol represented as 4 bytes:
https://www.fileformat.info/info/unicode/char/1f60a/index.htm

F0 9F 98 8A

When I read the file I get two symbols/chars.

The program below prints

?
2
?
?
55357
56842
======================================
&#55357;&#56842;
16
&
======================================
?
2
?
======================================

Is this normal... or a bug? Or am I misusing something?
How do I get that single emoji symbol in my code?

EDIT: And also... how do I escape it for XML?

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test008 {

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream("D:\\DATA\\test1.txt"), "UTF8"));

        String s = "";
        while ((s = in.readLine()) != null) {
            System.out.println(s);
            System.out.println(s.length());
            System.out.println(s.charAt(0));
            System.out.println(s.charAt(1));

            System.out.println((int)(s.charAt(0)));
            System.out.println((int)(s.charAt(1)));

            String z = org.apache.commons.lang.StringEscapeUtils.escapeXml(s);
            String z3 = org.apache.commons.lang3.StringEscapeUtils.escapeXml(s);

            System.out.println("======================================");
            System.out.println(z);
            System.out.println(z.length());
            System.out.println(z.charAt(0));

            System.out.println("======================================");
            System.out.println(z3);
            System.out.println(z3.length());
            System.out.println(z3.charAt(0));

            System.out.println("======================================");

        }

        in.close();
    }

}

Answer 1

Score: 4

Yes, this is normal: the Unicode symbol is stored as 2 UTF-16 chars (1 char is 2 bytes).

int codePoint = s.codePointAt(0); // Your code point.
System.out.printf("U+%04X, chars: %d%n", codePoint, Character.charCount(codePoint));

U+1F60A, chars: 2
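
To make the difference between chars and code points concrete, here is a minimal sketch that decodes the four UTF-8 bytes from the question (F0 9F 98 8A) in memory instead of reading the file. The class and method names are hypothetical and only serve the illustration.

```java
import java.nio.charset.StandardCharsets;

class EmojiDecodeSketch {

    // The four UTF-8 bytes quoted in the question: F0 9F 98 8A.
    static final byte[] UTF8_BYTES = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x8A};

    // Decodes UTF-8 bytes into a Java String (which is UTF-16 internally).
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decode(UTF8_BYTES);
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (Unicode code point)
        System.out.printf("U+%04X%n", s.codePointAt(0));     // U+1F60A
    }
}
```

So length() counts UTF-16 code units, while codePointCount counts what the question calls "symbols".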

After comments

Java, using a Stream:

public static String escapeToAsciiHTML(String s) {
    StringBuilder sb = new StringBuilder();
    s.codePoints().forEach(cp -> {
        if (cp < 128) {
            sb.append((char) cp);
        } else {
            sb.append("&#").append(cp).append(";");
        }
    });
    return sb.toString();
}
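
The method above leaves ASCII characters such as & and < untouched, which is not enough if the result goes into an XML document. The following is only a sketch of one possible extension, not part of the original answer: the class name and the escapeForXml helper are hypothetical, and the sketch does not filter control characters that XML 1.0 forbids.

```java
class XmlNumericEscapeSketch {

    // Hypothetical helper: escapes the five XML markup characters by name and
    // every non-ASCII code point as a decimal numeric character reference.
    static String escapeForXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            switch (cp) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (cp < 128) {
                        sb.append((char) cp);
                    } else {
                        // One reference per code point, not per surrogate char.
                        sb.append("&#").append(cp).append(';');
                    }
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\uD83D\uDE0A" is the emoji from the question (U+1F60A).
        System.out.println(escapeForXml("a < b & \uD83D\uDE0A")); // a &lt; b &amp; &#128522;
    }
}
```

Because the loop walks code points, the emoji becomes the single reference &#128522; rather than the two surrogate references &#55357;&#56842; seen in the question's output.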

Answer 2

Score: 3

StringEscapeUtils is broken. Don't use it. Try NumericEntityEscaper.

Or, better yet, as Apache Commons libraries tend to be bad API** and broken*** anyway, use Guava*'s XmlEscapers.

Java is Unicode, yes, but 'char' is a lie. 'char' does not represent characters; it represents a single, unsigned 16-bit number. The actual method to get a character out of, say, a j.l.String object isn't charAt, which is a misnomer; it's codePointAt, and friends.

This (char being a fakeout) normally doesn't matter; most actual characters fit in the 16-bit char type. But when they don't, this matters, and that emoji doesn't fit. In the Unicode model used by Java and the char type, you then get 2 char values (representing a single Unicode character). This pair is called a 'surrogate pair'.
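
As an illustration of where the two numbers in the question's output (55357 and 56842) come from, here is a small sketch of the UTF-16 surrogate arithmetic for U+1F60A. The class and the toSurrogates helper are hypothetical, and the helper assumes its argument is a supplementary code point (0x10000 or above); in real code Character.toChars does this for you.

```java
import java.util.Arrays;

class SurrogatePairSketch {

    // Splits a supplementary code point (>= 0x10000) into its UTF-16
    // high and low surrogates. No validation is done in this sketch.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F60A);
        System.out.println((int) pair[0]); // 55357
        System.out.println((int) pair[1]); // 56842
        // The JDK's own conversion yields the same pair.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F60A))); // true
        // And the pair combines back into the original code point.
        System.out.printf("U+%04X%n", Character.toCodePoint(pair[0], pair[1])); // U+1F60A
    }
}
```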

Note that the right methods tend to work in int (you need more than 16 bits to represent a single Unicode code point, after all, and int is the next size up).
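
A short sketch of those int-based methods, answering the "how do I get that single emoji symbol" part of the question. The class and method names are hypothetical, and firstSymbol assumes a non-empty string (offsetByCodePoints throws IndexOutOfBoundsException otherwise).

```java
class FirstSymbolSketch {

    // Returns the first code point of s as a String of one or two chars.
    // Assumes s is not empty.
    static String firstSymbol(String s) {
        int end = s.offsetByCodePoints(0, 1); // index just past the first code point
        return s.substring(0, end);
    }

    // Returns all code points of s as ints.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE0Aabc"; // the emoji from the question followed by "abc"
        String first = firstSymbol(s);
        System.out.println(first.length());            // 2
        System.out.println(first.codePointAt(0));      // 128522
        System.out.println(codePointsOf(s).length);    // 4
    }
}
```

The returned string still has length 2, but it holds exactly one code point, so it can be passed around as "the emoji".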

*) Guava has its own issues: by being aggressively not backwards compatible with itself, it tends to lead to dependency hell. It's a pick-your-poison kind of deal, unfortunately.

**) Utils-anything is usually a sign of bad API design; 'util' is almost meaningless as a term and usually implies you've broken the object-oriented model. The right model is of course to have an object representing the process of translating data in one form (say, a raw string) to another (say, a string that is escaped and can be dumped straight into an XML file). Such a thing would thus be called an 'escaper', and would live perhaps in a package named 'escapers' or 'text'. Later editions of Apache libraries, as well as Guava, fortunately 'fixed' this.

***) As this very example shows, these APIs often don't do what you want them to. Note that Apache is open source; if you want these APIs to be better, they accept pull requests.
