Java -- How to unescape unicode private-use characters?

Question

I have a program that reads a list of escaped Unicode strings (u/XXXX) and converts each one into the character it encodes, writing the result to both the terminal and a text file.

I'm using org.apache.commons.text.StringEscapeUtils.unescapeJava(String), from the Apache Commons Text library, to unescape the escaped Unicode code points.

I'm referring to these Unicode entries to get my private-use characters: https://jrgraphix.net/r/Unicode/E000-F8FF
(I prepend u/ to the hex digits shown there.)

Here's an example of what the output should look like:
If you paste that into a Ctrl+F box on the website above, you'll see that it points to E022.

Now, here is my question, and by extension the problem I am having:

It's not working. For some reason, it doesn't output the character itself; instead it outputs a generic question mark that does not represent the private-use character in question. If someone can help me with this, it'd be much appreciated.

So far, I have had no luck.

Answer 1

Score: 2

tl;dr

  • Use correct Java syntax in your input string for a Unicode hexadecimal: \uXXXX
  • If you have no font providing a glyph for that code point number, your OS indicates the lack by displaying an empty-box, question-mark, or some such fall-back replacement.

To get an officially sanctioned Red Heart:

org.apache.commons.text.StringEscapeUtils.unescapeJava( "\\" + "u2764" + "\\" + "uFE0F" )  // Simulating some textual input of Java-syntax escaped Unicode code point numbers in hexadecimal.

>❤️

Example code

You did not show your exact code. But your Question mentions u/XXXX which is incorrect. Correct syntax in Java for a Unicode hexadecimal is \uXXXX.

You can verify your hexadecimal literal by asking for the code point, as shown below.

Here is some example code.

System.out.println( "Demo of Private Use Area" );

String input = "\\" + "uE022";
String output = org.apache.commons.text.StringEscapeUtils.unescapeJava( input );
int codePoint = output.codePointAt( 0 );
String name = Character.getName( codePoint );

Dump to console.

System.out.println( "input = " + input );
System.out.println( "output = " + output );
System.out.println( "codePoint = " + codePoint + " (we expect 57378 for \\uE022)." );
System.out.println( "Name = " + name );

When run:

Demo of Private Use Area
input = \uE022
output = 
codePoint = 57378 (we expect 57378 for \uE022).
Name = PRIVATE USE AREA E022
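
If the input file really does contain text in the u/XXXX form the Question describes, unescapeJava will not recognize it. The sketch below is one possible way to decode that form directly with the standard library. The class name SlashEscapeDecoder, the method name decode, and the assumption that each escape is "u/" followed by exactly four hex digits are all hypothetical, inferred from the Question rather than from any real code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SlashEscapeDecoder {

    // Assumed input format (hypothetical): "u/" followed by exactly four hex digits.
    private static final Pattern SLASH_ESCAPE = Pattern.compile("u/([0-9A-Fa-f]{4})");

    // Replaces every u/XXXX occurrence with the UTF-16 code unit it names.
    static String decode(String text) {
        Matcher matcher = SLASH_ESCAPE.matcher(text);
        StringBuilder result = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the text between the previous match and this one unchanged.
            result.append(text, last, matcher.start());
            // Four hex digits always fit in a single char (0x0000 to 0xFFFF).
            result.append((char) Integer.parseInt(matcher.group(1), 16));
            last = matcher.end();
        }
        result.append(text, last, text.length());
        return result.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("u/E022");
        System.out.println("code point hex = " + Integer.toHexString(decoded.codePointAt(0)));
    }
}
```

Note that this naive pattern would also match ordinary text that happens to contain "u/" followed by four hex digits, so real input may need a stricter rule.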

Red heart emoji

If you really want a red heart, Unicode does define an emoji.

But accessing this emoji requires two code points. Unicode 1.1 in 1993 defined HEAVY BLACK HEART at decimal 10,084 (U+2764). Emoji 1.0 in 2015 then defined Red Heart as HEAVY BLACK HEART followed by VARIATION SELECTOR-16 at decimal 65,039 (U+FE0F).

See the red heart row of the Full Emoji List on the Unicode Consortium website. However, that row appears to me to be incorrect in that it fails to mention the required U+FE0F code point.

// HEAVY BLACK HEART + VARIATION SELECTOR-16 = Red Heart.
String input = "\\" + "u2764" + "\\" + "uFE0F";
String output = org.apache.commons.text.StringEscapeUtils.unescapeJava( input );

>❤️

Full example code:

System.out.println( "Demo of Red Heart" );

// HEAVY BLACK HEART + VARIATION SELECTOR-16 = Red Heart.
String input = "\\" + "u2764" + "\\" + "uFE0F";
String output = org.apache.commons.text.StringEscapeUtils.unescapeJava( input );

System.out.println( "input = " + input );
System.out.println( "output = " + output );

output.codePoints().forEachOrdered( ( int codePoint ) -> {
    String message =
            "Code point decimal " + codePoint
                    + " = hex " + Integer.toHexString( codePoint )
                    + " = name " + Character.getName( codePoint );
    System.out.println( message );
} );

When run:

Demo of Red Heart
input = \u2764\uFE0F
output = ❤️
Code point decimal 10084 = hex 2764 = name HEAVY BLACK HEART
Code point decimal 65039 = hex fe0f = name VARIATION SELECTOR-16

A PUA has no officially assigned characters

By definition, a Private Use Area (PUA) has no characters assigned by the Unicode Consortium. All the code point numbers in that range are promised by the Unicode Consortium to never be officially assigned any character.

This leaves all of us free to create a font that assigns any glyph we like to any of those code points.

You may want to create a font with a red heart cartoon at code point E022. Meanwhile, I may choose to make a font that has a drawing of a cockatiel at that same code point. And some guy named Bob creates his own font with a picture of a Microlino car at E022. You, Bob, and I are all happy knowing that our custom fonts will never be stomped on by a future officially sanctioned character at that code point.

If Alice likes your red heart, and wants to use it, she needs to obtain a copy of your font. She needs to install that font on her computer. And she needs to either:

  • Ensure that no other font provides a glyph at code point E022, or,
  • Use an app that allows her to specify the use of your font rather than any other font that may also coincidentally provide a glyph at E022.

👉 If Alice has installed no fonts at all with a glyph at E022, then the operating system of her computer will fall back to displaying some kind of substitute glyph such as an empty box or question mark or nothing to indicate the lack of a glyph.
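
To confirm that a substitute glyph is only a display matter and that the character itself is intact, one option is to write the text to a file and read it back, comparing code points rather than looking at the rendered result. The following is a minimal sketch under assumptions: the class name RoundTripCheck and its method are hypothetical, a temporary file stands in for the Question's text file, and UTF-8 is chosen explicitly for both the write and the read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class RoundTripCheck {

    // Writes the text as UTF-8, reads it back, and returns the first code point found.
    static int firstCodePointAfterRoundTrip(String text) {
        try {
            Path file = Files.createTempFile("pua-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                String readBack = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                return readBack.codePointAt(0);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int codePoint = firstCodePointAfterRoundTrip("\uE022");
        System.out.println("read back hex = " + Integer.toHexString(codePoint));
    }
}
```

If the code point read back is still E022, the file holds the right character, and whatever a viewer shows in its place depends on the fonts available to that viewer.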

The three PUAs defined in Unicode have turned out to be rather popular. People use them to create fonts for characters that do not meet the Unicode Consortium's requirements and so will never be considered for inclusion in Unicode. Examples include fictional languages such as Klingon in Star Trek or the elves' languages from novels.

This popularity has prompted volunteers outside the Unicode Consortium to devise a public registry of the PUA code points, in an attempt to avoid conflicts among various fonts over particular code points.
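
Whether a given code point lies in one of the three PUAs can be checked from Java itself, without consulting a chart. The sketch below relies on Character.getType and the Character.PRIVATE_USE constant from the standard library; the class name PrivateUseCheck, the helper isPrivateUse, and the sample values are illustrative choices only. The three ranges are U+E000 to U+F8FF, U+F0000 to U+FFFFD, and U+100000 to U+10FFFD.

```java
final class PrivateUseCheck {

    // True when the JDK's Unicode data classifies the code point as general category Co (private use).
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        // One sample from each PUA, plus HEAVY BLACK HEART as an assigned character for contrast.
        int[] samples = { 0xE022, 0x2764, 0xF0000, 0x10FFFD };
        for (int codePoint : samples) {
            System.out.printf("U+%04X private use? %b%n", codePoint, isPrivateUse(codePoint));
        }
    }
}
```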

huangapple
  • Posted on February 27, 2023, 01:02:29
  • When reposting, be sure to keep the link to this article: https://go.coder-hub.com/75573608.html