How to identify and filter out Unicode characters in the range 40000–DFFFF from a String in Java

Question

Is there a way in Java to find out whether a character falls within plane 4 to plane 13 of Unicode? According to https://en.wikipedia.org/wiki/Unicode_block, planes 4 to 13 cover the range 40000–DFFFF.

In the code below I assign a hex value to a char, but when I cast it back to int I do not get the same value. The decimal form of DFFFF is 917503, yet casting the char back to an int gives 65535. I am not sure why the value changes when the char is cast back to int. Can someone please give me some idea? According to Unicode, the range 40000–DFFFF is currently unassigned. Is that the reason for this strange behaviour?

The actual use case I want to implement is to filter out any characters in the input string that fall within the range 40000–DFFFF.

Is there an open-source library that does this out of the box? Any help would be appreciated.

    int intHex = 0xDFFFF;
    char c = (char)intHex;
    System.out.println((int)c);

Thanks

Answer 1

Score: 1

See the documentation of the Character class:

> The char data type (and therefore the value that a Character object
> encapsulates) are based on the original Unicode specification, which
> defined characters as fixed-width 16-bit entities. The Unicode
> Standard has since been changed to allow for characters whose
> representation requires more than 16 bits. The range of legal code
> points is now U+0000 to U+10FFFF, known as Unicode scalar value.
> (Refer to the definition of the U+n notation in the Unicode Standard.)
>
> The set of characters from U+0000 to U+FFFF is sometimes referred to
> as the Basic Multilingual Plane (BMP). Characters whose code points
> are greater than U+FFFF are called supplementary characters. The Java
> platform uses the UTF-16 representation in char arrays and in the
> String and StringBuffer classes. In this representation, supplementary
> characters are represented as a pair of char values, the first from
> the high-surrogates range, (\uD800-\uDBFF), the second from the
> low-surrogates range (\uDC00-\uDFFF).
>
> A char value, therefore, represents Basic Multilingual Plane (BMP)
> code points, including the surrogate code points, or code units of the
> UTF-16 encoding. An int value represents all Unicode code points,
> including supplementary code points. The lower (least significant) 21
> bits of int are used to represent Unicode code points and the upper
> (most significant) 11 bits must be zero. Unless otherwise specified,
> the behavior with respect to supplementary characters and surrogate
> char values is as follows:
>
> - The methods that only accept a char value cannot support supplementary
>   characters. They treat char values from the surrogate ranges as undefined
>   characters. For example, Character.isLetter('\uD840') returns false, even
>   though this specific value if followed by any low-surrogate value in a
>   string would represent a letter.
> - The methods that accept an int value support all Unicode characters,
>   including supplementary characters. For example,
>   Character.isLetter(0x2F81A) returns true because the code point value
>   represents a letter (a CJK ideograph).

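This means that a single char is not enough, because it has only 16 bits. The cast (char) 0xDFFFF keeps just the low 16 bits, 0xFFFF, which is why you get 65535 back; it has nothing to do with the range being unassigned. You need an int to hold larger values: 65535 = 2^16 − 1 is the largest value an unsigned 16-bit integer type can hold.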
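The truncation can be reproduced in a few lines. The following is only a minimal sketch: the class name `CharTruncationDemo` and the helper `castThroughChar` are made up for illustration, while `Character.toChars`, `Character.isHighSurrogate` and `Character.toCodePoint` are standard JDK methods.

```java
class CharTruncationDemo {

    /** Narrowing an int to char keeps only the low 16 bits of the value. */
    static int castThroughChar(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int codePoint = 0xDFFFF;

        // 0xDFFFF loses its upper bits and becomes 0xFFFF.
        System.out.println(castThroughChar(codePoint)); // 65535

        // The lossless route: UTF-16 encodes the code point as two chars.
        char[] units = Character.toChars(codePoint);
        System.out.println(units.length); // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true

        // The surrogate pair decodes back to the original code point.
        System.out.println(Character.toCodePoint(units[0], units[1]) == codePoint); // true
    }
}
```

Any code point above U+FFFF behaves the same way, whether or not a character is assigned to it.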

Answer 2

Score: 1

Welcome to the world of UTF-16, a major accident in the IT industry.

Many operating systems and programming languages defined their character type to be 16 bits long when it was already obvious that 16 bits are far from sufficient to represent all the letters used on Earth. Java is one of them.

Unicode has evolved in the meantime. What is colloquially known as a letter is called a code point, and a code point can need up to 21 bits, so in practice it is stored in a 32-bit integer. For compatibility reasons, Java could not change the char type from 16 bits to 32 bits. Instead, char was left at 16 bits and redefined to be a UTF-16 code unit (instead of a direct UCS-2 representation).

The short story is: a code point such as U+DFFFF requires more than 16 bits and cannot be represented in a single char. So switch from char to code points, which are represented as int in Java:

    final int length = s.length();
    for (int offset = 0; offset < length; ) {
        final int codepoint = s.codePointAt(offset);
        if (codepoint >= 0x40000 && codepoint <= 0xdffff) {
            // do something with the codepoint
        }
        offset += Character.charCount(codepoint);
    }
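For the filtering use case from the question, the same loop could be completed roughly as follows. This is a sketch under assumptions, not a tested library routine: the class name `Planes4To13Filter`, the constants and the method `strip` are hypothetical, and the "do something" step is assumed to mean "skip the code point".

```java
class Planes4To13Filter {

    // Bounds taken from the question: planes 4 to 13.
    static final int FIRST = 0x40000;
    static final int LAST = 0xDFFFF;

    /** Returns a copy of s without any code point in FIRST..LAST. */
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int offset = 0; offset < s.length(); ) {
            int codePoint = s.codePointAt(offset);
            if (codePoint < FIRST || codePoint > LAST) {
                // Re-encodes the code point as one or two chars.
                out.appendCodePoint(codePoint);
            }
            // Advance by 1 for BMP characters, by 2 for surrogate pairs.
            offset += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String unwanted = new String(Character.toChars(0x40000));
        System.out.println(strip("a" + unwanted + "b")); // ab
    }
}
```

Supplementary characters outside the range, such as emoji in plane 1, pass through unchanged because they are appended as whole code points rather than as individual chars.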

Answer 3

Score: 0

    int intHex = 0xDFFFF; // This is called a Unicode code point.

    char[] chars = Character.toChars(intHex); // In this case a pair.
    String s = new String(chars);
    int[] codePoints = new int[] {intHex};
    String t = new String(codePoints, 0, codePoints.length);

Java chars are UTF-16 code units. Like UTF-8, UTF-16 guarantees that the encoded sequence for one code point (here a surrogate pair of chars) cannot be mistaken for any other character.

It is best to work with String and extract the code points.

    // Keep only code points outside 40000–DFFFF (planes 4 to 13).
    int[] codePoints = s.codePoints().filter(cp -> cp < 0x40000 || cp > 0xDFFFF).toArray();
    String t = new String(codePoints, 0, codePoints.length);
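The question also asks how to tell whether a character lies in planes 4 to 13. Since each plane spans 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below combines that check with the stream approach; the class `PlaneCheck` and its methods are hypothetical names, and `codePoints()` requires Java 8 or later.

```java
class PlaneCheck {

    /** Plane number (0 to 16) of a valid Unicode code point. */
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    /** True for code points in 40000–DFFFF. */
    static boolean isInPlanes4To13(int codePoint) {
        int plane = planeOf(codePoint);
        return plane >= 4 && plane <= 13;
    }

    /** Stream-based filter that drops every code point in planes 4 to 13. */
    static String strip(String s) {
        int[] kept = s.codePoints().filter(cp -> !isInPlanes4To13(cp)).toArray();
        return new String(kept, 0, kept.length);
    }

    public static void main(String[] args) {
        System.out.println(planeOf(0xDFFFF));         // 13
        System.out.println(isInPlanes4To13(0x2F81A)); // false
    }
}
```

Expressing the test in terms of planes keeps the intent visible and makes it easy to adjust if a different set of planes has to be excluded later.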

huangapple
  • Posted on September 7, 2020, 22:43:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63779808.html