Why does Java not recognize these white spaces?

Question

There are 25 white space characters. The code below calls Character.isWhitespace(char) on four of them, and it reports that none of the four is considered white space in Java. Why?

public class Main {
    public static void main(String... args) {
        char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char space : whiteSpaces) {
            // Character.isWhitespace returns false for every one of these characters.
            System.out.println("[" + space + "] is a white space in Java: " + Character.isWhitespace(space));
        }
    }
}

Reference: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Character.html#isWhitespace(char)

Answer 1

Score: 3

Why? Because that is how the method is specified. The javadoc for isWhitespace lists the characters that it matches. The four that you identified are not in the list.

We can't tell you why it was defined that way. However, one implication of what the javadoc says is that '\u00A0', '\u2007' and '\u202F' are excluded because they are non-breaking whitespace characters.

'\u0085' (NEL, NEXT LINE) is an interesting case. According to the Unicode code tables, it is NOT a member of the general categories SPACE_SEPARATOR, LINE_SEPARATOR or PARAGRAPH_SEPARATOR. (It shows up in the CONTROL category.)
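The category claim can be checked with Character.getType. The following is a minimal sketch: the class name CategoryDemo and the helper categoryName are made up for illustration, and the reported categories depend on the Unicode version that your Java release implements.

```java
public class CategoryDemo {
    // Hypothetical helper: maps the general categories discussed here to readable names.
    static String categoryName(char c) {
        switch (Character.getType(c)) {
            case Character.SPACE_SEPARATOR:     return "SPACE_SEPARATOR";
            case Character.LINE_SEPARATOR:      return "LINE_SEPARATOR";
            case Character.PARAGRAPH_SEPARATOR: return "PARAGRAPH_SEPARATOR";
            case Character.CONTROL:             return "CONTROL";
            default:                            return "OTHER";
        }
    }

    public static void main(String[] args) {
        char[] candidates = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char c : candidates) {
            // Print the code point in hex next to its general category.
            System.out.printf("U+%04X -> %s%n", (int) c, categoryName(c));
        }
    }
}
```

On the releases discussed here, NEL should come out as CONTROL and the three non-breaking spaces as SPACE_SEPARATOR.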

If you want a method that recognises all Unicode space characters (i.e. characters in SPACE_SEPARATOR, LINE_SEPARATOR or PARAGRAPH_SEPARATOR), you should use isSpaceChar instead of isWhitespace. Note that isSpaceChar still returns false for '\u0085', because NEL is in none of those three categories.

Note that the Unicode spec is not fixed. The categorization of the codes, and indeed the definition of "white space", has evolved over time. Each Java version implements a specific version of the Unicode spec that was current at the time it was released. For example:

  • Java 8 implements Unicode 6.2
  • Java 11 implements Unicode 10.0.0
  • Java 13 implements Unicode 12.1

The details are in the javadoc for the Character class for each Java version. Note that a given Java release is NOT patched to track subsequent Unicode releases.


The bottom line is that "white space" is a rather slippery concept. If you want a method that implements a specific meaning, you may need to implement it yourself.
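As a rough illustration of "implement it yourself", here is one possible sketch. It assumes that the meaning you want is the Unicode White_Space property, which in current Unicode versions covers the three separator categories plus U+0009 to U+000D and U+0085. The class and method names are hypothetical, and the separator categories themselves still follow the Unicode version of the running Java release.

```java
public class UnicodeWhiteSpace {
    // Hypothetical helper: approximates the Unicode White_Space property.
    static boolean isUnicodeWhiteSpace(int codePoint) {
        return Character.isSpaceChar(codePoint)            // SPACE_, LINE_ and PARAGRAPH_SEPARATOR
                || (codePoint >= 0x09 && codePoint <= 0x0D) // TAB, LF, VT, FF, CR
                || codePoint == 0x85;                       // NEL (NEXT LINE)
    }

    public static void main(String[] args) {
        char[] candidates = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char c : candidates) {
            System.out.printf("U+%04X isWhitespace=%b isUnicodeWhiteSpace=%b%n",
                    (int) c, Character.isWhitespace(c), isUnicodeWhiteSpace(c));
        }
    }
}
```

Unlike Character.isWhitespace, this sketch accepts the non-breaking spaces and NEL, and rejects the separators U+001C to U+001F.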

Answer 2

Score: 3

If you read the documentation, i.e. the javadoc of Character.isWhitespace(char), it says:

> Determines if the specified character is white space according to Java. A character is a Java whitespace character if and only if it satisfies one of the following criteria:
>
> - It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
> - It is '\t', U+0009 HORIZONTAL TABULATION.
> - It is '\n', U+000A LINE FEED.
> - It is '\u000B', U+000B VERTICAL TABULATION.
> - It is '\f', U+000C FORM FEED.
> - It is '\r', U+000D CARRIAGE RETURN.
> - It is '\u001C', U+001C FILE SEPARATOR.
> - It is '\u001D', U+001D GROUP SEPARATOR.
> - It is '\u001E', U+001E RECORD SEPARATOR.
> - It is '\u001F', U+001F UNIT SEPARATOR.

Three of the four characters you listed are explicitly excluded because they are non-breaking spaces.

As for U+0085 NEXT LINE (NEL), it is not a Unicode space character, and it is not considered a whitespace character by Java, as you can well see in that javadoc.
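The quoted criteria can be written out as code and compared against the real method. This is only a sketch of the documented rule, not the JDK's actual implementation. The names JavadocRuleCheck and javadocRule are invented for illustration.

```java
public class JavadocRuleCheck {
    // Hypothetical re-statement of the criteria quoted from the javadoc.
    static boolean javadocRule(char c) {
        boolean nonBreaking = c == '\u00A0' || c == '\u2007' || c == '\u202F';
        if (Character.isSpaceChar(c) && !nonBreaking) {
            return true;                        // Unicode space character, not non-breaking
        }
        return (c >= 0x09 && c <= 0x0D)         // TAB, LF, VT, FF, CR
                || (c >= 0x1C && c <= 0x1F);    // FILE, GROUP, RECORD, UNIT SEPARATOR
    }

    public static void main(String[] args) {
        int mismatches = 0;
        // Compare the rule with Character.isWhitespace for every char value.
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            char c = (char) i;
            if (javadocRule(c) != Character.isWhitespace(c)) {
                mismatches++;
            }
        }
        System.out.println("Mismatches against Character.isWhitespace: " + mismatches);
    }
}
```

If the javadoc describes the method exactly, the reported mismatch count should be zero.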

Answer 3

Score: 1

Java doesn't seem to expose the Unicode whitespace list anywhere in the Character class.

In Java, isWhitespace is specifically defined as one of these:

  • It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
  • It is '\t', U+0009 HORIZONTAL TABULATION.
  • It is '\n', U+000A LINE FEED.
  • It is '\u000B', U+000B VERTICAL TABULATION.
  • It is '\f', U+000C FORM FEED.
  • It is '\r', U+000D CARRIAGE RETURN.
  • It is '\u001C', U+001C FILE SEPARATOR.
  • It is '\u001D', U+001D GROUP SEPARATOR.
  • It is '\u001E', U+001E RECORD SEPARATOR.
  • It is '\u001F', U+001F UNIT SEPARATOR.

Java also exposes the Unicode space characters (the three separator categories) via Character.isSpaceChar(), but that is not the same as the Unicode whitespace list: the two lists differ slightly.

char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
for (char space : whiteSpaces) {
    // Compare Java's isWhitespace with the Unicode space categories (isSpaceChar).
    System.out.println("[" + space + "] is a white space in Java: " + Character.isWhitespace(space) + " Unicode: " + Character.isSpaceChar(space));
}

Output:

[…] is a white space in Java: false Unicode: false
[ ] is a white space in Java: false Unicode: true
[ ] is a white space in Java: false Unicode: true
[ ] is a white space in Java: false Unicode: true

If it's important for your application to match the Unicode spec instead of the Java spec, just define the check yourself.
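One way to define such a check without maintaining a list by hand is the regex binary property \p{IsWhite_Space}, which the java.util.regex.Pattern javadoc lists among its supported Unicode binary properties. The sketch below assumes that property is the definition you want. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class UnicodeWhiteSpaceRegex {
    // Matches exactly one character that has the Unicode White_Space property.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    // Hypothetical helper: true if the regex engine treats c as Unicode white space.
    static boolean isUnicodeWhiteSpace(char c) {
        return WHITE_SPACE.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char space : whiteSpaces) {
            System.out.printf("U+%04X Java: %b Unicode White_Space: %b%n",
                    (int) space, Character.isWhitespace(space), isUnicodeWhiteSpace(space));
        }
    }
}
```

Creating a matcher per character is convenient for a demonstration. For bulk text it is likely cheaper to run the pattern over the whole string.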

huangapple
  • Published on August 5, 2020, 13:41:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63259064.html