Why does Java not recognize these white spaces?
Question
Unicode defines 25 white space characters. The code below uses `Character.isWhitespace(char)` to show that four of those 25 are not considered white space in Java. Why?

    public class Main {
        public static void main(String... args) {
            char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
            for (char space : whiteSpaces) {
                // None of these characters is reported as white space by Java.
                System.out.println("[" + space + "] is a white space in Java:" + Character.isWhitespace(space));
            }
        }
    }

Reference: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Character.html#isWhitespace(char)
Answer 1
Score: 3
Why? Because that is how the method is specified. The javadoc for `isWhitespace` lists the characters that it matches. The four that you identified are not in the list.
We can't tell you why it was defined that way. However, one implication of what the javadoc says is that `'\u00A0'`, `'\u2007'` and `'\u202F'` are excluded because they are non-breaking space characters.
`'\u0085'` (NEL) is an interesting case. According to the Unicode code tables, it is NOT a member of the general categories `SPACE_SEPARATOR`, `LINE_SEPARATOR` or `PARAGRAPH_SEPARATOR`. (It shows up in the `CONTROL` category.)
If you want a method that recognises all Unicode space characters (i.e. characters in `SPACE_SEPARATOR`, `LINE_SEPARATOR` or `PARAGRAPH_SEPARATOR`), you should use `isSpaceChar` (see its javadoc) instead of `isWhitespace`.
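The category claims above can be checked directly. The following is a minimal sketch (the class name `SpaceCategories` and the helper `categoryName` are illustrative and not part of the original answer) that prints each character's general category via `Character.getType` next to the results of both methods:

```java
class SpaceCategories {
    // Hypothetical helper: maps the general-category constants that matter
    // here to readable names; everything else is reported as "OTHER".
    static String categoryName(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.SPACE_SEPARATOR:     return "SPACE_SEPARATOR";
            case Character.LINE_SEPARATOR:      return "LINE_SEPARATOR";
            case Character.PARAGRAPH_SEPARATOR: return "PARAGRAPH_SEPARATOR";
            case Character.CONTROL:             return "CONTROL";
            default:                            return "OTHER";
        }
    }

    public static void main(String... args) {
        char[] candidates = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char c : candidates) {
            // One line per character: code point, category, and both checks.
            System.out.printf("U+%04X %-16s isWhitespace=%b isSpaceChar=%b%n",
                    (int) c, categoryName(c),
                    Character.isWhitespace(c), Character.isSpaceChar(c));
        }
    }
}
```

The category reported for a given character depends on the Unicode version bundled with the Java runtime in use.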
Note that the Unicode spec is not a constant thing. The categorization of the codes, and indeed the definition of "white space", have evolved over time. Each Java version implements the specific version of the Unicode spec that was current when that Java version was released. For example:
- Java 8 implements Unicode 6.2
- Java 11 implements Unicode 10.0.0
- Java 13 implements Unicode 12.1
The details are in the javadoc for the `Character` class for each Java version. Note that a given Java release is NOT patched to track subsequent Unicode releases.
The bottom line is that "white space" is a rather slippery concept. If you want a method that implements a specific meaning, you may need to implement it yourself.
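As one possible interpretation, the sketch below treats a character as white space if it is in one of the three separator categories or is one of the control characters U+0009 to U+000D or U+0085. That set is an assumption chosen to resemble the Unicode `White_Space` property. The class and method names are hypothetical, and the result still depends on the Unicode version of the Java runtime.

```java
final class UnicodeSpaces {
    private UnicodeSpaces() {}

    // Assumed definition (not from the javadoc): separator-category
    // characters, including the non-breaking ones that
    // Character.isWhitespace rejects, plus TAB, LF, VT, FF, CR and NEL.
    static boolean isUnicodeWhiteSpace(int codePoint) {
        return Character.isSpaceChar(codePoint)
                || (codePoint >= 0x0009 && codePoint <= 0x000D)
                || codePoint == 0x0085;
    }

    public static void main(String... args) {
        char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char c : whiteSpaces) {
            System.out.printf("U+%04X -> %b%n", (int) c, isUnicodeWhiteSpace(c));
        }
    }
}
```

Unlike `Character.isWhitespace`, this sketch does not accept the separators U+001C to U+001F; add them if your application needs them.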
Answer 2
Score: 3
If you read the documentation, i.e. the javadoc of `Character.isWhitespace(char)`, it says:
> Determines if the specified character is white space according to Java. A character is a Java whitespace character if and only if it satisfies one of the following criteria:
>
> - It is a Unicode space character (`SPACE_SEPARATOR`, `LINE_SEPARATOR`, or `PARAGRAPH_SEPARATOR`) but is not also a non-breaking space (`'\u00A0'`, `'\u2007'`, `'\u202F'`).
> - It is `'\t'`, U+0009 HORIZONTAL TABULATION.
> - It is `'\n'`, U+000A LINE FEED.
> - It is `'\u000B'`, U+000B VERTICAL TABULATION.
> - It is `'\f'`, U+000C FORM FEED.
> - It is `'\r'`, U+000D CARRIAGE RETURN.
> - It is `'\u001C'`, U+001C FILE SEPARATOR.
> - It is `'\u001D'`, U+001D GROUP SEPARATOR.
> - It is `'\u001E'`, U+001E RECORD SEPARATOR.
> - It is `'\u001F'`, U+001F UNIT SEPARATOR.
Three of the four characters you listed are explicitly excluded because they are non-breaking spaces.
As for U+0085 NEXT LINE (NEL), it is not a Unicode space character, and it is not considered a whitespace character by Java, as you can see in that javadoc.
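The quoted criteria can also be written out as code. The following is a rough sketch, not the JDK's actual implementation: the class `WhitespaceRule` and the method `matchesJavadocRule` are hypothetical names, and the method simply restates the javadoc rule using `Character.getType`. The `main` method counts how many code points disagree with `Character.isWhitespace` on the runtime in use.

```java
class WhitespaceRule {
    // Hypothetical restatement of the javadoc criteria quoted above.
    static boolean matchesJavadocRule(int cp) {
        int type = Character.getType(cp);
        boolean unicodeSpace = type == Character.SPACE_SEPARATOR
                || type == Character.LINE_SEPARATOR
                || type == Character.PARAGRAPH_SEPARATOR;
        // The three non-breaking spaces that the javadoc excludes.
        boolean nonBreaking = cp == 0x00A0 || cp == 0x2007 || cp == 0x202F;
        // The control characters listed individually: U+0009..U+000D
        // and U+001C..U+001F.
        boolean listedControl = (cp >= 0x0009 && cp <= 0x000D)
                || (cp >= 0x001C && cp <= 0x001F);
        return (unicodeSpace && !nonBreaking) || listedControl;
    }

    public static void main(String... args) {
        int mismatches = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (matchesJavadocRule(cp) != Character.isWhitespace(cp)) {
                mismatches++;
            }
        }
        System.out.println("Code points where the sketch disagrees with isWhitespace: " + mismatches);
    }
}
```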
Answer 3
Score: 1
Java doesn't seem to expose the Unicode white space list anywhere in the `Character` class.
In Java, `isWhitespace` is specifically defined to return true for a character that meets one of these criteria:
- It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
- It is '\t', U+0009 HORIZONTAL TABULATION.
- It is '\n', U+000A LINE FEED.
- It is '\u000B', U+000B VERTICAL TABULATION.
- It is '\f', U+000C FORM FEED.
- It is '\r', U+000D CARRIAGE RETURN.
- It is '\u001C', U+001C FILE SEPARATOR.
- It is '\u001D', U+001D GROUP SEPARATOR.
- It is '\u001E', U+001E RECORD SEPARATOR.
- It is '\u001F', U+001F UNIT SEPARATOR.
Java also exposes Unicode space characters, though not the full Unicode white space list, via `Character.isSpaceChar()`. That is a slightly different list.

    char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
    for (char space : whiteSpaces) {
        // Compare Java's white space check with its Unicode space check.
        System.out.println("[" + space + "] is a white space in Java: " + Character.isWhitespace(space) + " Unicode: " + Character.isSpaceChar(space));
    }

Output:
[
] is a white space in Java: false Unicode: false
[ ] is a white space in Java: false Unicode: true
[ ] is a white space in Java: false Unicode: true
[ ] is a white space in Java: false Unicode: true
If it's important for your application to match the Unicode spec instead of the Java spec, just define the check yourself.
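One way to define it without maintaining a list by hand, assuming Java 7 or later, is the `White_Space` binary property that `java.util.regex.Pattern` documents as `\p{IsWhite_Space}`. The sketch below is only an illustration; the class and method names are made up for this example, and the set of matching characters follows the Unicode data of the runtime in use.

```java
import java.util.regex.Pattern;

class WhiteSpaceProperty {
    // \p{IsWhite_Space} is the java.util.regex syntax for the Unicode
    // White_Space binary property (assumes Java 7 or later).
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    // Hypothetical helper: true if the single character has the property.
    static boolean isUnicodeWhiteSpace(char c) {
        return WHITE_SPACE.matcher(String.valueOf(c)).matches();
    }

    public static void main(String... args) {
        char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char space : whiteSpaces) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b White_Space=%b%n",
                    (int) space, Character.isWhitespace(space),
                    Character.isSpaceChar(space), isUnicodeWhiteSpace(space));
        }
    }
}
```

Matching through a regular expression is slower than a direct range check, so this fits occasional validation better than tight loops.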