Java regular expression to match valid Java identifiers
Question
I need to create a regular expression able to find and get valid identifiers in Java code like this:
```java
int a, b, c;
float d, e;
a = b = 5;
c = 6;
if ( a > b)
{
    c = a - b;
    e = d - 2.0;
}
else
{
    d = e + 6.0;
    b = a + c;
}
```
I have tried combining several regexes into a single one, but how can I build a pattern that excludes reserved words?
I tried this regex: `^(((&&|<=|>=|<|>|!=|==|&|!)|([-+=]{1,2})|([.!?)}{;,(-]))|(else|if|float|int)|(\d[\d.]))`
but it does not work as expected.
In the following picture, how should I match the identifiers?
Answer 1
Score: 4
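A valid Java identifier satisfies the following conditions:

- it has at least one character
- the first character MUST be a letter `[a-zA-Z]`, an underscore `_`, or a dollar sign `$`
- the remaining characters MAY be letters, digits, underscores, or dollar signs
- reserved words MUST NOT be used as identifiers
- _Update_: a single underscore `_` is a keyword since Java 9 ([Java 9 documentation](https://docs.oracle.com/javase/9/whatsnew/toc.htm#JSNEW-GUID-825576B5-203C-4C8D-85E5-FFDA4CA0B346))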
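
For a single token, the conditions above can also be checked without a regular expression. The following is only a minimal sketch: the class name `IdentifierCheck`, its helper methods, and the short `RESERVED_SAMPLE` set are assumptions made for illustration, and the set is deliberately not the full Java keyword list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class IdentifierCheck {
    // Hypothetical, deliberately short reserved-word sample (not the full Java list).
    static final Set<String> RESERVED_SAMPLE =
            new HashSet<>(Arrays.asList("if", "else", "for", "float", "int", "_"));

    // First character: ASCII letter, underscore, or dollar sign.
    static boolean isStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == '$';
    }

    // Remaining characters: anything allowed at the start, plus digits.
    static boolean isPart(char c) {
        return isStart(c) || (c >= '0' && c <= '9');
    }

    static boolean isValidIdentifier(String token) {
        // Conditions 1 and 2: non-empty and a valid first character.
        if (token == null || token.isEmpty() || !isStart(token.charAt(0))) {
            return false;
        }
        // Condition 3: every remaining character is a letter, digit, '_' or '$'.
        for (int i = 1; i < token.length(); i++) {
            if (!isPart(token.charAt(i))) {
                return false;
            }
        }
        // Conditions 4 and 5: the whole token is not a reserved word.
        return !RESERVED_SAMPLE.contains(token);
    }

    public static void main(String[] args) {
        for (String token : new String[] {"a", "$1", "_c0", "1a", "int", "_"}) {
            System.out.println(token + " -> " + isValidIdentifier(token));
        }
    }
}
```

The JDK methods `Character.isJavaIdentifierStart` and `Character.isJavaIdentifierPart` accept a wider set of characters (for example non-ASCII letters), so they could replace the two helpers if the ASCII-only rule from the list is too strict.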
A naive regexp to validate the first three conditions would be `(\b([A-Za-z_$][$\w]*)\b)`, but it does not filter out the reserved words.
To exclude the reserved words, a negative lookahead `(?!)` is needed to specify a group of tokens that must not match:
`\b(?!(_\b|if|else|for|float|int))([A-Za-z_$][$\w]*)`

- Group #1: `(?!(_\b|if|else|for|float|int))` excludes the list of the specified words
- Group #2: `([A-Za-z_$][$\w]*)` matches identifiers.
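
As a quick illustration, the short-list pattern above can be run against a fragment of the code from the question. This is a sketch only: the class name `LookaheadDemo` and the helper `find` are made up for the example, and the sample string is an abbreviated version of the question's snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // The short-list pattern from the answer, escaped for a Java string literal.
    static final Pattern SHORT_LIST =
            Pattern.compile("\\b(?!(_\\b|if|else|for|float|int))([A-Za-z_$][$\\w]*)");

    // Collects every match of the pattern in the given source text.
    static List<String> find(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = SHORT_LIST.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String code = "int a, b, c;\nfloat d, e;\nif ( a > b)\n{\nc = a - b;\n}\nelse\n{\nd = e + 6.0;\n}";
        // prints [a, b, c, d, e, a, b, c, a, b, d, e]
        System.out.println(find(code));
    }
}
```

The reserved words `int`, `float`, `if`, and `else` are skipped by the lookahead, and the numeric literal `6.0` never matches because the identifier group cannot start with a digit.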
However, the word boundary `\b` does not help with the dollar sign: `$` is not a word character, so there is no boundary between whitespace and a leading `$`, and this regular expression fails to match identifiers starting with `$`.
Also, we may want to exclude matches inside string and character literals (`"not_a_variable"`, `'c'`, `'\u65'`).
This can be done by replacing the word boundary `\b` with a positive lookbehind `(?<=)`, which matches a group before the main expression without including it in the result:
`(?<=[^$\w'"\\])(?!(_\b|if|else|for|float|int))([A-Za-z_$][$\w]*)`
[Online demo for a short list of reserved words](https://regexr.com/5h6rl)
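
The difference between the two variants can be seen side by side in a small sketch. The class name `BoundaryVsLookbehind`, the helper `find`, and the sample input are assumptions chosen for illustration; the two patterns are the short-list patterns quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryVsLookbehind {
    // Variant 1: word boundary in front of the identifier.
    static final Pattern WITH_BOUNDARY =
            Pattern.compile("\\b(?!(_\\b|if|else|for|float|int))([A-Za-z_$][$\\w]*)");

    // Variant 2: positive lookbehind for a character that is not '$', a word
    // character, a quote, or a backslash.
    static final Pattern WITH_LOOKBEHIND =
            Pattern.compile("(?<=[^$\\w'\"\\\\])(?!(_\\b|if|else|for|float|int))([A-Za-z_$][$\\w]*)");

    // Collects every match of the given pattern in the source text.
    static List<String> find(Pattern pattern, String code) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String code = "int $1, _c0, x$;\nint c = 'c';";
        // prints [_c0, x$, c, c]   ($1 is missed, the char literal is matched)
        System.out.println(find(WITH_BOUNDARY, code));
        // prints [$1, _c0, x$, c]  ($1 is found, the char literal is skipped)
        System.out.println(find(WITH_LOOKBEHIND, code));
    }
}
```

Because the lookbehind needs one preceding character, an identifier at the very start of the input is not matched by the second variant.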
Next, the full list of the Java reserved words can be collected into a single String of tokens separated with `|`.
A test class showing the final regular expression pattern and its use to detect Java identifiers is provided below.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class IdFinder {
    // Reserved words and literals; the last entry excludes a lone underscore.
    static final List<String> RESERVED = Arrays.asList(
        "abstract", "assert", "boolean", "break", "byte", "case", "catch", "char", "class", "const",
        "continue", "default", "double", "do", "else", "enum", "extends", "false", "final", "finally",
        "float", "for", "goto", "if", "implements", "import", "instanceof", "int", "interface", "long",
        "native", "new", "null", "package", "private", "protected", "public", "return", "short", "static",
        "strictfp", "super", "switch", "synchronized", "this", "throw", "throws", "transient", "true", "try",
        "void", "volatile", "while", "_\\b"
    );

    // All reserved words joined into one alternation for the negative lookahead.
    static final String JAVA_KEYWORDS = String.join("|", RESERVED);

    // Caveat: the keywords are matched as prefixes, so an identifier that merely
    // starts with a keyword (for example `integer`) is skipped as well.
    static final Pattern VALID_IDENTIFIERS = Pattern.compile(
        "(?<=[^$\\w'\"\\\\])(?!(" + JAVA_KEYWORDS + "))([A-Za-z_$][$\\w]*)");

    public static void main(String[] args) {
        System.out.println("ID pattern:\n" + VALID_IDENTIFIERS.pattern());

        String code = "public class Main {\n\tstatic int $1;\n\tprotected char _c0 = '\\u65';\n\tprivate long c1__$$;\n}";

        System.out.println("\nIdentifiers in the following code:\n=====\n" + code + "\n=====");

        // Matcher.results() requires Java 9 or later.
        VALID_IDENTIFIERS.matcher(code).results()
                         .map(MatchResult::group)
                         .forEach(System.out::println);
    }
}
```
Output:

```
ID pattern:
(?<=[^$\w'"\\])(?!(abstract|assert|boolean|break|byte|case|catch|char|class|const|continue|default|double|do|else|enum|extends|false|final|finally|float|for|goto|if|implements|import|instanceof|int|interface|long|native|new|null|package|private|protected|public|return|short|static|strictfp|super|switch|synchronized|this|throw|throws|transient|true|try|void|volatile|while|_\b))([A-Za-z_$][$\w]*)

Identifiers in the following code:
=====
public class Main {
	static int $1;
	protected char _c0 = '\u65';
	private long c1__$$;
}
=====
Main
$1
_c0
c1__$$
```
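
One limitation of the lookahead used here is that each reserved word is tested as a prefix, so names such as `integer` or `done$` are rejected as well; the positive lookbehind also prevents a match at the very start of the input. The sketch below shows one possible tightening, not the answer's original pattern: the keyword alternation is followed by `(?![$\w])` so that only whole tokens are rejected, and a negative lookbehind replaces the positive one. The class name `TightenedIdFinder`, the helper `find`, and the short `KEYWORDS_SAMPLE` list are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TightenedIdFinder {
    // Hypothetical short keyword sample; a full list could be joined the same way.
    static final String KEYWORDS_SAMPLE = "do|double|else|float|for|if|int";

    // Prefix-style lookahead: rejects anything that starts with a keyword.
    static final Pattern PREFIX_STYLE = Pattern.compile(
            "(?<=[^$\\w'\"\\\\])(?!(" + KEYWORDS_SAMPLE + "|_\\b))([A-Za-z_$][$\\w]*)");

    // Whole-token lookahead: a keyword is rejected only when no identifier
    // character follows it. The negative lookbehind also allows a match at
    // the very start of the input.
    static final Pattern WHOLE_TOKEN = Pattern.compile(
            "(?<![$\\w'\"\\\\])(?!(?:" + KEYWORDS_SAMPLE + "|_)(?![$\\w]))[A-Za-z_$][$\\w]*");

    // Collects every match of the given pattern in the source text.
    static List<String> find(Pattern pattern, String code) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String code = "integer = interval + done$;";
        // prints []
        System.out.println(find(PREFIX_STYLE, code));
        // prints [integer, interval, done$]
        System.out.println(find(WHOLE_TOKEN, code));
    }
}
```

Neither variant understands Java syntax: words inside multi-word string literals or comments are still reported, so a real tokenizer or parser is the safer choice when exact results matter.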