Java regular expression to match valid Java identifiers

Question

I need to create a regular expression able to find and get valid identifiers in Java code like this:

```java
int a, b, c;
float d, e;
a = b = 5;
c = 6;
if ( a > b)
{
    c = a - b;
    e = d - 2.0;
}
else
{
    d = e + 6.0;
    b = a + c;
}
```

I have tried to add multiple regexes in a single regex, but how can I build a pattern to exclude reserved words?

I tried this regex `^(((&&|<=|>=|<|>|!=|==|&|!)|([-+=]{1,2})|([.!?)}{;,(-]))|(else|if|float|int)|(\d[\d.]))` but it does not work as expected.

In the following picture, how should I match for identifiers?

Answer 1

Score: 4

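A valid Java identifier:

  1. has at least one character
  2. starts with a letter `[a-zA-Z]`, an underscore `_`, or a dollar sign `$` (the Java language also accepts other Unicode letters; the patterns here stick to ASCII)
  3. continues with letters, digits, underscores, or dollar signs
  4. is not a reserved word
  5. Update: is not a single underscore `_`, which has been a keyword since Java 9 ([Java 9 documentation](https://docs.oracle.com/javase/9/whatsnew/toc.htm#JSNEW-GUID-825576B5-203C-4C8D-85E5-FFDA4CA0B346))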
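
For a single candidate string, the rules above can be cross-checked against the JDK itself. The sketch below is not part of the original answer: it assumes the standard `javax.lang.model.SourceVersion` class (module `java.compiler`) is available, and the class name `IdentifierCheck` and the helper `isValidIdentifier` are made up for illustration. Note that `SourceVersion.isIdentifier` follows the language's Unicode-aware definition, so it accepts more than the ASCII ranges listed above.

```java
import javax.lang.model.SourceVersion;

public class IdentifierCheck {

    // True if the text has the shape of an identifier and is not a reserved
    // word for the latest source version known to the running JDK.
    static boolean isValidIdentifier(String text) {
        return SourceVersion.isIdentifier(text) && !SourceVersion.isKeyword(text);
    }

    public static void main(String[] args) {
        String[] candidates = {"$1", "_c0", "c1__$$", "integer", "int", "_", "2x"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isValidIdentifier(candidate));
        }
    }
}
```

This only validates one token at a time; it does not find identifiers inside a larger piece of source text, which is what the regular expressions below are for.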

A naive regexp that checks the first three conditions is `(\b([A-Za-z_$][$\w]*)\b)`, but it does not filter out the reserved words.
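
To see the gap concretely, here is a minimal sketch that runs the naive pattern over two lines taken from the question. The class name `NaiveIdDemo` and the helper `find` are made up for illustration.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class NaiveIdDemo {

    // The naive pattern: identifier shape only, no keyword filtering.
    static final Pattern NAIVE = Pattern.compile("\\b([A-Za-z_$][$\\w]*)\\b");

    // Collects every token the pattern matches, in order of appearance.
    // Matcher.results() requires Java 9 or later.
    static List<String> find(String code) {
        return NAIVE.matcher(code).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(find("int a, b, c;"));          // prints [int, a, b, c]
        System.out.println(find("if (a > b) c = a - b;")); // prints [if, a, b, c, a, b]
    }
}
```

The reserved words `int` and `if` come back alongside the real identifiers.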

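To exclude the reserved words, a negative look-ahead `(?!...)` is needed to specify a group of tokens that must not match:
`\b(?!(?:_|if|else|for|float|int)\b)([A-Za-z_$][$\w]*)`

  • The look-ahead `(?!(?:_|if|else|for|float|int)\b)` rejects any position where one of the listed words stands as a whole word. The closing `\b` matters: without it, identifiers that merely begin with a reserved word, such as `integer` or `format`, would be rejected too.
  • The capturing group `([A-Za-z_$][$\w]*)` matches the identifier.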
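
The behaviour of the look-ahead is easy to get subtly wrong, so here is a small sketch that compares it with and without a closing `\b` after the word list. The class name `LookaheadDemo`, the helper `find`, and the sample line are made up for illustration.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LookaheadDemo {

    // No terminator after the word list: anything that merely starts
    // with a listed word is rejected.
    static final Pattern PREFIX = Pattern.compile(
            "\\b(?!(?:_\\b|if|else|for|float|int))([A-Za-z_$][$\\w]*)");

    // Word list closed with \b: only whole-word occurrences are rejected.
    static final Pattern WHOLE_WORD = Pattern.compile(
            "\\b(?!(?:_|if|else|for|float|int)\\b)([A-Za-z_$][$\\w]*)");

    // Collects every token the given pattern matches, in order of appearance.
    static List<String> find(Pattern pattern, String code) {
        return pattern.matcher(code).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String code = "int integer = 0; float format = x;";
        System.out.println(find(PREFIX, code));     // prints [x]
        System.out.println(find(WHOLE_WORD, code)); // prints [integer, format, x]
    }
}
```

With the unterminated list, `integer` and `format` are lost because they begin with `int` and `for`; the whole-word form keeps them.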

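However, `\b` treats the dollar sign `$` as a non-word character, so there is no word boundary between, say, a space and a leading `$`. As a result this regular expression fails to match identifiers starting with `$` in full (for ` $n` it returns just `n`).
Also, we may want to exclude matches inside string and character literals (`"not_a_variable"`, `'c'`, `'\u65'`).

This can be done by replacing the leading `\b` with a positive look-behind `(?<=...)`, which checks the character before the main expression without including it in the result:
`(?<=[^$\w'"\\])(?!(?:_|if|else|for|float|int)(?![$\w]))([A-Za-z_$][$\w]*)`

The look-ahead ends with `(?![$\w])` so that only whole reserved words are rejected; names that merely begin with one, such as `integer` or `int$`, still match (a plain `\b` in that place would wrongly reject `int$`). Two limits remain: the look-behind needs a preceding character, so an identifier at the very start of the input is skipped, and only a token directly after a quote or backslash is excluded, so later words of a multi-word string literal still match.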

[Online demo for a short list of reserved words](https://regexr.com/5h6rl)
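
The difference between the word-boundary form and the look-behind form can be checked with a short sketch. The class name `LookbehindDemo`, the helper `find`, and the sample line are made up for illustration; both patterns here reject only whole reserved words, which is a choice of this sketch.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LookbehindDemo {

    // Leading \b: '$' is not a word character, so "$n" is cut down to "n",
    // and the contents of literals are matched as well.
    static final Pattern WORD_BOUNDARY = Pattern.compile(
            "\\b(?!(?:_|if|else|for|float|int)\\b)([A-Za-z_$][$\\w]*)");

    // Leading look-behind: the previous character must not be '$', a word
    // character, a quote, or a backslash.
    static final Pattern LOOK_BEHIND = Pattern.compile(
            "(?<=[^$\\w'\"\\\\])(?!(?:_|if|else|for|float|int)(?![$\\w]))([A-Za-z_$][$\\w]*)");

    // Collects every token the given pattern matches, in order of appearance.
    static List<String> find(Pattern pattern, String code) {
        return pattern.matcher(code).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String code = "int $n = 'c' + \"text\".length();";
        System.out.println(find(WORD_BOUNDARY, code)); // prints [n, c, text, length]
        System.out.println(find(LOOK_BEHIND, code));   // prints [$n, length]
    }
}
```

The method name `length` is reported in both cases, since a method name is an identifier too.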

Next, the full list of the Java reserved words can be collected into a single `String` of tokens separated with `|`.

A test class showing the final pattern and its use to detect Java identifiers is provided below.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class IdFinder {

	// Reserved words plus the literals true, false, and null; "_" is a keyword since Java 9.
	static final List<String> RESERVED = Arrays.asList(
		"abstract", "assert", "boolean", "break", "byte", "case", "catch", "char", "class", "const",
		"continue", "default", "double", "do", "else", "enum", "extends", "false", "final", "finally",
		"float", "for", "goto", "if", "implements", "import", "instanceof", "int", "interface", "long",
		"native", "new", "null", "package", "private", "protected", "public", "return", "short", "static",
		"strictfp", "super", "switch", "synchronized", "this", "throw", "throws", "transient", "true", "try",
		"void", "volatile", "while", "_"
	);

	static final String JAVA_KEYWORDS = String.join("|", RESERVED);

	// Look-behind: the previous character must not be '$', a word character, a quote, or a backslash.
	// Look-ahead: a reserved word is rejected only when no identifier character follows it,
	// so names such as "integer" or "doWork" are still matched.
	static final Pattern VALID_IDENTIFIERS = Pattern.compile(
			"(?<=[^$\\w'\"\\\\])(?!(?:" + JAVA_KEYWORDS + ")(?![$\\w]))([A-Za-z_$][$\\w]*)");

	public static void main(String[] args) {
		System.out.println("ID pattern:\n" + VALID_IDENTIFIERS.pattern());

		String code = "public class Main {\n\tstatic int $1;\n\tprotected char _c0 = '\\u65';\n\tprivate long c1__$$;\n}";

		System.out.println("\nIdentifiers in the following code:\n=====\n" + code + "\n=====");

		// Matcher.results() requires Java 9 or later.
		VALID_IDENTIFIERS.matcher(code).results()
						 .map(MatchResult::group)
						 .forEach(System.out::println);
	}
}
```

Output

```
ID pattern:
(?<=[^$\w'"\\])(?!(?:abstract|assert|boolean|break|byte|case|catch|char|class|const|continue|default|double|do|else|enum|extends|false|final|finally|float|for|goto|if|implements|import|instanceof|int|interface|long|native|new|null|package|private|protected|public|return|short|static|strictfp|super|switch|synchronized|this|throw|throws|transient|true|try|void|volatile|while|_)(?![$\w]))([A-Za-z_$][$\w]*)

Identifiers in the following code:
=====
public class Main {
	static int $1;
	protected char _c0 = '\u65';
	private long c1__$$;
}
=====
Main
$1
_c0
c1__$$
```
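
Coming back to the snippet in the question, the same approach can be applied with a shortened keyword list. This is a sketch rather than the answer's own code: the class name `SnippetIds`, the helper `distinctIds`, and the reduced keyword list are assumptions made for illustration, and the look-ahead here ends in `(?![$\w])` so that only whole reserved words are rejected.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SnippetIds {

    // Keyword list reduced to what the question's snippet uses (assumption of this sketch).
    static final Pattern IDS = Pattern.compile(
            "(?<=[^$\\w'\"\\\\])(?!(?:_|if|else|float|int)(?![$\\w]))([A-Za-z_$][$\\w]*)");

    // Returns the distinct identifiers in order of first appearance.
    static Set<String> distinctIds(String code) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher matcher = IDS.matcher(code);
        while (matcher.find()) {
            ids.add(matcher.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        String code = String.join("\n",
                "int a, b, c;",
                "float d, e;",
                "a = b = 5;",
                "c = 6;",
                "if ( a > b)",
                "{",
                "c = a - b;",
                "e = d - 2.0;",
                "}",
                "else",
                "{",
                "d = e + 6.0;",
                "b = a + c;",
                "}");
        System.out.println(distinctIds(code)); // prints [a, b, c, d, e]
    }
}
```

A `LinkedHashSet` is used so that each identifier is reported once, in the order it first appears.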

huangapple
  • Published on October 27, 2020, 20:09:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64554165.html