Why does ANTLR parser not throw an error for invalid numerical input in Java?

Question

I built the following ANTLR grammar (antlr4-runtime-4.13.0) for a simple condition:

grammar Condition;

@header {
package expression;
}

condition
    :(expression)('OR' expression)*
    ;
	
expression
    : IDENT '=' NUM
    ;
	
IDENT : ('a'..'z' | 'A'..'Z')+;
NUM   : [0-9]+;
WS    : [ \t\r\n]+ -> skip;

I used this Java main class to test it:

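import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStream;

public class TestANTLRGrammar extends ConditionBaseListener {

	public static void main(String[] args) {
		String entry = "id = 889xx88 OR y = 7";
		// Lex the input string and feed the token stream to the parser.
		ConditionLexer lexer = new ConditionLexer(CharStreams.fromString(entry));
		TokenStream tokens = new CommonTokenStream(lexer);
		ConditionParser parser = new ConditionParser(tokens);
		// Parse starting at the 'condition' rule.
		parser.condition();
		// Prints 0 for this input.
		System.out.println(parser.getNumberOfSyntaxErrors());
	}
}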

I expected the parser to throw an error, because "889xx88" shouldn't be considered a number. Instead, the parser recognized "id = 889" and stopped without continuing to the rest of the condition (i.e. "OR y = 7").
The method getNumberOfSyntaxErrors() returned 0.
Can anyone help me fix this problem, please?

Answer 1

Score: 0

For the input id = 889xx88 OR y = 7, the lexer will produce the following 9 tokens:

  • IDENT: id
  • '=': =
  • NUM: 889
  • IDENT: xx
  • NUM: 88
  • 'OR': OR
  • IDENT: y
  • '=': =
  • NUM: 7
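To see why the input splits this way, here is a minimal sketch that mimics the grammar's token rules with java.util.regex. It is not ANTLR itself: the class name TokenDump and the tokenize method are made up for illustration, and unlike ANTLR (which reports a token recognition error and carries on) it simply throws on a character no rule matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the generated ConditionLexer, for illustration only.
class TokenDump {

    // One named group per token rule. "OR" only counts as the 'OR' literal when
    // no letter follows it, which mirrors ANTLR preferring the longest match.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<WS>[ \\t\\r\\n]+)|(?<OR>OR(?![a-zA-Z]))|(?<EQ>=)"
        + "|(?<IDENT>[a-zA-Z]+)|(?<NUM>[0-9]+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unrecognized character at index " + pos);
            }
            if (m.group("WS") == null) { // whitespace is dropped, like '-> skip'
                String type = m.group("OR") != null ? "'OR'"
                            : m.group("EQ") != null ? "'='"
                            : m.group("IDENT") != null ? "IDENT" : "NUM";
                tokens.add(type + ": " + m.group());
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the nine tokens, one per line.
        for (String token : tokenize("id = 889xx88 OR y = 7")) {
            System.out.println(token);
        }
    }
}
```

The point to take away is that NUM stops at the first non-digit and IDENT picks up right where it left off, so 889xx88 is never an invalid token at the lexer level; it is three perfectly valid ones.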

If you now let the parser rule condition consume these tokens, it happily matches IDENT '=' NUM (id = 889) and then stops parsing. Nothing in the rule requires it to consume the remaining tokens, so no syntax error is reported.

As mentioned by kaby76 in the comments: create a start rule that contains the built-in EOF (end-of-file) token to make sure all tokens are consumed (or an error will be reported, if it cannot do so):

start
 : condition EOF
 ;
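The difference between the two entry points can be illustrated with a small hand-written recursive-descent sketch over the token types listed earlier. This is not ANTLR-generated code: the class PrefixVersusEof and its methods are hypothetical, and it throws on the first mismatch, whereas a real ANTLR parser would report the error and try to recover.

```java
import java.util.Arrays;
import java.util.List;

// Hand-written illustration of "rule without EOF" versus "rule with EOF".
class PrefixVersusEof {
    private final List<String> tokens; // token types, in lexer order
    private int pos = 0;

    PrefixVersusEof(List<String> tokens) {
        this.tokens = tokens;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "EOF";
    }

    private void expect(String type) {
        if (!peek().equals(type)) {
            throw new IllegalStateException(
                "expected " + type + " but found " + peek() + " at token " + pos);
        }
        pos++;
    }

    // expression : IDENT '=' NUM ;
    private void expression() {
        expect("IDENT");
        expect("=");
        expect("NUM");
    }

    // condition : expression ('OR' expression)* ;
    // Returns how many tokens were consumed; leftovers are silently ignored.
    int condition() {
        expression();
        while (peek().equals("OR")) {
            pos++;
            expression();
        }
        return pos;
    }

    // start : condition EOF ;
    void start() {
        condition();
        expect("EOF"); // fails if any token is left over
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "IDENT", "=", "NUM", "IDENT", "NUM", "OR", "IDENT", "=", "NUM");
        // prints "condition consumed 3 of 9 tokens"
        System.out.println("condition consumed "
            + new PrefixVersusEof(tokens).condition() + " of " + tokens.size() + " tokens");
        try {
            new PrefixVersusEof(tokens).start();
        } catch (IllegalStateException e) {
            // prints "start failed: expected EOF but found IDENT at token 3"
            System.out.println("start failed: " + e.getMessage());
        }
    }
}
```

Without the EOF check, matching a valid prefix counts as success, which is the behaviour seen in the question.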

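Then call parser.start() instead of parser.condition() from your Java code. Note that the parser will most likely only print the error to STDERR and (try to) continue parsing after it. This is ANTLR's default error recovery mode. If you want to change that, for instance to make the parser throw an exception, try searching for "ANTLR custom error recovery" or "ANTLR custom error handler" (or similar); the runtime's BailErrorStrategy and custom error listeners are the usual starting points.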

huangapple
  • Posted on June 1, 2023, 05:42:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76377485.html