Why does the ANTLR parser not report an error for invalid numerical input in Java?
Question
I built the following ANTLR grammar (antlr4-runtime-4.13.0) for a simple condition:
    grammar Condition;

    @header {
        package expression;
    }

    condition
        : (expression) ('OR' expression)*
        ;

    expression
        : IDENT '=' NUM
        ;

    IDENT : ('a'..'z' | 'A'..'Z')+;
    NUM   : [0-9]+;
    WS    : [ \t\r\n]+ -> skip;
I used this Java main method to test it:
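    // Assumes this class sits in the same package (expression) as the generated classes.
    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.TokenStream;

    public class TestANTLRGrammar extends ConditionBaseListener {
        public static void main(String[] args) {
            String entry = "id = 889xx88 OR y = 7";
            ConditionLexer lexer = new ConditionLexer(CharStreams.fromString(entry));
            TokenStream tokens = new CommonTokenStream(lexer);
            ConditionParser parser = new ConditionParser(tokens);
            parser.condition(); // parse using the rule `condition`
            // Number of syntax errors reported by the parser during the parse
            System.out.println(parser.getNumberOfSyntaxErrors());
        }
    }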
I expected the parser to throw an error, because "889xx88" should not be accepted as a number. Instead, the parser recognized "id = 889" and stopped without processing the rest of the condition (i.e. "OR y = 7").
The call to getNumberOfSyntaxErrors() returned 0.
Can anyone help me fix this problem, please?
Answer 1
Score: 0
For the input "id = 889xx88 OR y = 7", the lexer will produce the following 9 tokens (token type, then matched text):
    IDENT: id
    '=': =
    NUM: 889
    IDENT: xx
    NUM: 88
    'OR': OR
    IDENT: y
    '=': =
    NUM: 7
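The split of "889xx88" into three tokens can be reproduced without ANTLR. The following is only a minimal sketch using java.util.regex: the class LexerSketch and its tokenize method are hypothetical stand-ins for the generated ConditionLexer, and the regular expression approximates the grammar's token rules (the lookahead after OR imitates ANTLR's longest-match rule).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the generated ConditionLexer; this is not ANTLR code. */
final class LexerSketch {
    // One named group per token type. The literals come first, mirroring how
    // ANTLR gives 'OR' and '=' precedence over IDENT when the match lengths tie.
    // The lookahead keeps input such as "ORx" a single IDENT (longest match).
    private static final Pattern TOKEN = Pattern.compile(
        "(?<OR>OR(?![a-zA-Z]))|(?<EQ>=)|(?<IDENT>[a-zA-Z]+)|(?<NUM>[0-9]+)|(?<WS>[ \\t\\r\\n]+)");

    /** Returns tokens formatted as "TYPE:text"; whitespace is skipped. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                // Simplification: ANTLR would report a token recognition error and go on.
                throw new IllegalArgumentException("Unrecognized character at index " + pos);
            }
            if (m.group("WS") == null) { // same effect as `-> skip`
                String type = m.group("OR") != null ? "'OR'"
                            : m.group("EQ") != null ? "'='"
                            : m.group("IDENT") != null ? "IDENT" : "NUM";
                tokens.add(type + ":" + m.group());
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line for the input from the question.
        tokenize("id = 889xx88 OR y = 7").forEach(System.out::println);
    }
}
```

Nothing in the token rules says that a NUM must be followed by whitespace or an operator, so the lexer is free to start an IDENT directly after the digits. The lexer therefore sees no error here; any complaint has to come from the parser.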
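If you now let the parser rule condition consume these tokens, it happily creates IDENT '=' NUM (id = 889) from the first three tokens and then stops parsing. The next token is IDENT (xx), not 'OR', so the optional ('OR' expression)* part matches zero times, and because the rule does not require the end of the input, the leftover tokens are simply never looked at. That is why no syntax error is reported.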
As mentioned by kaby76 in the comments: create a start rule that ends with the built-in EOF (end-of-file) token, so that the parser must consume all tokens (and reports an error if it cannot):
    start
        : condition EOF
        ;
In the Java code, call parser.start() instead of parser.condition() so that this rule is the entry point.
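To see why the trailing EOF matters, here is a hand-written sketch of the two rules as a tiny recursive-descent recognizer over the token types. It is not what ANTLR generates: StartRuleSketch and its methods are hypothetical, and it returns a boolean where ANTLR would report errors and attempt recovery.

```java
import java.util.List;

/** Hypothetical hand-written equivalent of the parser rules; not ANTLR-generated code. */
final class StartRuleSketch {
    private final List<String> types; // token types in the order the lexer produced them
    private int pos = 0;              // index of the next unconsumed token

    StartRuleSketch(List<String> types) { this.types = types; }

    private boolean at(String type) {
        return pos < types.size() && types.get(pos).equals(type);
    }

    private boolean accept(String type) {
        if (at(type)) { pos++; return true; }
        return false;
    }

    // expression : IDENT '=' NUM ;
    private boolean expression() {
        return accept("IDENT") && accept("=") && accept("NUM");
    }

    // condition : expression ('OR' expression)* ;
    boolean condition() {
        if (!expression()) return false;
        while (accept("OR")) {
            if (!expression()) return false;
        }
        return true; // succeeds as soon as the next token is not 'OR'
    }

    // start : condition EOF ;
    boolean start() {
        return condition() && pos == types.size(); // EOF: nothing may be left over
    }

    int consumed() { return pos; }

    public static void main(String[] args) {
        // Token types for "id = 889xx88 OR y = 7"
        List<String> types = List.of("IDENT", "=", "NUM", "IDENT", "NUM", "OR", "IDENT", "=", "NUM");
        StartRuleSketch p = new StartRuleSketch(types);
        System.out.println("condition: " + p.condition() + ", tokens consumed: " + p.consumed());
        System.out.println("start:     " + new StartRuleSketch(types).start());
    }
}
```

With the question's tokens, condition succeeds after consuming only the first three tokens, while start fails because six tokens remain. This is the same difference the EOF token makes in the real grammar.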
Note that, by default, the parser will most likely only print the error to STDERR and (try to) continue parsing after it. This is ANTLR's default error recovery behavior. If you want to change that, for example to make the parser throw an exception on the first error, try searching for "ANTLR custom error recovery" or "ANTLR custom error handler" (or similar); BailErrorStrategy and a custom error listener are common starting points.