Use token tokens in ANTLR4
Question
I ran into a problem with ANTLR, and I wonder whether a grammar like this can work in ANTLR at all. I have prepared a very simplified example below.
grammar test;
test
: statement*
;
statement
: s1
| s2
;
s1
: 'OK' INT
;
s2
: 'ABC' US_INT
;
INT
: S_INT
| US_INT
;
S_INT
: [+-] [0-9]+
;
US_INT
: [0-9]+
;
For `OK 5` everything is fine, but for `ABC 5` I get the following error:
line 1:4 mismatched input '5' expecting US_INT
I ran `grun` with the `-tokens` option, and the token I get here is `INT` instead of `US_INT`:
[@1,4:4='5',<INT>,1:4]
This made me wonder whether such a setup is possible in ANTLR at all. Previously, I tried reordering the tokens, moving `US_INT` out of `INT`, using fragments, and a few other things, but none of it worked well. The only change was that `OK 5` stopped working and `ABC 5` started working. I would like both cases to work without errors.
Answer 1
Score: 1
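The problem you're facing is quite simple: `5` can match both `INT` (since `INT` includes `US_INT` as an alternative) and `US_INT` itself. The lexer runs independently of the parser, so it does not know which token the parser expects. When several lexer rules match the same text, ANTLR picks the rule declared first, so as long as `INT` is declared above `US_INT`, the lexer is going to resolve `5` as `INT`.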
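The tie-breaking described above can be sketched in a few lines of Python. This is only a rough simulation of "longest match wins, ties go to the first declared rule", not ANTLR's actual implementation; the `RULES` table and the `tokenize` helper are hypothetical names, and skipping whitespace is an assumption that the simplified grammar in the question does not spell out.

```python
import re

# Lexer rules in declaration order, mirroring the grammar in the question.
# Literals used in parser rules ('OK', 'ABC') come first, as implicit tokens do.
# INT is written out as the union of S_INT and US_INT.
RULES = [
    ("'OK'", r"OK"),
    ("'ABC'", r"ABC"),
    ("INT", r"[+-][0-9]+|[0-9]+"),
    ("S_INT", r"[+-][0-9]+"),
    ("US_INT", r"[0-9]+"),
]

def tokenize(text, rules=RULES):
    """Longest match wins; on a tie, the rule declared first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # assumption: whitespace is skipped
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("ABC 5"))
```

With this rule order, `5` in `ABC 5` comes out as `INT`, which matches the `grun -tokens` output in the question. Dropping the `INT` entry from the table lets the same input produce `US_INT`.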
To solve it, I'd suggest moving `INT` from the lexer rules to a parser rule, like this:
grammar test;
test
: statement*
;
statement
: s1
| s2
;
s1
: 'OK' int_stmt
;
s2
: 'ABC' US_INT
;
int_stmt
: S_INT | US_INT
;
S_INT
: [+-] [0-9]+
;
US_INT
: [0-9]+
;
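To see why the revised grammar accepts both inputs, here is a small hand-written Python sketch of the same idea: the lexer only knows `S_INT` and `US_INT`, and the choice between them is made at the parser level. It is an illustration under assumptions, not generated ANTLR code; `TOKEN_RE`, `tokenize`, and `parse` are hypothetical names, and whitespace is assumed to be skipped.

```python
import re

# Lexer of the revised grammar: INT is no longer a token, so the
# remaining token patterns cannot match the same text.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<OK>OK)|(?P<ABC>ABC)|(?P<S_INT>[+-][0-9]+)|(?P<US_INT>[0-9]+))"
)

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"token recognition error at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse(text):
    """Return a list of (rule, value) pairs, or raise ValueError."""
    tokens = tokenize(text)
    result, i = [], 0
    while i < len(tokens):
        kind, _ = tokens[i]
        if kind == "OK":
            rule, allowed = "s1", ("S_INT", "US_INT")  # int_stmt : S_INT | US_INT
        elif kind == "ABC":
            rule, allowed = "s2", ("US_INT",)
        else:
            raise ValueError(f"unexpected token {kind}")
        if i + 1 >= len(tokens):
            raise ValueError("unexpected end of input")
        arg_kind, arg_text = tokens[i + 1]
        if arg_kind not in allowed:
            raise ValueError(
                f"mismatched input {arg_text!r}, expecting " + " or ".join(allowed)
            )
        result.append((rule, int(arg_text)))
        i += 2
    return result

print(parse("OK 5"))
print(parse("ABC 5"))
```

Both `OK 5` and `ABC 5` are accepted here because `5` is always lexed as `US_INT`, while `ABC -3` is still rejected since `s2` only allows an unsigned integer.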
Answer 2
Score: 0
If you want to avoid lexer rule priorities in this case, you can use this ABNF parser grammar in Tunnel Grammar Studio, which does not have this issue at all:
test = *statement
statement = s-ok / s-abc
s-ok = "OK" 1*ws int
s-abc = "ABC" 1*ws unsigned-int
int = signed-int / unsigned-int
signed-int = ('+' / '-') unsigned-int
unsigned-int = 1*('0'-'9')
ws = %x20 / %x9 / %xA / %xD
The strings here are matched case-insensitively, as defined in ABNF (RFC 5234). You can also explicitly request case-sensitive or case-insensitive matching per string with `%s"ABC"` or `%i"ABC"`, respectively (RFC 7405). Once you have more statements, some strings will start to overlap; you can then define keywords in the lexer grammar:
keyword = %s"OK" / %s"OK2"
and in the parser grammar do:
s-ok = {keyword, %s"OK"} 1*ws int
s-ok-2 = {keyword, %s"OK2"} 1*ws int 1*ws int
s-ok-any = {keyword} 1*ws int *(ws 0*1 int)
Note that the last rule allows any whitespace between the integers, and any keyword will match.
*I develop Tunnel Grammar Studio. The grammar is quite simple, so the demo may be enough for you.