Antlr pattern matching and lexer modes
Question
[1]: https://github.com/antlr/grammars-v4/tree/master/html
I am trying to compile a parse tree pattern for the [HTML grammar][1]. The code below shows how I parse a string that matches the `htmlAttribute` rule:
String code = "href=\"val\"";
CharStream chars = CharStreams.fromString(code);
Lexer lexer = new HTMLLexer(chars);
lexer.pushMode(HTMLLexer.TAG);
TokenStream tokens = new CommonTokenStream(lexer);
HTMLParser parser = new HTMLParser(tokens);
parser.htmlAttribute();
But when I try to compile the same string as a pattern:
ParseTreePatternMatcher matcher = new ParseTreePatternMatcher(lexer, parser);
matcher.compile(code, HTMLParser.RULE_htmlAttribute);
it fails with this error:
line 1:0 no viable alternative at input 'href="val"'
org.antlr.v4.runtime.NoViableAltException
at org.antlr.v4.runtime.atn.ParserATNSimulator.noViableAlt(ParserATNSimulator.java:2026)
at org.antlr.v4.runtime.atn.ParserATNSimulator.execATN(ParserATNSimulator.java:467)
at org.antlr.v4.runtime.atn.ParserATNSimulator.adaptivePredict(ParserATNSimulator.java:393)
at org.antlr.v4.runtime.ParserInterpreter.visitDecisionState(ParserInterpreter.java:316)
at org.antlr.v4.runtime.ParserInterpreter.visitState(ParserInterpreter.java:223)
at org.antlr.v4.runtime.ParserInterpreter.parse(ParserInterpreter.java:194)
at org.antlr.v4.runtime.tree.pattern.ParseTreePatternMatcher.compile(ParseTreePatternMatcher.java:205)
When I called `tokenize` directly:
List<? extends Token> tokenList = matcher.tokenize(code);
the result contained a single token, the same as when the lexer runs in `DEFAULT_MODE`. Is there a way to fix this?
Answer 1
Score: 1
The problem is the following code in `ParseTreePatternMatcher::tokenize`:
TextChunk textChunk = (TextChunk)chunk;
ANTLRInputStream in = new ANTLRInputStream(textChunk.getText());
lexer.setInputStream(in);
Token t = lexer.nextToken();
`Lexer::setInputStream` clears `_modeStack` and sets `_mode` to 0, so the mode pushed before creating the matcher is lost. One possible solution is to extend `ParseTreePatternMatcher`, override the `tokenize` method, and insert `lexer.pushMode(lexerMode)` after `lexer.setInputStream(in)`:
TextChunk textChunk = (TextChunk)chunk;
ANTLRInputStream in = new ANTLRInputStream(textChunk.getText());
lexer.setInputStream(in);
lexer.pushMode(lexerMode);
Token t = lexer.nextToken();
However, `tokenize` uses `Chunk` and `TextChunk`, which cannot be accessed from outside their package, so the subclass has to be defined in the same package as `ParseTreePatternMatcher`.
Another solution I am considering is to modify the method's bytecode using ASM.
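To make the mechanism concrete, here is a minimal, self-contained sketch that does not use ANTLR at all. `ToyLexer`, `StickyModeLexer` and `ModeResetDemo` are hypothetical names invented for this illustration; the toy only imitates the behaviour described above (setting a new input resets the mode and clears the mode stack) and shows how re-pushing the mode after the reset restores the expected tokens. If `setInputStream` in your ANTLR runtime version goes through an overridable `reset()` method (an assumption worth checking against the runtime source), the same idea might be applied in a subclass of `HTMLLexer` instead of a subclass of the matcher.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of the problem; these classes are NOT part of ANTLR.
class ModeResetDemo {
    public static void main(String[] args) {
        ToyLexer plain = new ToyLexer();
        plain.pushMode(ToyLexer.TAG);
        plain.setInput("href=\"val\"");          // the reset drops TAG mode
        System.out.println(plain.tokenize());     // [TEXT(href="val")]

        ToyLexer sticky = new StickyModeLexer(ToyLexer.TAG);
        sticky.setInput("href=\"val\"");          // TAG mode is re-pushed after the reset
        System.out.println(sticky.tokenize());    // [NAME(href), EQUALS, VALUE("val")]
    }
}

// Hypothetical lexer with two modes, mimicking the reset-on-new-input behaviour.
class ToyLexer {
    static final int DEFAULT_MODE = 0;
    static final int TAG = 1;

    protected int mode = DEFAULT_MODE;
    protected final Deque<Integer> modeStack = new ArrayDeque<>();
    private String input = "";

    void pushMode(int m) {
        modeStack.push(mode);
        mode = m;
    }

    // Forgets every pushed mode and returns to the default mode.
    void reset() {
        mode = DEFAULT_MODE;
        modeStack.clear();
    }

    // Assigning a new input resets the lexer state first.
    void setInput(String text) {
        reset();
        input = text;
    }

    List<String> tokenize() {
        List<String> tokens = new ArrayList<>();
        int eq = input.indexOf('=');
        if (mode == DEFAULT_MODE || eq < 0) {
            // Outside TAG mode the whole input is a single text token.
            tokens.add("TEXT(" + input + ")");
            return tokens;
        }
        // In TAG mode the input is split into name, '=' and value.
        tokens.add("NAME(" + input.substring(0, eq) + ")");
        tokens.add("EQUALS");
        tokens.add("VALUE(" + input.substring(eq + 1) + ")");
        return tokens;
    }
}

// Hypothetical fix: re-enter the desired mode every time the lexer is reset.
class StickyModeLexer extends ToyLexer {
    private final int initialMode;

    StickyModeLexer(int initialMode) {
        this.initialMode = initialMode;
        pushMode(initialMode);
    }

    @Override
    void reset() {
        super.reset();
        pushMode(initialMode);
    }
}
```

The design point is that the mode is restored at the place where it gets lost, so callers that replace the input (as the matcher's `tokenize` does for every text chunk) need no changes.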