ANTLR pattern matching and lexer modes

Question

I am trying to compile a pattern for the [HTML grammar](https://github.com/antlr/grammars-v4/tree/master/html). The code below shows how to parse a string containing an `htmlAttribute`:

    String code = "href=\"val\"";
    CharStream chars = CharStreams.fromString(code);
    Lexer lexer = new HTMLLexer(chars);
    lexer.pushMode(HTMLLexer.TAG);
    TokenStream tokens = new CommonTokenStream(lexer);
    HTMLParser parser = new HTMLParser(tokens);
    parser.htmlAttribute();

But when I try to compile the same string as a pattern:

    ParseTreePatternMatcher matcher = new ParseTreePatternMatcher(lexer, parser);
    matcher.compile(code, HTMLParser.RULE_htmlAttribute);

it fails with the error:

    line 1:0 no viable alternative at input 'href="val"'

    org.antlr.v4.runtime.NoViableAltException
        at org.antlr.v4.runtime.atn.ParserATNSimulator.noViableAlt(ParserATNSimulator.java:2026)
        at org.antlr.v4.runtime.atn.ParserATNSimulator.execATN(ParserATNSimulator.java:467)
        at org.antlr.v4.runtime.atn.ParserATNSimulator.adaptivePredict(ParserATNSimulator.java:393)
        at org.antlr.v4.runtime.ParserInterpreter.visitDecisionState(ParserInterpreter.java:316)
        at org.antlr.v4.runtime.ParserInterpreter.visitState(ParserInterpreter.java:223)
        at org.antlr.v4.runtime.ParserInterpreter.parse(ParserInterpreter.java:194)
        at org.antlr.v4.runtime.tree.pattern.ParseTreePatternMatcher.compile(ParseTreePatternMatcher.java:205)

When I tried:

    List<? extends Token> tokenList = matcher.tokenize(code);

the result contained a single token, the same as when using the lexer in `DEFAULT_MODE`. Is there a way to fix this?

Answer 1

Score: 1

The problem is the following code in `ParseTreePatternMatcher::tokenize`:

    TextChunk textChunk = (TextChunk)chunk;
    ANTLRInputStream in = new ANTLRInputStream(textChunk.getText());
    lexer.setInputStream(in);
    Token t = lexer.nextToken();

`Lexer::setInputStream` clears `_modeStack` and sets `_mode` to 0 (`DEFAULT_MODE`), so the mode pushed before the matcher was created is lost. One possible solution is to extend `ParseTreePatternMatcher`, override `tokenize`, and insert `lexer.pushMode(lexerMode)` after `lexer.setInputStream(in)`, where `lexerMode` holds the mode the pattern text should be lexed in (`HTMLLexer.TAG` in the question):

    TextChunk textChunk = (TextChunk)chunk;
    ANTLRInputStream in = new ANTLRInputStream(textChunk.getText());
    lexer.setInputStream(in);
    lexer.pushMode(lexerMode);
    Token t = lexer.nextToken();

However, `tokenize` uses `Chunk` and `TextChunk`, which cannot be accessed from outside their package, so the extension class has to be defined in the same package as `ParseTreePatternMatcher` (`org.antlr.v4.runtime.tree.pattern`).

Another solution I am considering is to modify the method's bytecode with ASM.
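
As a rough illustration of why the pushed mode is lost and why re-pushing it after the input swap helps, here is a self-contained toy model in plain Java. It does not use the ANTLR runtime: `ToyLexer`, `ToyMatcher`, and `ModeAwareToyMatcher` are hypothetical stand-ins, and the TAG-mode splitting rule is invented for the example. It only mirrors the behaviour described in this answer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy stand-in for a moded lexer. NOT the ANTLR API, only a model of its reset behaviour. */
class ToyLexer {
    static final int DEFAULT_MODE = 0;
    static final int TAG = 1; // hypothetical constant, plays the role of HTMLLexer.TAG

    private final Deque<Integer> modeStack = new ArrayDeque<>();
    private int mode = DEFAULT_MODE;
    private String input = "";

    /** Swapping the input also clears the mode stack and returns to DEFAULT_MODE. */
    void setInput(String text) {
        input = text;
        modeStack.clear();
        mode = DEFAULT_MODE;
    }

    void pushMode(int newMode) {
        modeStack.push(mode);
        mode = newMode;
    }

    /** DEFAULT_MODE yields one text token; TAG splits into name, '=', and value. */
    List<String> allTokens() {
        List<String> tokens = new ArrayList<>();
        if (input.isEmpty()) {
            return tokens;
        }
        int eq = input.indexOf('=');
        if (mode == DEFAULT_MODE || eq < 0) {
            tokens.add(input);
            return tokens;
        }
        tokens.add(input.substring(0, eq));
        tokens.add("=");
        tokens.add(input.substring(eq + 1));
        return tokens;
    }
}

/** Models the original tokenize: the caller's pushed mode does not survive setInput. */
class ToyMatcher {
    protected final ToyLexer lexer;

    ToyMatcher(ToyLexer lexer) {
        this.lexer = lexer;
    }

    List<String> tokenize(String pattern) {
        lexer.setInput(pattern);
        return lexer.allTokens();
    }
}

/** Models the proposed override: push the desired mode again right after the reset. */
class ModeAwareToyMatcher extends ToyMatcher {
    private final int lexerMode;

    ModeAwareToyMatcher(ToyLexer lexer, int lexerMode) {
        super(lexer);
        this.lexerMode = lexerMode;
    }

    @Override
    List<String> tokenize(String pattern) {
        lexer.setInput(pattern);
        lexer.pushMode(lexerMode);
        return lexer.allTokens();
    }
}

class ModeResetDemo {
    public static void main(String[] args) {
        ToyLexer lexer = new ToyLexer();
        lexer.pushMode(ToyLexer.TAG); // lost as soon as tokenize swaps the input
        String code = "href=\"val\"";
        System.out.println(new ToyMatcher(lexer).tokenize(code).size());                        // prints 1
        System.out.println(new ModeAwareToyMatcher(lexer, ToyLexer.TAG).tokenize(code).size()); // prints 3
    }
}
```

In the toy model the subclass can live anywhere because every class is in the same (default) package; with the real `ParseTreePatternMatcher` the package restriction mentioned above still applies.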
