使用makeExprParser处理歧义

huangapple go评论54阅读模式
英文:

Using makeExprParser with ambiguity

问题

我正在尝试将一个基于 CFG 的工具(antlr)中的解析器转换到 Megaparsec 时遇到了问题。

语法包含了用尖括号(<>)括起来并用逗号分隔的表达式列表(使用 makeExprParser 处理)。

&lt;&gt;&lt;23&gt;&lt;23,87&gt; 等等。

问题在于这些表达式本身可能包含 &gt; 运算符(表示“大于”),这会导致我的解析器失败。
例如,&lt;1223&gt;234&gt; 应该解析为 [BinaryExpression ">&gt;" (IntExpr 1223) (IntExpr 234)]

我认为我需要在某个地方巧妙地使用 try,但我尝试过的地方(sepBy 的第一个参数和 makeExprParser 的第一个参数)不幸没有起作用。

在这种情况下,我是否可以在 Megaparsec 中使用 makeExprParser,还是我必须手动编写表达式解析器?

这是我的解析器的相关部分:

-- 使用 megaparsec、text 和 parser-combinators
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Monad.Combinators.Expr
import Data.Text
import Data.Void
import System.Environment
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type BinaryOperator = Text

type Name = Text

data Expr
  = IntExpr Integer
  | BinaryExpression BinaryOperator Expr Expr
  deriving (Eq, Show)

type Parser = Parsec Void Text

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: Text -> Parser Text
symbol = L.symbol sc

sc :: Parser ()
sc = L.space space1 (L.skipLineComment "//") (L.skipBlockCommentNested "/*" "*/")

parseInteger :: Parser Expr
parseInteger = do
  number <- some digitChar
  _ <- sc
  return $ IntExpr $ read number

parseExpr :: Parser Expr
parseExpr = makeExprParser parseInteger [[InfixL (BinaryExpression ">&gt;" <$ symbol ">&gt;")]]

parseBracketList :: Parser [Expr]
parseBracketList = do
  _ <- symbol "&lt;"
  exprs <- sepBy parseExpr (symbol ",")
  _ <- symbol "&gt;"
  return exprs

main :: IO ()
main = do
  text : _ <- getArgs
  let res = runParser parseBracketList "stdin" (pack text)
  case res of
    (Right suc) -> do
      print suc
    (Left err) ->
      putStrLn $ errorBundlePretty err

希望这对你有所帮助。

英文:

I'm currently encountering a problem while translating a parser from a CFG-based tool (antlr) to Megaparsec.

The grammar contains lists of expressions (handled with makeExprParser) that are enclosed in brackets (&lt;, &gt;) and separated by ,.

Stuff like &lt;&gt;, &lt;23&gt;, &lt;23,87&gt; etc.

The problem now is that the expressions may themselves contain the &gt; operator (meaning "greater than"), which causes my parser to fail.
&lt;1223&gt;234&gt; should, for example, be parsed into [BinaryExpression &quot;&gt;&quot; (IntExpr 1223) (IntExpr 234)].

I presume that I have to strategically place try somewhere, but the places I tried (to the first argument of sepBy and the first argument of makeExprParser) did unfortunately not work.

Can I use makeExprParser in such a situation or do I have to manually write the expression parser?:

This is the relevant part of my parser:

-- uses megaparsec, text, and parser-combinators
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Monad.Combinators.Expr
import Data.Text
import Data.Void
import System.Environment
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type BinaryOperator = Text

type Name = Text

data Expr
  = IntExpr Integer
  | BinaryExpression BinaryOperator Expr Expr
  deriving (Eq, Show)

type Parser = Parsec Void Text

lexeme :: Parser a -&gt; Parser a
lexeme = L.lexeme sc

symbol :: Text -&gt; Parser Text
symbol = L.symbol sc

sc :: Parser ()
sc = L.space space1 (L.skipLineComment &quot;//&quot;) (L.skipBlockCommentNested &quot;/*&quot; &quot;*/&quot;)

parseInteger :: Parser Expr
parseInteger = do
  number &lt;- some digitChar
  _ &lt;- sc
  return $ IntExpr $ read number

parseExpr :: Parser Expr
parseExpr = makeExprParser parseInteger [[InfixL (BinaryExpression &quot;&gt;&quot; &lt;$ symbol &quot;&gt;&quot;)]]

parseBracketList :: Parser [Expr]
parseBracketList = do
  _ &lt;- symbol &quot;&lt;&quot;
  exprs &lt;- sepBy parseExpr (symbol &quot;,&quot;)
  _ &lt;- symbol &quot;&gt;&quot;
  return exprs

main :: IO ()
main = do
  text : _ &lt;- getArgs
  let res = runParser parseBracketList &quot;stdin&quot; (pack text)
  case res of
    (Right suc) -&gt; do
      print suc
    (Left err) -&gt;
      putStrLn $ errorBundlePretty err

答案1

得分: 2

您可能误诊了问题。您的解析器在&lt;1233&gt;234&gt;上失败,因为它试图将&gt;解析为左结合运算符,就像+一样。换句话说,就像:

1+2+

会失败,因为第二个+没有右操作数,您的解析器之所以失败,是因为:

1233&gt;234&gt;

在第二个&gt;之后没有数字。假设您不希望您的&gt;运算符链接(即1&gt;2&gt;3不是有效的Expr),您应该首先在makeExprParser表中将InfixL替换为InfixN(非关联)。然后,它将可以成功解析此示例。

不幸的是,无论是否进行此更改,您的解析器仍然会在更简单的测试案例上失败:

&lt;1233&gt;

因为&gt;被解释为继续表达式内的运算符。

换句话说,问题不是您的解析器无法处理带有&gt;字符的表达式,而是它在处理&gt;字符时过于激进,阻止它们被识别为结束的角括号。

要修复这个问题,您需要准确地弄清楚您正在解析的内容。具体来说,您需要通过精确描述&gt;何时可以作为继续表达式的一部分以及何时不能来解决解析器中的歧义。

可能有效的一个规则是,只有在其后跟一个有效的“term”(即parseInteger)时,才将&gt;视为运算符。您可以使用lookAhead来实现这一点。以下解析器:

symbol ">"

只会在后面跟一个有效的term时解析&gt;运算符。如果未找到term,它将消耗一些输入(至少是&gt;符号本身),因此您必须将其包装在try中:

try (symbol ">"

将上述两个修复应用于parseExpr后:

parseExpr :: Parser Expr
parseExpr = makeExprParser term
[[InfixN (BinaryExpression ">" <$ try (symbol ">" <* lookAhead term))]]
where term = parseInteger

您将获得以下解析结果:

λ> parseTest parseBracketList "<23>"
[IntExpr 23]
λ> parseTest parseBracketList "<23,87>"
[IntExpr 23,IntExpr 87]
λ> parseTest parseBracketList "<23,87>18>"
[IntExpr 23,BinaryExpression ">" (IntExpr 87) (IntExpr 18)]

但以下会失败:

λ> parseTest parseBracketList "<23,87>18"
1:10:
|
1 | <23,87>18
|          ^
unexpected end of input
expecting ',', '>', or digit
λ>

因为&gt;后跟18意味着它是有效的运算符,它的失败解析是因为有效表达式87&gt;18之后既不跟逗号也不跟结束的&gt;角括号。

如果您需要解析类似<23,87>18的内容,那么您可能会遇到更大的问题。考虑以下两个测试案例:

<1,2>3,4,5,6,7,...,100000000000,100000000001>
<1,2>3,4,5,6,7,...,100000000000,100000000001

编写一个有效的解析器,可以将第一个解析为10000000000个表达式的列表,但将第二个解析为两个表达式的列表:

[IntExpr 1, IntExpr 2]

然后是一些“额外”文本。希望您要解析的底层“语言”不会如此严重地破裂,以至于这将成为一个问题。

英文:

You've (probably) misdiagnosed the problem. Your parser fails on &lt;1233&gt;234&gt; because it's trying to parse &gt; as a left associative operator, like +. In other words, the same way:

1+2+

would fail, because the second + has no right-hand operand, your parser is failing because:

1233&gt;234&gt;

has no digit following the second &gt;. Assuming you don't want your &gt; operator to chain (i.e., 1&gt;2&gt;3 is not a valid Expr), you should first replace InfixL with InfixN (non-associative) in your makeExprParser table. Then, it will parse this example fine.

Unfortunately, with or without this change your parser will still fail on the simpler test case:

&lt;1233&gt;

because the &gt; is interpreted as an operator within a continuing expression.

In other words, the problem isn't that your parser can't handle expressions with &gt; characters, it's that it's overly aggressive in treating &gt; characters as part of an expression, preventing them from being recognized as the closing angle bracket.

To fix this, you need to figure out exactly what you're parsing. Specifically, you need to resolve the ambiguity in your parser by precisely characterizing the situations where &gt; can be part of a continuing expression and where it can't.

One rule that will probably work is to only consider a &gt; as an operator if it is followed by a valid "term" (i.e., a parseInteger). You can do this with lookAhead. The parser:

symbol &quot;&gt;&quot; &lt;* lookAhead term

will parse a &gt; operator only if it is followed by a valid term. If it fails to find a term, it will consume some input (at least the &gt; symbol itself), so you must surround it with a try:

try (symbol &quot;&gt;&quot; &lt;* lookAhead term)

With the above two fixes applied to parseExpr:

parseExpr :: Parser Expr
parseExpr = makeExprParser term
[[InfixN (BinaryExpression &quot;&gt;&quot; &lt;$ try (symbol &quot;&gt;&quot; &lt;* lookAhead term))]]
where term = parseInteger

you'll get the following parses:

λ&gt; parseTest parseBracketList &quot;&lt;23&gt;&quot;
[IntExpr 23]
λ&gt; parseTest parseBracketList &quot;&lt;23,87&gt;&quot;
[IntExpr 23,IntExpr 87]
λ&gt; parseTest parseBracketList &quot;&lt;23,87&gt;18&gt;&quot;
[IntExpr 23,BinaryExpression &quot;&gt;&quot; (IntExpr 87) (IntExpr 18)]

However, the following will fail:

λ&gt; parseTest parseBracketList &quot;&lt;23,87&gt;18&quot;
1:10:
|
1 | &lt;23,87&gt;18
|          ^
unexpected end of input
expecting &#39;,&#39;, &#39;&gt;&#39;, or digit
λ&gt; 

because the fact that the &gt; is followed by 18 means that it is a valid operator, and it is parse failure that the valid expression 87&gt;18 is followed by neither a comma nor a closing &gt; angle bracket.

If you need to parse something like &lt;23,87&gt;18, you have bigger problems. Consider the following two test cases:

&lt;1,2&gt;3,4,5,6,7,...,100000000000,100000000001&gt;
&lt;1,2&gt;3,4,5,6,7,...,100000000000,100000000001

It's a challenge to write an efficient parser that will parse the first one as a list of 10000000000 expressions but the second one as a list of two expression:

[IntExpr 1, IntExpr 2]

followed by some "extra" text. Hopefully, the underlying "language" you're trying to parse isn't so hopelessly broken that this will be an issue.

huangapple
  • 本文由 发表于 2023年2月16日 05:59:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/75465830.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定