如何解决LR(1)文法定义中的歧义问题?

huangapple go评论80阅读模式
英文:

How to resolve ambiguity in the definition of an LR(1) grammar?

问题

我正在用OCaml编写一个Golang编译器,参数列表让我有些头疼。在Go语言中,你可以按照以下方式将连续的相同类型的参数名分组:

func f(a, b, c int)  ===  func f(a int, b int, c int)

你也可以有一个没有参数名的类型列表:

func g(int, string, int)

这两种风格不能混合使用;要么所有参数都有名字,要么都没有。

我的问题是,当解析器遇到逗号时,它不知道该怎么办。在第一个例子中,a是一个类型的名称,还是一个带有更多变量的变量的名称?逗号有双重作用,我不确定该如何解决这个问题。

我正在使用OCaml的Menhir解析器生成工具。

编辑:目前,我的Menhir语法完全遵循http://golang.org/ref/spec#Function_types中指定的规则。

英文:

I am writing a Golang compiler in OCaml, and argument lists are causing me a bit of a headache. In Go, you can group consecutive parameter names of the same type in the following way:

func f(a, b, c int)  ===  func f(a int, b int, c int)

You can also have a list of types, without parameter names:

func g(int, string, int)

The two styles cannot be mix-and-matched; either all parameters are named or none are.

My issue is that when the parser sees a comma, it doesn't know what to do. In the first example, is a the name of a type or the name of a variable with more variables coming up? The comma has a dual role and I am not sure how to fix this.

I am using the Menhir parser generator tool for OCaml.

Edit: at the moment, my Menhir grammar follows exactly the rules as specified at http://golang.org/ref/spec#Function_types

答案1

得分: 6

如所述,Go语法并不是LALR(1)。实际上,它对于任何k都不是LR(k)。然而,它是无歧义的,所以如果你能找到一个GLR解析器,你可以成功地解析它(我相信OCAML有几个GLR解析器生成器,但我对它们不够了解,无法推荐一个)。

如果你不想(或不能)使用GLR解析器,你可以像Russ Cox在gccgo编译器中所做的那样,使用bison。(bison可以生成GLR解析器,但Cox没有使用该功能。) 他的技术不依赖于扫描器区分类型名称和非类型名称。

相反,它只接受参数列表,其中的元素可以是name_or_typename name_or_type(实际上,由于...语法,还有更多可能性,但这不改变一般原则)。这是简单、无歧义和LALR(1)的,但它接受的范围过广——例如,它将接受func foo(a, b int, c),并且它不会生成正确的抽象语法树,因为它没有将类型附加到正在声明的参数列表上。

这意味着一旦参数列表完全解析并将要插入到与某个函数声明(例如)相关联的AST中时,将执行语义扫描来修复它,并在必要时生成错误消息。该扫描沿着声明元素列表从右到左进行,以便将指定的类型传播到左侧。

值得注意的是,参考手册中的语法也接受范围过广,因为它没有表达“要么所有参数都有名称,要么所有参数都没有名称”的约束。这个约束可以在LR(1)语法中表达出来——我将把它作为读者的练习——但是生成的语法会更难理解。

英文:

As written, the go grammar is not LALR(1). In fact, it is not LR(k) for any k. It is, however, unambiguous, so you could successfully parse it with a GLR parser, if you can find one (I'm pretty sure that there are several GLR parser generators for OCAML, but I don't know enough about any of them to recommend one).

If you don't want to (or can't) use a GLR parser, you can do it the same way Russ Cox did in the gccgo compiler, which uses bison. (bison can generate GLR parsers, but Cox doesn't use that feature.) His technique does not rely on the scanner distinguishing between type-names and non-type-names.

Rather, it just accepts parameter lists whose elements are either name_or_type or name name_or_type (actually, there are more possibilities than that, because of the ... syntax, but it doesn't change the general principle.) That's simple, unambiguous and LALR(1), but it is overly-accepting -- it will accept func foo(a, b int, c), for example -- and it does not produce the correct abstract syntax tree because it doesn't attach the type to the list of parameters being declared.

What that means is that once the argument list is fully parsed and is about to be inserted into the AST attached to some function declaration (for example), a semantic scan is performed to fix it up and, if necessary, produce an error message. That scan is done right-to-left over the list of declaration elements, so that the specified type can be propagated to the left.

It's worth noting that the grammar in the reference manual is also overly-accepting, because it does not express the constraint that "either all parameters are named or none are". That constraint could be expressed in an LR(1) grammar -- I'll leave that as an exercise for readers -- but the resulting grammar would be a lot more difficult to understand.

答案2

得分: 3

你没有歧义。标准的Go解析器是LALR(1)的事实证明了这一点。

是一个类型的名称还是一个带有更多变量的变量的名称?

所以基本上你的语法和解析器整体上应该与符号表完全无关;不要像C一样,你的语法不是歧义的,因此你可以在AST中稍后检查类型名称。

以下是相关的规则(来自http://golang.org/ref/spec);它们已经是正确的。

Parameters     = "(" [ ParameterList [ "," ] ] ")" .
ParameterList  = ParameterDecl { "," ParameterDecl } .
ParameterDecl  = [ IdentifierList ] [ "..." ] Type .
IdentifierList = identifier { "," identifier } .

我会给你解释一下:

IdentifierList = identifier { "," identifier } .

花括号表示克林闭包(在POSIX正则表达式表示法中是星号)。这个规则表示“一个标识符名称,可选地跟随一个逗号和一个标识符,可选地跟随一个逗号和一个标识符,等等...无限循环”。

ParameterDecl  = [ IdentifierList ] [ "..." ] Type .

方括号表示可为空性;这意味着该部分可能存在,也可能不存在(在POSIX正则表达式表示法中是问号)。所以你有“可能是一个IdentifierList,后面可能是一个省略号,后面是一个类型”。

ParameterList  = ParameterDecl { "," ParameterDecl } .

你可以在列表中有多个ParameterDecl,例如func x(a, b int, c, d string)

Parameters     = "(" [ ParameterList [ "," ] ] ")" .

这个规则定义了一个可选的ParameterList,并用括号括起来,可能包括一个可选的最后一个逗号,当你写像这样的东西时很有用:

func x(
    a, b int,
    c, d string, // <- 注意最后的逗号
)

Go语法是可移植的,并且可以被任何具有一个标记向前看的自底向上解析器解析。

关于“不要像C一样”的编辑:我之所以这样说是因为C是上下文敏感的,并且他们在许多(所有?)编译器中解决这个问题的方式是将符号表与词法分析器连接起来,并根据它们是否被定义为类型名称或变量来不同地进行词法分析。这是一个hack,对于没有歧义的语法来说不应该这样做!

英文:

You don't have ambiguity. The fact that the standard Go parser is LALR(1) proves that.

> is a the name of a type or the name of a variable with more variables coming up?

So basically your grammar and the parser as a whole should be completely disconnected from the symbol table; don't be C &ndash; your grammar is not ambiguous therefore you can check the type name later in the AST.

These are the relevant rules (from http://golang.org/ref/spec); they are already correct.

Parameters     = &quot;(&quot; [ ParameterList [ &quot;,&quot; ] ] &quot;)&quot; .
ParameterList  = ParameterDecl { &quot;,&quot; ParameterDecl } .
ParameterDecl  = [ IdentifierList ] [ &quot;...&quot; ] Type .
IdentifierList = identifier { &quot;,&quot; identifier } .

I'll explain them to you:

IdentifierList = identifier { &quot;,&quot; identifier } .

The curly braces represent the kleene-closure (In POSIX regular expression notation it's the asterisk). This rule says "an identifier name, optionally followed by a literal comma and an identifier, optionally followed by a literal comma and an identifier, etc&hellip; ad infinitum"

ParameterDecl  = [ IdentifierList ] [ &quot;...&quot; ] Type .

The square brackets are nullability; this means that that part may or may not be present. (In POSIX regular expression notation it's the question mark). So you have "Maybe an IdentifierList, followed by maybe an ellipsis, followed by a type.

ParameterList  = ParameterDecl { &quot;,&quot; ParameterDecl } .

You can have several ParameterDecl in a list like e.g. func x(a, b int, c, d string).

Parameters     = &quot;(&quot; [ ParameterList [ &quot;,&quot; ] ] &quot;)&quot; .

This rules defines that a ParameterList is optional and to be surrounded by parenthesis and may include an optional final comma literal, useful when you write something like:

func x(
    a, b int,
    c, d string, // &lt;- note the final comma
)

The Go grammar is portable and can be parsed by any bottom-up parser with one token of lookahead.

<hr>

Edit regarding "don't be C": I said this because C is context-sensitive and the way they solve this problem in many (all?) compilers is by wiring the symbol table to the lexer and lexing tokens differently depending on if they are defined as type names or variables. This is a hack and should not be done for unambiguous grammars!

huangapple
  • 本文由 发表于 2014年9月22日 22:21:30
  • 转载请务必保留本文链接:https://go.coder-hub.com/25976394.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定