How to handle the precedence in this case in Flex/Bison

Question

I'm working on parsing the internal configuration files in our project. A configuration value can be a simple string or a "function" that we created for internal use. A function can take string arguments and a special type of argument representing a "condition". Part of my flex token file looks like this:

[a-zA-Z][a-zA-Z0-9_/=\.]*   SAVE_TOKEN; return TSTRING;
"("                         SAVE_TOKEN; return TLPAREN;
")"                         SAVE_TOKEN; return TRPAREN;
","                         SAVE_TOKEN; return TCOMMA;
"=="                        SAVE_TOKEN; return TIFEQUAL;
"!="                        SAVE_TOKEN; return TIFNEQUAL;

Part of my bison parser file looks like this:

condition: expr TIFEQUAL expr { $$ = new NCondition($1, TIFEQUAL, $3);}
          |expr TIFNEQUAL expr { $$ = new NCondition($1, TIFNEQUAL, $3);}
          ;
string:    TSTRING { $$ = new NString(*$1); }
          ;
expr:     string {$<nstring>$ = $1;}
        | string TLPAREN call_args TRPAREN { $$ = new NFunction($1, *$3); }
        ;
call_args: { $$ = new CallArg(); }
        |  expr { $$ = new CallArg(); }
        |  call_args TCOMMA condition { $1->conds.push_back($3); }
        |  call_args TCOMMA expr   { $1->exprs.push_back($3); }
        ;

The conflict is that the string pattern allows the equals sign "=", which is also part of the token TIFEQUAL. Consider a function like this:

function(arg1, arg2, arg_cond==condition)

The scanner will match the whole of arg_cond==condition as a TSTRING token and never produce TIFEQUAL. I did some research and learned that flex is greedy: if two patterns both match, it picks the longest match. Does that mean my conflict has to be resolved at the bison level? If so, how should I handle this case?

Answer 1

Score: 1

This happens because the lex recognizer always matches the longest possible token in the input stream (as you note), and it can only be "fixed" by changing your lex patterns. There is nothing you can do in bison, because by then it is too late: the string has already been recognized as a single TSTRING token.
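The longest-match rule can be reproduced outside flex. The following is a minimal Python sketch, not flex itself: the `RULES` table and the `tokenize` helper are hypothetical names, the patterns are copied from the question, and skipping whitespace is an assumption because the question shows only part of the token file. At each position it tries every rule and keeps the longest match, with the earlier rule winning a tie, which is how flex chooses.

```python
import re

# Rules copied from the question, in the same order as the flex file.
RULES = [
    ("TSTRING",   re.compile(r"[a-zA-Z][a-zA-Z0-9_/=.]*")),
    ("TLPAREN",   re.compile(r"\(")),
    ("TRPAREN",   re.compile(r"\)")),
    ("TCOMMA",    re.compile(r",")),
    ("TIFEQUAL",  re.compile(r"==")),
    ("TIFNEQUAL", re.compile(r"!=")),
]

def tokenize(text, rules=RULES):
    """Split text into (token, lexeme) pairs using longest match."""
    pos, out = 0, []
    while pos < len(text):
        # Assumption: whitespace is skipped (not shown in the question).
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the earliest rule wins when two matches tie.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        out.append((best[0], best[1].group()))
        pos = best[1].end()
    return out

print(tokenize("function(arg1, arg2, arg_cond==condition)"))
```

With these rules the last argument comes out as one TSTRING lexeme, arg_cond==condition, so no TIFEQUAL token ever reaches the grammar.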

One obvious possibility is to remove the = from the TSTRING pattern, as = is not normally a legal character in a name/identifier. If you really want to allow some = in them, you need to decide exactly when they should be part of a TSTRING vs a separate operator. You could disallow two consecutive = or an = at the end of the TSTRING:

[a-zA-Z](=?[a-zA-Z0-9_/\.])*     SAVE_TOKEN; return TSTRING;

which would cause your input of arg_cond==condition to be recognized as 3 tokens instead of 1. However, something like a==b==c would then result in a syntax error, when in theory it could be recognized as a==b == c or as a == b==c, and it is not clear which it should be.
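A quick way to see what the revised pattern accepts is to try it on a few inputs. This is a small Python sketch using the `re` module rather than flex; `longest_prefix` is a hypothetical helper, and for this particular pattern Python's greedy match coincides with the longest match that flex would take.

```python
import re

# Revised TSTRING pattern from the answer: every '=' must be followed by
# a non-'=' name character, so "==" and a trailing '=' are excluded.
TSTRING = re.compile(r"[a-zA-Z](=?[a-zA-Z0-9_/.])*")

def longest_prefix(text):
    """Return the prefix of text that TSTRING would consume, or None."""
    m = TSTRING.match(text)
    return m.group() if m else None

for sample in ["key=value", "arg_cond==condition", "name=", "a==b==c"]:
    print(f"{sample!r} -> {longest_prefix(sample)!r}")
```

Here key=value stays a single string, while arg_cond==condition stops after arg_cond and leaves == to be matched as TIFEQUAL. The same happens for a==b==c, which becomes five tokens and is then rejected by the grammar.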

huangapple
  • Posted on May 7, 2023 at 11:42:23
  • When reposting, please keep this article's link: https://go.coder-hub.com/76192100.html