How to handle the precedence in this case in Flex/Bison
Question
I'm working on a parser for the internal configuration files in our project. A configuration value can be a simple string or a "function" that we created for internal use. A function can take string-type arguments and a special type representing a "condition". Part of my flex token file looks like this:
[a-zA-Z][a-zA-Z0-9_/=.]* SAVE_TOKEN; return TSTRING;
"(" SAVE_TOKEN; return TLPAREN;
")" SAVE_TOKEN; return TRPAREN;
"," SAVE_TOKEN; return TCOMMA;
"==" SAVE_TOKEN; return TIFEQUAL;
"!=" SAVE_TOKEN; return TIFNEQUAL;
Part of my bison parser file looks like this:
condition: expr TIFEQUAL expr { $$ = new NCondition($1, TIFEQUAL, $3);}
| expr TIFNEQUAL expr { $$ = new NCondition($1, TIFNEQUAL, $3);}
;
string: TSTRING { $$ = new NString(*$1); }
;
expr: string { $<nstring>$ = $1; }
| string TLPAREN call_args TRPAREN { $$ = new NFunction($1, *$3); }
;
call_args: { $$ = new CallArg(); }
| expr { $$ = new CallArg(); }
| call_args TCOMMA condition { $1->conds.push_back($3); }
| call_args TCOMMA expr { $1->exprs.push_back($3); }
;
Here the conflict is, a string type allows the equal sign "=", which is also part of the token TIFEQUAL. Consider a function like this:
function(arg1, arg2, arg_cond==condition)
Parser will try to match the TSTRING token rather than TIFEQUAL. I did some research and I realized that flex is greedy and will try to match the longest one if two patterns both match. Does that mean my conflict has to be resolved at the bison level? If yes, how should I handle this case?
Answer 1
Score: 1
This happens because the lex recognizer always matches the longest possible token in the input stream (as you note), and it can only be "fixed" by changing your lex patterns. There is nothing you can do in bison, because by then it is too late: the string has already been recognized as a single TSTRING token.
One obvious possibility is to remove the "=" from the TSTRING pattern, since that is normally not a legal character in a name/identifier. If you really want to allow some "=" in them, you need to decide exactly when they should be part of a TSTRING and when they form a separate operator. You could disallow two consecutive "=" or an "=" at the end of the TSTRING:
[a-zA-Z](=?[a-zA-Z0-9_/\.])* SAVE_TOKEN; return TSTRING;
which would cause your input of arg_cond==condition to be recognized as 3 tokens instead of 1. However, something like a==b==c would then result in a syntax error, when in theory it could be recognized as "a==b == c" or "a == b==c"; which of the two it should be is not clear.
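To see the effect of the two TSTRING patterns without rebuilding the scanner, the longest-match rule can be emulated in a few lines of Python. This is only a rough sketch, not flex itself: the helper names make_rules and tokenize are hypothetical, whitespace is skipped by assumption (the question does not show a whitespace rule), and an unmatched character raises an error instead of being echoed as flex's default rule would do. For these particular patterns Python's greedy matching yields the same longest match that flex would pick.

```python
import re

def make_rules(string_pattern):
    # Hypothetical stand-ins for the flex rules; token names mirror the question.
    return [
        ("TSTRING", re.compile(string_pattern)),
        ("TLPAREN", re.compile(r"\(")),
        ("TRPAREN", re.compile(r"\)")),
        ("TCOMMA", re.compile(r",")),
        ("TIFEQUAL", re.compile(r"==")),
        ("TIFNEQUAL", re.compile(r"!=")),
    ]

def tokenize(text, rules):
    """Emulate flex: at each position take the longest match; ties go to the earliest rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # assumption: whitespace is ignored by the real scanner
            continue
        best_name, best_len = None, 0
        for name, regex in rules:
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}: {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Pattern from the question: "=" is allowed anywhere after the first character.
ORIGINAL = make_rules(r"[a-zA-Z][a-zA-Z0-9_/=.]*")
# Pattern from the answer: "=" only when followed by another name character.
REVISED = make_rules(r"[a-zA-Z](=?[a-zA-Z0-9_/.])*")

if __name__ == "__main__":
    for label, rules in (("original", ORIGINAL), ("revised", REVISED)):
        print(label, tokenize("function(arg1, arg2, arg_cond==condition)", rules))
```

With the original pattern the whole of arg_cond==condition comes back as one TSTRING; with the revised pattern it splits into TSTRING, TIFEQUAL, TSTRING, while a single embedded "=" such as key=value still stays inside one TSTRING.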