Semicolon insertion à la Google Go with flex
Question
I'm interested in adding semicolon insertion à la Google Go to my flex file.
From the Go documentation:
> Semicolons
> ==========
>
> Like C, Go's formal grammar uses semicolons to terminate statements;
> unlike C, those semicolons do not appear in the source. Instead the
> lexer uses a simple rule to insert semicolons automatically as it
> scans, so the input text is mostly free of them.
>
> The rule is this. If the last token before a newline is an identifier
> (which includes words like int and float64), a basic literal such as a
> number or string constant, or one of the tokens
>
> break continue fallthrough return ++ -- ) }
>
> the lexer always inserts a semicolon after the token. This could be
> summarized as, “if the newline comes after a token that could end a
> statement, insert a semicolon”.
>
> A semicolon can also be omitted immediately before a closing brace, so
> a statement such as
>
> go func() { for { dst <- <-src } }()
>
> needs no semicolons. Idiomatic Go programs have semicolons only in
> places such as for loop clauses, to separate the initializer,
> condition, and continuation elements. They are also necessary to
> separate multiple statements on a line, should you write code that
> way.
>
> One caveat. You should never put the opening brace of a control
> structure (if, for, switch, or select) on the next line. If you do, a
> semicolon will be inserted before the brace, which could cause
> unwanted effects. Write them like this
>
> if i < f() {
> g()
> }
>
> not like this
>
> if i < f() // wrong!
> { // wrong!
> g() // wrong!
> } // wrong!
How would I go about doing this? Specifically, how can I insert tokens into the stream, and how can I see the last token that was matched to decide whether inserting a semicolon is appropriate?
I am using bison too, but Go seems to do semicolon insertion entirely in its lexer.
Answer 1
Score: 9
You could pass the lexer's result tokens through a function that inserts semicolons where necessary. When it detects that a semicolon must be inserted, the token that triggered the insertion can be put back into the input stream, so that it is lexed again on the next call.
Below is an example that inserts a SEMICOLON before a newline, when it follows a WORD. The bison file "insert.y" is this:
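%{
#include <stdio.h>
#include <stdlib.h>   /* free() */

int yylex(void);      /* provided by the flex-generated scanner */
int yyparse(void);    /* generated by bison; declared here so main() can call it */

void yyerror(const char *str) {
    printf("ERROR: %s\n", str);
}

int main(void) {
    yyparse();
    return 0;
}
%}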
%union {
char *string;
}
%token <string> WORD
%token SEMICOLON NEWLINE
%%
input:
      /* empty */
    | input WORD       {printf("WORD: %s\n", $2); free($2);}
    | input SEMICOLON  {printf("SEMICOLON\n");}
    ;
%%
and the lexer is generated by flex from this:
%{
#include <string.h>
#include "insert.tab.h"
int f(int token);
%}
%option noyywrap
%%
[ \t] ;
[^ \t\n;]+ {yylval.string = strdup(yytext); return f(WORD);}
; {return f(SEMICOLON);}
\n {int token = f(NEWLINE); if (token != NEWLINE) return token;}
%%
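/* Nonzero when the previous token was a WORD, i.e. a newline should become a SEMICOLON. */
int insert = 0;

int f(int token) {
    if (insert && token == NEWLINE) {
        unput('\n');            /* lex the newline again on the next call */
        insert = 0;
        return SEMICOLON;
    } else {
        insert = (token == WORD);
        return token;
    }
}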
For input
abc def
ghi
jkl;
it prints
WORD: abc
WORD: def
SEMICOLON
WORD: ghi
SEMICOLON
WORD: jkl
SEMICOLON
Unputting a non-constant token requires a little extra work; I have tried to keep the example simple, just to give the idea.
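One way to sidestep unputting altogether is to hold the delayed token in a one-token slot instead of pushing its characters back. The following is only a sketch of that idea in plain C, not part of the answer above: the token codes, `struct tok`, `raw_next` and `filter_next` are all hypothetical names, the raw scanner is simulated by an array, and only a WORD is treated as able to end a statement (a simplification of Go's rule). In addition to newlines, it inserts a semicolon before a closing brace or end of input when the previous token was a WORD.

```c
#include <stddef.h>

/* Hypothetical token codes; a real project would take these from the
   bison-generated header instead. */
enum { T_EOF = 0, T_WORD, T_SEMICOLON, T_NEWLINE, T_RBRACE };

/* A token together with its semantic value (NULL for fixed tokens). */
struct tok {
    int code;
    const char *text;
};

/* Stand-in for the raw scanner: hands out tokens from an array.
   The array must end with a T_EOF entry. */
struct source {
    const struct tok *toks;
    size_t pos;
};

static struct tok raw_next(struct source *s) {
    struct tok t = s->toks[s->pos];
    if (t.code != T_EOF)
        s->pos++;                   /* keep returning T_EOF at the end */
    return t;
}

/* Filter state: whether the last token handed out can end a statement,
   plus a one-token slot for a token waiting behind an inserted semicolon. */
struct filter {
    struct source src;
    int can_end;
    int has_pending;
    struct tok pending;
};

static struct tok filter_next(struct filter *f) {
    const struct tok semi = { T_SEMICOLON, NULL };
    for (;;) {
        struct tok t;
        if (f->has_pending) {       /* deliver the delayed token first */
            t = f->pending;
            f->has_pending = 0;
        } else {
            t = raw_next(&f->src);
        }

        if (t.code == T_NEWLINE) {
            if (f->can_end) {       /* newline after a statement-ending token */
                f->can_end = 0;
                return semi;
            }
            continue;               /* all other newlines are dropped */
        }
        if ((t.code == T_RBRACE || t.code == T_EOF) && f->can_end) {
            f->pending = t;         /* keep code and value for the next call */
            f->has_pending = 1;
            f->can_end = 0;
            return semi;
        }
        f->can_end = (t.code == T_WORD);
        return t;
    }
}
```

Because the delayed token is stored as a code and value pair, the same slot works for tokens whose text varies. With flex and bison, a wrapper like this could presumably serve as `yylex` itself, with the generated scanner renamed through flex's `YY_DECL` macro; that wiring is not shown here.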
Answer 2
Score: 1
Alter the lexer rules for '\n' and '}' to look at the last token returned by the lexer. This requires that every rule in your lexer record the token it returns.
Then your newline rule will look like this:
\n      { if (newline_is_semi(last_token)) {
              last_token = SEMICOLON;   /* record what this rule returns */
              return SEMICOLON;
          }
        }
newline_is_semi checks whether last_token is one of the tokens in the list you quoted.
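As a rough illustration, `newline_is_semi` could be a simple switch over token codes. The enum below is hypothetical and stands in for whatever token codes the bison grammar actually defines; the set of cases mirrors the list quoted in the question (identifiers, basic literals, and `break continue fallthrough return ++ -- ) }`).

```c
#include <stdbool.h>

/* Hypothetical token codes standing in for the bison-generated ones. */
enum token {
    TOK_NONE = 0,
    TOK_IDENT, TOK_INT_LIT, TOK_FLOAT_LIT, TOK_STRING_LIT,
    TOK_BREAK, TOK_CONTINUE, TOK_FALLTHROUGH, TOK_RETURN,
    TOK_INC, TOK_DEC, TOK_RPAREN, TOK_RBRACE,
    TOK_LBRACE, TOK_PLUS, TOK_SEMICOLON
};

/* Returns true when a newline following last_token should be turned
   into a semicolon, per the rule quoted in the question. */
bool newline_is_semi(enum token last_token) {
    switch (last_token) {
    case TOK_IDENT:          /* identifiers, including words like int */
    case TOK_INT_LIT:        /* basic literals */
    case TOK_FLOAT_LIT:
    case TOK_STRING_LIT:
    case TOK_BREAK:
    case TOK_CONTINUE:
    case TOK_FALLTHROUGH:
    case TOK_RETURN:
    case TOK_INC:            /* ++ */
    case TOK_DEC:            /* -- */
    case TOK_RPAREN:         /* )  */
    case TOK_RBRACE:         /* }  */
        return true;
    default:
        return false;        /* operators, '{', ';', start of input, ... */
    }
}
```

Starting with `last_token` set to `TOK_NONE` means that leading blank lines in a file produce no semicolons.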
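To handle the optional semicolon before a closing brace: when matching '}', check whether last_token was SEMICOLON; if not, unput the '}' and return SEMICOLON. On the next call the '}' is matched again, this time with last_token equal to SEMICOLON, and is returned normally.
"}"     { if (last_token != SEMICOLON) {
              unput('}');               /* match the brace again on the next call */
              last_token = SEMICOLON;
              return SEMICOLON;
          }
          last_token = '}';             /* assumes the grammar uses the literal '}' as its token */
          return '}';
        }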
Answer 3
Score: 0
One simple way is to create a global flag variable:
%{
int ins_token = 0;
%}
Suppose you want a SEMICOLON inserted after ")". The rule for ")" sets ins_token = 1, and the rules for all other tokens reset it to 0.
When a "\n" arrives, check the flag: if ins_token == 1, return SEMICOLON; otherwise ignore the newline. Either way, reset ins_token to 0.
ins_token acts as a flag. Set it whenever the token just matched is one after which a SEMICOLON should be inserted.
The flag is needed because flex doesn't remember the previous token.
[\n] { if (ins_token == 1) { ins_token = 0; return SEMICOLON; } }
")" { ins_token = 1; }
...other tokens
... { ins_token = 0; }