Participle is stating Unexpected Token

Question

I am playing with participle (a Go parser library) to learn how to parse, and I cannot determine why this token is unexpected.

    // nolint: golint, dupl
    package main

    import (
        "fmt"
        "io"

        "github.com/alecthomas/participle/v2"
        "github.com/alecthomas/participle/v2/lexer"
    )

    var htaccessLexer = lexer.MustSimple([]lexer.SimpleRule{
        {"Comment", `^#[^\n]*`},
        {"Ident", `^\w+`},
        {"Int", `\d+`},
        {"String", `("(\\"|[^"])*"|\S+)`},
        {"EOL", `[\n\r]+`},
        {"whitespace", `[ \t]+`},
    })

    type HTACCESS struct {
        Directives []*Directive `@@*`
    }

    type Directive struct {
        Pos           lexer.Position
        ErrorDocument *ErrorDocument `@@`
    }

    type ErrorDocument struct {
        Code int    `"ErrorDocument" @Int`
        Path string `@String`
    }

    var htaccessParser = participle.MustBuild[HTACCESS](
        participle.Lexer(htaccessLexer),
        participle.CaseInsensitive("Ident"),
        participle.Unquote("String"),
        participle.Elide("whitespace"),
    )

    func Parse(r io.Reader) (*HTACCESS, error) {
        program, err := htaccessParser.Parse("", r)
        if err != nil {
            return nil, err
        }
        return program, nil
    }

    func main() {
        v, err := htaccessParser.ParseString("", `ErrorDocument 403 test`)
        if err != nil {
            panic(err)
        }
        fmt.Println(v)
    }

From what I can tell, this seems correct: I expect 403 to be captured by @Int, but I am not sure why the parser does not recognize it.

Edit:
I changed my lexer to this:

    var htaccessLexer = lexer.MustSimple([]lexer.SimpleRule{
        {"dir", `^\w+`},
        {"int", `\d+`},
        {"str", `("(\\"|[^"])*"|\S+)`},
        {"EOL", `[\n\r]+`},
        {"whitespace", `\s+`},
    })

And the error is gone, but it still prints an empty array, and I am not sure why. I am also unsure why using different names for the lexer rules fixed the error.

Answer 1

Score: 2

I believe I found the issue: it is the order. Ident was matching numbers in my lexer via \w, so my integers were being tokenized as Ident.

I found that I have to separate QuotedString and UnQuotedString, otherwise the unquoted-string rule was picking up integers. Alternatively, I could ensure it only matches non-numeric values, but that would miss things like stringwithnum2.

Here is my solution:

    var htaccessLexer = lexer.MustSimple([]lexer.SimpleRule{
        {"Comment", `(?i)#[^\n]*`},
        {"QuotedString", `"(\\"|[^"])*"`},
        {"Number", `[-+]?(\d*\.)?\d+`},
        {"UnQuotedString", `[^ \t]+`},
        {"Ident", `^[a-zA-Z_]`},
        {"EOL", `[\n\r]+`},
        {"whitespace", `[ \t]+`},
    })

    type ErrorDocument struct {
        Pos  lexer.Position
        Code int    `"ErrorDocument" @Number`
        Path string `(@QuotedString | @UnQuotedString)`
    }

This fixed my issue, because the lexer now matches quoted strings first, then numbers, then unquoted strings.

huangapple · Posted on 2023-02-19 04:33:58 · Source: https://go.coder-hub.com/75496233.html