Scanner terminating early

Question

I am trying to write a scanner in Go that scans continuation lines and also cleans each line up before returning it, so that it returns logical lines. So, given the following split function:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func ScanLogicalLines(data []byte, atEOF bool) (int, []byte, error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}

	// Find the first newline; while it is preceded by a backslash,
	// look for the next one (with debug output).
	i := bytes.IndexByte(data, '\n')
	for i > 0 && data[i-1] == '\\' {
		fmt.Printf("i: %d, data[i] = %q\n", i, data[i])
		i = i + bytes.IndexByte(data[i+1:], '\n')
	}

	var match []byte
	advance := 0
	switch {
	case i >= 0:
		advance, match = i+1, data[0:i]
	case atEOF:
		advance, match = len(data), data
	}
	// Drop the backslash-newline pairs to join continuation lines.
	token := bytes.Replace(match, []byte("\\\n"), []byte(""), -1)
	return advance, token, nil
}

func main() {
	simple := `
Just a test.

See what is returned. \
when you have empty lines.

Followed by a newline.
`

	scanner := bufio.NewScanner(strings.NewReader(simple))
	scanner.Split(ScanLogicalLines)
	for scanner.Scan() {
		fmt.Printf("line: %q\n", scanner.Text())
	}
}

I expected the code to return something like:

line: "Just a test."
line: ""
line: "See what is returned, when you have empty lines."
line: ""
line: "Followed by a newline."

However, it stops after returning the first line. The second call returns 1, "", nil.

Does anybody have any ideas, or is this a bug?

Answer 1

Score: 7

I would regard this as a bug, because an advance value greater than 0 is not supposed to trigger a further read, even when the returned token is nil (see bufio.SplitFunc):

> If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, SplitFunc can return (0, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.

What happens is this

The input buffer of bufio.Scanner defaults to 4096 bytes. That means it reads up to that amount at once, if it can, and then executes the split function. In your case the scanner can read your input all at once, as it is well below 4096 bytes. This means that the next read it does results in EOF, which is the main problem here.

Step by step

  1. scanner.Scan reads all your data
  2. Your split function gets all the text that is there
  3. You look for a token and find the first newline, which is the only thing on the first line
  4. You return nil as the token, because removing the newline from the match leaves nothing
  5. scanner.Scan assumes that the user needs more data
  6. scanner.Scan attempts to read more
  7. EOF happens
  8. scanner.Scan tries to tokenize one last time
  9. You find "Just a test."
  10. scanner.Scan is called again and runs your split function once more
  11. You look for a token and find the third line, which is only a newline
  12. You return nil as the token, for the same reason as in step 4
  13. scanner.Scan sees the nil token and, since EOF has already been reached, stops
  14. Execution ends

How to circumvent

Any non-nil token will prevent this. As long as you return non-nil tokens, the
scanner will not check for EOF and will continue executing your tokenizer.

The reason your code returns nil tokens is that bytes.Replace returns nil when
it has nothing to replace in an empty slice: it copies its input with append,
and append([]byte(nil), nil...) == nil.
You could prevent this by returning a slice with capacity but no elements, as
this would be non-nil: make([]byte, 0, 1) != nil.

Posted by huangapple on 2013-11-13 04:28:39
Source: https://go.coder-hub.com/19939219.html