Scanner terminating early
Question
I am trying to write a scanner in Go that scans continuation lines and also cleans each line up before returning it, so that it returns logical lines. So, given the following split function (Play):
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func ScanLogicalLines(data []byte, atEOF bool) (int, []byte, error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	i := bytes.IndexByte(data, '\n')
	for i > 0 && data[i-1] == '\\' {
		fmt.Printf("i: %d, data[i] = %q\n", i, data[i])
		i = i + bytes.IndexByte(data[i+1:], '\n')
	}
	var match []byte = nil
	advance := 0
	switch {
	case i >= 0:
		advance, match = i+1, data[0:i]
	case atEOF:
		advance, match = len(data), data
	}
	token := bytes.Replace(match, []byte("\\\n"), []byte(""), -1)
	return advance, token, nil
}

func main() {
	simple := `
Just a test.

See what is returned. \
when you have empty lines.

Followed by a newline.
`
	scanner := bufio.NewScanner(strings.NewReader(simple))
	scanner.Split(ScanLogicalLines)
	for scanner.Scan() {
		fmt.Printf("line: %q\n", scanner.Text())
	}
}
I expected the code to return something like:
line: "Just a test."
line: ""
line: "See what is returned, when you have empty lines."
line: ""
line: "Followed by a newline."
However, it stops after returning the first line. The second call returns 1, "", nil.
Does anybody have any ideas, or is it a bug?
Answer 1
Score: 7
I would regard this as a bug because an advance value > 0 is not intended to make a further read call, even when the returned token is nil (bufio.SplitFunc):
> If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, SplitFunc can return (0, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.
What happens is this
The input buffer of the bufio.Scanner defaults to 4096 bytes. That means it reads up to this amount at once if it can and then executes the split function. In your case the scanner can read your input all at once, as it is well below 4096 bytes. This means that the next read it performs results in EOF, which is the main problem here.
Step by step
- scanner.Scan reads all your data
  - You get all the text that is there
  - You look for a token, you find the first newline which is only one newline
  - You return nil as a token by removing the newline from the match
- scanner.Scan assumes: user needs more data
- scanner.Scan attempts to read more
- EOF happens
- scanner.Scan tries to tokenize one last time
  - You find "Just a test."
- scanner.Scan tries to tokenize one last time
  - You look for a token, you find the third line which is only one newline
  - You return nil as a token by removing the newline from the match
- scanner.Scan sees the nil token and sets the error (EOF)
- Execution ends
How to circumvent
Any token that is non-nil will prevent this. As long as you return non-nil tokens, the scanner will not check for EOF and will continue to execute your tokenizer.
The reason why your code returns nil tokens is that bytes.Replace returns nil when it is given an empty slice and there is nothing to replace: it copies its input with append, and append([]byte(nil), nil...) == nil.
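A quick way to check this is to call bytes.Replace the same way the split function does. This is a small sketch; replaceIsNil is a helper name invented here, and the nil result for empty input is an implementation detail of bytes.Replace rather than something its documentation promises.

```go
package main

import (
	"bytes"
	"fmt"
)

// replaceIsNil (hypothetical helper) reports whether bytes.Replace, called
// with the same arguments as in the question, returns a nil slice for in.
func replaceIsNil(in []byte) bool {
	return bytes.Replace(in, []byte("\\\n"), []byte(""), -1) == nil
}

func main() {
	data := []byte("\nJust a test.\n")
	// data[0:0] is what the split function passes for a blank line.
	fmt.Println("empty match is nil:", replaceIsNil(data[0:0]))
	// A non-empty match is copied, so the result is non-nil.
	fmt.Println("non-empty match is nil:", replaceIsNil(data[1:13]))
}
```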
You could prevent this by returning a slice with a capacity and no elements, as this would be non-nil: make([]byte, 0, 1) != nil.
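Applied to the split function from the question, the workaround could look roughly like the sketch below. Two things in it are my own assumptions rather than part of the explanation above: the continuation search is rewritten as a helper (logicalLineEnd, a made-up name) so that it finds the first newline not preceded by a backslash, and the function asks for more data when no complete line is available yet. logicalLines is likewise just a helper for the demonstration.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// logicalLineEnd (hypothetical helper) returns the index of the first
// newline in data that is not preceded by a backslash, or -1 if none exists.
func logicalLineEnd(data []byte) int {
	for off := 0; off < len(data); {
		j := bytes.IndexByte(data[off:], '\n')
		if j < 0 {
			return -1
		}
		j += off
		if j == 0 || data[j-1] != '\\' {
			return j
		}
		off = j + 1 // escaped newline: keep looking after it
	}
	return -1
}

// ScanLogicalLines is a variant of the question's split function that
// never returns a nil token together with advance > 0.
func ScanLogicalLines(data []byte, atEOF bool) (int, []byte, error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	var match []byte
	advance := 0
	i := logicalLineEnd(data)
	switch {
	case i >= 0:
		advance, match = i+1, data[:i]
	case atEOF:
		advance, match = len(data), data
	default:
		return 0, nil, nil // no complete line yet: request more data
	}
	token := bytes.Replace(match, []byte("\\\n"), []byte(""), -1)
	if token == nil {
		token = []byte{} // empty but non-nil, so the Scanner keeps going
	}
	return advance, token, nil
}

// logicalLines collects every token the Scanner reports for input.
func logicalLines(input string) []string {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(ScanLogicalLines)
	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines
}

func main() {
	input := "\nJust a test.\n\nSee what is returned. \\\nwhen you have empty lines.\n\nFollowed by a newline.\n"
	for _, line := range logicalLines(input) {
		fmt.Printf("line: %q\n", line)
	}
}
```

Because the input starts with a newline, the first token reported here should be an empty line, followed by "Just a test." and the remaining logical lines, with the backslash-newline pair removed from the continued line.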