2013-11-13 04:28:39 · go

Scanner terminating early

Question

I am trying to write a scanner in Go that scans continuation lines and also cleans up each line before returning it, so that it returns logical lines. Given the following split function:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func ScanLogicalLines(data []byte, atEOF bool) (int, []byte, error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}

	// Find the first newline; while it is preceded by a backslash, move on toward the next one.
	i := bytes.IndexByte(data, '\n')
	for i > 0 && data[i-1] == '\\' {
		fmt.Printf("i: %d, data[i] = %q\n", i, data[i])
		i = i + bytes.IndexByte(data[i+1:], '\n')
	}

	var match []byte = nil
	advance := 0
	switch {
	case i >= 0:
		advance, match = i+1, data[0:i]
	case atEOF:
		advance, match = len(data), data
	}
	// Drop every backslash-newline pair so the token is one logical line.
	token := bytes.Replace(match, []byte("\\\n"), []byte(""), -1)
	return advance, token, nil
}

func main() {
	simple := `
Just a test.

See what is returned. \
when you have empty lines.

Followed by a newline.
`

	scanner := bufio.NewScanner(strings.NewReader(simple))
	scanner.Split(ScanLogicalLines)
	for scanner.Scan() {
		fmt.Printf("line: %q\n", scanner.Text())
	}
}

I expected the code to return something like:

line: "Just a test."
line: ""
line: "See what is returned, when you have empty lines."
line: ""
line: "Followed by a newline."

However, it stops after returning the first line. The second call returns 1, "", nil.

Does anybody have any ideas, or is this a bug?

Answer 1

Score: 7

I would regard this as a bug, because an advance value > 0 is not meant to trigger a further read call, even when the returned token is nil (see bufio.SplitFunc):

> If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, SplitFunc can return (0, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.

What happens is this

The input buffer of the bufio.Scanner defaults to 4096 bytes. That means it reads up to this amount at once if it can, and then executes the split function. In your case the scanner can read your input all at once, as it is well below 4096 bytes. This means that the next read it attempts results in EOF, which is the main problem here.

Step by step

1. scanner.Scan reads all your data.
2. Your split function receives all the text there is.
3. You look for a token and find the first line, which consists of only a newline.
4. You return nil as the token, because removing the newline leaves an empty match.
5. scanner.Scan assumes that the user needs more data.
6. scanner.Scan attempts to read more.
7. EOF happens.
8. scanner.Scan tries to tokenize one last time.
9. You find "Just a test."
10. scanner.Scan tries to tokenize one last time.
11. You look for a token and find the third line, which consists of only a newline.
12. You return nil as the token, again because the match is empty.
13. scanner.Scan sees the nil token and sets the error (EOF).
14. Execution ends.

How to circumvent

Any non-nil token will prevent this. As long as you return non-nil tokens, the scanner will not check for EOF and will continue calling your tokenizer.

The reason your code returns nil tokens is that bytes.Replace returns a copy of its input when there is nothing to replace, and that copy is nil for an empty match: append([]byte(nil), nil...) == nil. You could prevent this by returning a slice with some capacity and no elements, as this would be non-nil: make([]byte, 0, 1) != nil.
