July 24, 2014, 04:06:51 · go


Strange behavior of bufio.Scanner reading file line-by-line

Question

I use bufio.Scanner to read a file line by line into the variable wordlist ([][]byte).

This is the code (tested with Go 1.1 and 1.3).

package main
import (
	"bufio"
	"fmt"
	"log"
	"os"
)
func main() {
	fle, err := os.Open("words.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer fle.Close()
	scanner := bufio.NewScanner(fle)

	n := 1000
	dCnt := 5
	var wordlist [][]byte
	for scanner.Scan() {
		if len(wordlist) == n {
			break
		}
		word := scanner.Bytes()
		for ii := 0; ii < len(wordlist); ii++ {
			if string(word) == string(wordlist[ii]) {
				log.Println(ii, string(word), string(wordlist[ii]))
				log.Println(len(wordlist), "double")
				dCnt--
				if dCnt == 0 {
					for i, v := range wordlist {
						fmt.Println(i, string(v))
					}
					log.Fatal("double")
				}
			}
		}
		wordlist = append(wordlist, word)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

words.txt is a file with 5040 lines, one for each permutation of the sequence "abcdefg":

line 1 .. 
abcdefg
abcdegf
abcdfeg
abcdfge
..
line 510 ..
afcdbge
afcdebg
afcdegb
afcdgbe
afcdgeb
.. line 5040

It was generated by this small Python 2 script:

from itertools import permutations as perm
c = "abcdefg"
p = perm(c, len(c))
with file('words.txt','wb') as outFle:
    for i in xrange(5040):
        n = ''.join(p.next())
        print >> outFle, n

The problem is that, after running the Go program above, wordlist contains the following:

index string(wordlist[])

0 afcdebg      <-- this is line 513 of words.txt
1 afcdegb
2 afcdgbe
3 afcdgeb
...
510 bdefcag
511 bdefcga
512 afcdebg    <-- this is the start of a repetition of lines 513 .. 1024 of words.txt
513 afcdegb
514 afcdgbe

Instead, wordlist should contain the first 1000 lines of words.txt.

Any ideas?

The answer was given by Daniel Darabos (see below)

Changing

word := scanner.Bytes()

to

word := scanner.Text()

did the job (wordlist then has to be a []string).

(Thanks for your help!)

Answer 1

Score: 10

The documentation of Scanner.Bytes says:

> The underlying array may point to data that will be overwritten by a subsequent call to Scan.

So if you save the returned slice, you can expect to see its contents change. This wreaks havoc in your application. Better to not save the returned slice!

A nice solution is to build a string from the bytes:

word := string(scanner.Bytes())

Then you can work with strings everywhere and the code becomes more pleasant.

What is going on?

Why does Scanner.Bytes hate me? The answer is also in the documentation:

> It does no allocation.

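This makes the Scanner nicely efficient. From what you see, I guess it reads into one fixed-size internal buffer and reuses it: your lines are 8 bytes each (7 letters plus a newline), so 512 of them fill exactly 4096 bytes, and once the buffer is refilled every slice you saved earlier shows the new data.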

This is not a problem in applications where you do not need to keep references to the lines. (For example a grep-like program only looks at each line once.) Often you parse the line and store a reference to that. But if you want to store the raw byte data, you are responsible for copying it out from the Scanner.

This may be a hassle, but while you can implement the convenient behavior on top of the inconvenient one, it would be impossible to implement the efficient behavior on top of the inefficient one.

Also a simpler script for generating the input:

import itertools
for p in itertools.permutations('abcdefg'):
  print(''.join(p))
