Strange behavior of bufio.Scanner when reading a file line by line

Question

I use bufio.Scanner to read a file line by line into the variable wordlist ([][]byte).

This is the code (tested with Go 1.1 / 1.3).

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	fle, err := os.Open("words.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer fle.Close()

	scanner := bufio.NewScanner(fle)

	n := 1000
	dCnt := 5
	var wordlist [][]byte

	for scanner.Scan() {
		if len(wordlist) == n {
			break
		}
		word := scanner.Bytes()
		for ii := 0; ii < len(wordlist); ii++ {
			if string(word) == string(wordlist[ii]) {
				log.Println(ii, string(word), string(wordlist[ii]))
				log.Println(len(wordlist), "double")

				dCnt--
				if dCnt == 0 {
					for i, v := range wordlist {
						fmt.Println(i, string(v))
					}
					log.Fatal("double")
				}
			}
		}
		wordlist = append(wordlist, word)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

words.txt is a file of 5040 lines, each a permutation of the sequence "abcdefg":

line 1 .. 
abcdefg
abcdegf
abcdfeg
abcdfge
..
line 510 ..
afcdbge
afcdebg
afcdegb
afcdgbe
afcdgeb
.. line 5040

It was generated by this small Python 2 script:

from itertools import permutations as perm
c = "abcdefg"
p = perm(c, len(c))
with file('words.txt','wb') as outFle:
    for i in xrange(5040):
        n = ''.join(p.next())
        print >> outFle, n

The problem is that after running the above Go program, wordlist contains the following:

index string(wordlist[])

0 afcdebg      <-- this is line 513 of words.txt
1 afcdegb
2 afcdgbe
3 afcdgeb
...
510 bdefcag
511 bdefcga
512 afcdebg    <-- this is the beginning of a repetition of lines 513 .. 1024 of words.txt
513 afcdegb
514 afcdgbe

Instead, wordlist should contain the first 1000 lines of words.txt.

Any ideas?

The answer was given by Daniel Darabos (see below).

Changing

word := scanner.Bytes()

to

word := scanner.Text()

did the job (wordlist then has to be declared as []string, because Text returns a string).

(Thanks for your help!)

Answer 1

Score: 10

The documentation of Scanner.Bytes says:

> The underlying array may point to data that will be overwritten by a subsequent call to Scan.

So if you save the returned slice, you can expect its contents to change. This can wreak havoc in your application, so it is better not to save the returned slice!
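
The effect is ordinary slice aliasing and can be reproduced without a Scanner. The following is a minimal sketch: buf is only a stand-in for the scanner's internal buffer, and retainAndOverwrite is a hypothetical helper name chosen for illustration.

```go
package main

import "fmt"

// retainAndOverwrite keeps a slice into buf, then overwrites buf the way a
// reader refilling its buffer would. buf is a stand-in for the scanner's
// internal buffer (an assumption for illustration, not the real internals).
func retainAndOverwrite() (before, after string) {
	buf := []byte("abcdefg\n")
	word := buf[:7] // shares memory with buf; no copy is made
	before = string(word)
	copy(buf, "afcdebg\n") // simulate the next read landing in the same buffer
	after = string(word)
	return before, after
}

func main() {
	before, after := retainAndOverwrite()
	fmt.Println(before, after) // prints "abcdefg afcdebg"
}
```

The variable word is never assigned a second time, yet its contents change, because it only points into memory owned by buf.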

A nice solution is to build a string from the bytes:

word := string(scanner.Bytes())

Then you can work with strings everywhere and the code becomes more pleasant.

What is going on?

Why does Scanner.Bytes hate me? The answer is also in the documentation:

> It does no allocation.

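This makes the Scanner nicely efficient. From what you see, I would guess that it hands out slices of one internal buffer that holds 512 of your lines (512 lines of 8 bytes each, newline included, are exactly 4096 bytes) and overwrites that buffer when it reads more data.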

This is not a problem in applications where you do not need to keep references to the lines. (For example, a grep-like program only looks at each line once.) Often you parse the line and store only the parsed result. But if you want to store the raw byte data, you are responsible for copying it out of the Scanner.

This may be a hassle, but while you can implement the convenient behavior on top of the inconvenient one, it would be impossible to implement the efficient behavior on top of the inefficient one.


Also, here is a simpler script for generating the input:

import itertools
for p in itertools.permutations('abcdefg'):
  print(''.join(p))

huangapple
  • Published on July 24, 2014, 04:06:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/24919968.html