Strange behavior of bufio.Scanner reading file line-by-line
Question
I use bufio.Scanner to read a file line by line into the variable wordlist ([][]byte).
This is the code (tested with Go 1.1 / 1.3).
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	fle, err := os.Open("words.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer fle.Close()

	scanner := bufio.NewScanner(fle)
	n := 1000
	dCnt := 5
	var wordlist [][]byte
	for scanner.Scan() {
		if len(wordlist) == n {
			break
		}
		word := scanner.Bytes()
		for ii := 0; ii < len(wordlist); ii++ {
			if string(word) == string(wordlist[ii]) {
				log.Println(ii, string(word), string(wordlist[ii]))
				log.Println(len(wordlist), "double")
				dCnt--
				if dCnt == 0 {
					for i, v := range wordlist {
						fmt.Println(i, string(v))
					}
					log.Fatal("double")
				}
			}
		}
		wordlist = append(wordlist, word)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
words.txt is a file of 5040 lines, one for each permutation of the sequence "abcdefg":
line 1 ..
abcdefg
abcdegf
abcdfeg
abcdfge
..
line 510 ..
afcdbge
afcdebg
afcdegb
afcdgbe
afcdgeb
.. line 5040
It was generated by this small Python 2 script:
from itertools import permutations as perm
c = "abcdefg"
p = perm(c, len(c))
with file('words.txt','wb') as outFle:
    for i in xrange(5040):
        n = ''.join(p.next())
        print >> outFle, n
The problem is that after running the above Go program, wordlist contains the following:
index string(wordlist[])
0 afcdebg <-- this is line 513 of words.txt
1 afcdegb
2 afcdgbe
3 afcdgeb
...
510 bdefcag
511 bdefcga
512 afcdebg <-- this is the beginning of a repetition of lines 513 .. 1024 of words.txt
513 afcdegb
514 afcdgbe
Instead, wordlist should contain the first 1000 lines of words.txt.
Any ideas?
The answer was given by Daniel Darabos (see below).
Changing
word := scanner.Bytes()
to
word := scanner.Text()
did the job (wordlist then has to be a []string, since Text returns a string).
(Thanks for your help!)
Answer 1
Score: 10
The documentation of Scanner.Bytes says:
> The underlying array may point to data that will be overwritten by a subsequent call to Scan.
So if you save the returned slice, you can expect to see its contents change. This wreaks havoc in your application. Better to not save the returned slice!
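The effect can be reproduced without words.txt. The following is only a sketch under assumptions: the helper name countStale and the numeric test lines are made up for illustration, and it relies on the Scanner reusing its internal buffer once the input is larger than that buffer, which the documentation allows but does not promise in detail.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countStale (hypothetical helper) scans `total` distinct 8-byte lines, keeps
// the slices returned by scanner.Bytes() without copying them, and afterwards
// counts how many kept slices no longer hold the line read at that position.
func countStale(total int) (stale int) {
	var sb strings.Builder
	for i := 0; i < total; i++ {
		fmt.Fprintf(&sb, "%07d\n", i) // 7 characters + newline, like the 7-letter words
	}

	scanner := bufio.NewScanner(strings.NewReader(sb.String()))
	var kept [][]byte
	for scanner.Scan() {
		kept = append(kept, scanner.Bytes()) // no copy: aliases the Scanner's buffer
	}

	for i, b := range kept {
		if string(b) != fmt.Sprintf("%07d", i) {
			stale++
		}
	}
	return stale
}

func main() {
	fmt.Println("stale slices out of 1024:", countStale(1024))
}
```

A non-zero count means that some of the saved slices were overwritten by later calls to Scan, which is the same symptom as the repeated entries in the question.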
A nice solution is to build a string from the bytes:
word := string(scanner.Bytes())
Then you can work with strings everywhere and the code becomes more pleasant.
What is going on?
Why does Scanner.Bytes hate me? The answer is also in the documentation:
> It does no allocation.
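This makes the Scanner nicely efficient. From what you see, I guess it reuses one fixed-size internal buffer instead of allocating memory per line: every line in words.txt is 8 bytes including the newline, so 512 lines fill 4096 bytes, which would explain why the slices saved for lines 1 to 512 show lines 513 to 1024 once the buffer is refilled.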
This is not a problem in applications where you do not need to keep references to the lines. (For example, a grep-like program only looks at each line once.) Often you parse the line and store a reference to the parsed result. But if you want to store the raw byte data, you are responsible for copying it out from the Scanner.
This may be a hassle, but while you can implement the convenient behavior on top of the inconvenient one, it would be impossible to implement the efficient behavior on top of the inefficient one.
Also, a simpler script for generating the input:
import itertools
for p in itertools.permutations('abcdefg'):
    print(''.join(p))