Reading in very large files
Question
I'm trying to read in a file with 200+ columns and 1000+ rows. I use the following code:
    var result []string
    file, err := os.Open("t8.txt")
    if err != nil {
        fmt.Println(err)
    }
    defer file.Close()
    scan := bufio.NewScanner(file)
    for scan.Scan() {
        result = append(result, scan.Text())
    }
    fmt.Println(scan.Err()) // token too long
However, when I print out the results, all I get is the first line because it says the token is too long. When I try it on smaller files, it works fine. Is there a way in Go that I could scan in large files?
Answer 1
Score: 6
As already pointed out by @Dave C in the comments, you are running into bufio.MaxScanTokenSize = 64 * 1024, the largest token (here, a line) that a Scanner will accept with its default buffer.
To get around that limitation, use bufio.Reader, which has a ReadString(delim byte) method that seems appropriate for your case.
From the Scanner go doc (specifically the last sentence):
> Scanner provides a convenient interface for reading data such as a
> file of newline-delimited lines of text. Successive calls to the Scan
> method will step through the 'tokens' of a file, skipping the bytes
> between the tokens. The specification of a token is defined by a split
> function of type SplitFunc; the default split function breaks the
> input into lines with line termination stripped. Split functions are
> defined in this package for scanning a file into lines, bytes,
> UTF-8-encoded runes, and space-delimited words. The client may instead
> provide a custom split function.
>
> Scanning stops unrecoverably at EOF, the first I/O error, or a token
> too large to fit in the buffer. When a scan stops, the reader may have
> advanced arbitrarily far past the last token. Programs that need more
> control over error handling or large tokens, or must run sequential
> scans on a reader, should use bufio.Reader instead.
Answer 2
Score: 0
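You can change the Scanner's default buffer and raise the maximum token size with Scanner.Buffer, called before the first Scan:
    sc := bufio.NewScanner(r)
    buf := make([]byte, 0, 64*1024) // initial buffer: 64 KiB
    sc.Buffer(buf, 1024*1024)       // allow tokens (lines) up to 1 MiB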