Reading in very large files

Question

I'm trying to read in a file with 200+ columns and 1000+ rows. I use the following code:

var result []string

file, err := os.Open("t8.txt")
if err != nil {
  fmt.Println(err)
}
defer file.Close()
scan := bufio.NewScanner(file)
for scan.Scan() {
  result = append(result, scan.Text())
}

fmt.Println(scan.Err()) // token too long

However, when I print out the results, all I get is the first line because it says the token is too long. When I try it on smaller files, it works fine. Is there a way in Go that I could scan in large files?

Answer 1

Score: 6

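As already pointed out by @Dave C in the comments, you are running into bufio.MaxScanTokenSize = 64 * 1024. That is the default maximum size of a single token (here, one line), not a limit on the size of the file.

To get around that limitation, use bufio.Reader, which has a ReadString(delim byte) method that seems appropriate for your case.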

From the Scanner go doc (specifically the last sentence):

> Scanner provides a convenient interface for reading data such as a
> file of newline-delimited lines of text. Successive calls to the Scan
> method will step through the 'tokens' of a file, skipping the bytes
> between the tokens. The specification of a token is defined by a split
> function of type SplitFunc; the default split function breaks the
> input into lines with line termination stripped. Split functions are
> defined in this package for scanning a file into lines, bytes,
> UTF-8-encoded runes, and space-delimited words. The client may instead
> provide a custom split function.
>
> Scanning stops unrecoverably at EOF, the first I/O error, or a token
> too large to fit in the buffer. When a scan stops, the reader may have
> advanced arbitrarily far past the last token. Programs that need more
> control over error handling or large tokens, or must run sequential
> scans on a reader, should use bufio.Reader instead.

Answer 2

Score: 0

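You can raise the scanner's buffer limit with Scanner.Buffer (available since Go 1.6); call it before the first Scan:

sc := bufio.NewScanner(r)        // r is any io.Reader, e.g. the opened file
buf := make([]byte, 0, 64*1024)  // initial buffer; the scanner grows it as needed
sc.Buffer(buf, 1024*1024)        // allow tokens (lines) up to about 1 MiB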

huangapple
  • Posted on April 4, 2015, 10:13:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/29442006.html