Question

I'm trying to read a file with 200+ columns and 1000+ rows. I use the following code:

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	var result []string

	file, err := os.Open("t8.txt")
	if err != nil {
		fmt.Println(err)
		return // nothing to scan if the file could not be opened
	}
	defer file.Close()

	scan := bufio.NewScanner(file)
	for scan.Scan() {
		result = append(result, scan.Text())
	}

	fmt.Println(scan.Err()) // token too long
}

However, when I print the results, all I get is the first line, because the scanner reports that the token is too long. When I try it on smaller files, it works fine. Is there a way in Go to scan large files?

Answer 1

Score: 6

As already pointed out by @Dave C in the comments, you are running into MaxScanTokenSize = 64 * 1024, the largest token (here, a line) that a bufio.Scanner will buffer by default.

To get around that limitation, use bufio.Reader, which has a ReadString(delim byte) method that seems appropriate for your case.

From the Scanner go doc (specifically the last sentence):

> Scanner provides a convenient interface for reading data such as a
> file of newline-delimited lines of text. Successive calls to the Scan
> method will step through the 'tokens' of a file, skipping the bytes
> between the tokens. The specification of a token is defined by a split
> function of type SplitFunc; the default split function breaks the
> input into lines with line termination stripped. Split functions are
> defined in this package for scanning a file into lines, bytes,
> UTF-8-encoded runes, and space-delimited words. The client may instead
> provide a custom split function.
>
> Scanning stops unrecoverably at EOF, the first I/O error, or a token
> too large to fit in the buffer. When a scan stops, the reader may have
> advanced arbitrarily far past the last token. Programs that need more
> control over error handling or large tokens, or must run sequential
> scans on a reader, should use bufio.Reader instead.
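
As a rough illustration of the bufio.Reader approach suggested above, here is a minimal sketch. The helper names readLines and sampleInput are hypothetical, and the input is made-up in-memory data standing in for the file from the question; in the real program an *os.File from os.Open would be passed instead.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// readLines is a hypothetical helper: it reads r line by line with
// bufio.Reader.ReadString, which has no fixed per-line size limit.
func readLines(r io.Reader) ([]string, error) {
	var lines []string
	br := bufio.NewReader(r)
	for {
		line, err := br.ReadString('\n')
		if len(line) > 0 {
			// ReadString keeps the delimiter; trim it (and a CR, if present).
			lines = append(lines, strings.TrimRight(line, "\r\n"))
		}
		if errors.Is(err, io.EOF) {
			return lines, nil // end of input is not an error here
		}
		if err != nil {
			return lines, err
		}
	}
}

// sampleInput builds made-up data: one 100,000-byte line (longer than
// bufio.MaxScanTokenSize) followed by a short line.
func sampleInput() io.Reader {
	return strings.NewReader(strings.Repeat("x", 100000) + "\nshort\n")
}

func main() {
	lines, err := readLines(sampleInput())
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	for i, l := range lines {
		fmt.Printf("line %d: %d bytes\n", i, len(l))
	}
}
```

Unlike Scanner, ReadString returns the line with its trailing delimiter, and it reports io.EOF together with any final unterminated line, which is why the sketch appends the data before checking the error.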

Answer 2

Score: 0

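You can change the scanner's default buffer:

sc := bufio.NewScanner(r)       // r is any io.Reader, e.g. the opened file
buf := make([]byte, 0, 64*1024) // initial buffer; the Scanner grows it as needed
sc.Buffer(buf, 1024*1024)       // allow tokens (lines) up to about 1 MiB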
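
To show how this fits into a complete read loop, here is a minimal sketch. The helper names scanLines and longLineInput are hypothetical, the 1 MiB limit is an assumed value that should be sized to the longest expected line, and the in-memory input stands in for the file from the question.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// scanLines is a hypothetical helper: it scans r line by line, allowing
// lines up to roughly maxLine bytes instead of the 64 KiB default.
func scanLines(r io.Reader, maxLine int) ([]string, error) {
	sc := bufio.NewScanner(r)
	// Buffer must be called before the first Scan. The Scanner starts with
	// this 64 KiB buffer and grows it as needed, up to maxLine.
	sc.Buffer(make([]byte, 0, 64*1024), maxLine)

	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	// Err reports bufio.ErrTooLong if a line still exceeds the limit.
	return lines, sc.Err()
}

// longLineInput builds made-up data: one 100,000-byte line (longer than
// bufio.MaxScanTokenSize) followed by a short line.
func longLineInput() io.Reader {
	return strings.NewReader(strings.Repeat("x", 100000) + "\nshort\n")
}

func main() {
	lines, err := scanLines(longLineInput(), 1024*1024)
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for i, l := range lines {
		fmt.Printf("line %d: %d bytes\n", i, len(l))
	}
}
```

This keeps the Scanner API from the question, but it still has an upper bound: a line longer than the chosen limit stops the scan with the same "token too long" error, so checking Err after the loop remains necessary.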
