Go. Working with huge csv files
Question
We have a big dataset: a few dozen CSV files of about 130 GB each.
We need to emulate SQL queries on these CSV tables.
When we read a 1.1 GB test file with encoding/csv, the program allocates 526 GB of virtual memory. Why? Does csv.Reader work like a generator when we call reader.Read(), or does it keep the rows in memory?
Full code after code review.
UPD
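We read the file like this:
rf, err := os.Open(input_file)
if err != nil {
	log.Fatalf("Error: %s", err)
}
defer rf.Close()
r := csv.NewReader(rf)
for {
	record, err := r.Read()
	if err == io.EOF {
		break // end of file reached
	}
	if err != nil {
		log.Fatalf("Error: %s", err)
	}
	_ = record // process the row here
}
It fails on the line record, err := r.Read() with a memory error.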
UPD2
Snapshot of memory during the read process:
      flat  flat%   sum%        cum   cum%
 2731.44MB 94.63% 94.63%  2731.44MB 94.63%  encoding/csv.(*Reader).parseRecord
     151MB  5.23% 99.86%  2885.96MB   100%  main.main
         0     0% 99.86%  2731.44MB 94.63%  encoding/csv.(*Reader).Read
         0     0% 99.86%  2886.49MB   100%  runtime.goexit
         0     0% 99.86%  2886.49MB   100%  runtime.main
Answer 1
Score: 4
Most likely the line breaks aren't being detected, so the reader treats everything as a single record.
https://golang.org/src/encoding/csv/reader.go?s=4071:4123#L124
If you follow the code to line 210, you'll see it looks for '\n'.
I often see line breaks written as \n\r when some system exported the file, thinking it was being Windows-smart when in fact it's wrong. The correct Windows line break is \r\n.
Alternatively, you can write a custom reader (for example, one built around a bufio.Scanner with its own split function) that delimits the lines using whatever convention your input follows, and pass it as the io.Reader input to your csv.Reader. That would let you handle, for example, the invalid \n\r I mentioned above.