Go: working with huge CSV files

Question


We have a large dataset: a few dozen CSV files of about 130 GB each.
We need to emulate SQL queries over these CSV tables.

When we read a 1.1 GB test file with encoding/csv, the program allocates 526 GB of virtual memory. Why? Does csv.Reader work like a generator, returning one row per reader.Read() call, or does it keep the rows in memory?

Full code, after code review.

Update

We read the file like this:

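rf, err := os.Open(input_file)
if err != nil {
	log.Fatalf("Error: %s", err) // Fatalf, since a format string is used
}
defer rf.Close()
r := csv.NewReader(rf)
for {
	record, err := r.Read() // the memory error occurs on this line
	if err == io.EOF {
		break // end of file reached
	}
	if err != nil {
		log.Fatalf("Error: %s", err)
	}
	_ = record // process the row here
}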

It fails on the line record, err := r.Read() with a memory error.

Update 2

Snapshot of memory during the read process:

 2731.44MB 94.63% 94.63%  2731.44MB 94.63%  encoding/csv.(*Reader).parseRecord
     151MB  5.23% 99.86%  2885.96MB   100%  main.main
         0     0% 99.86%  2731.44MB 94.63%  encoding/csv.(*Reader).Read
         0     0% 99.86%  2886.49MB   100%  runtime.goexit
         0     0% 99.86%  2886.49MB   100%  runtime.main

Answer 1

Score: 4


Most likely the line breaks aren't being detected, so the reader treats the whole file as a single record.

https://golang.org/src/encoding/csv/reader.go?s=4071:4123#L124

If you follow the code to line 210, you'll see that it looks for '\n'.

I often see line breaks written as \n\r by some exporting system whose authors thought they were being Windows-smart, when in fact that order is wrong. The correct Windows line break is \r\n.

Alternatively, you can write a custom Scanner that delimits the lines using whatever convention your input follows, and feed its output to your csv.Reader as the io.Reader. For example, it could handle the invalid \n\r mentioned above.

Posted by huangapple on April 5, 2016 at 09:17:29
Original link: https://go.coder-hub.com/36415530.html