Efficient CSV reading and writing in Go

Question

The Go code below reads a 10,000-record CSV (timestamp times and float values), runs some operations on the data, and then writes the original values to another CSV along with an additional score column. However, it is terribly slow (it takes hours, though most of that time is spent in calculateStuff()), and I'm curious whether there are any inefficiencies in the CSV reading/writing that I can fix.

package main

import (
	"encoding/csv"
	"log"
	"os"
	"strconv"
)

// ReadCSV loads every record of the CSV file at filepath into memory.
func ReadCSV(filepath string) ([][]string, error) {
	csvfile, err := os.Open(filepath)
	if err != nil {
		return nil, err
	}
	defer csvfile.Close()

	reader := csv.NewReader(csvfile)
	// pass any parse error back to the caller instead of dropping it
	return reader.ReadAll()
}

func main() {
	// load data csv
	records, err := ReadCSV("./path/to/datafile.csv")
	if err != nil {
		log.Fatal(err)
	}

	// write results to a new csv
	outfile, err := os.Create("./where/to/write/resultsfile.csv")
	if err != nil {
		log.Fatal("Unable to open output")
	}
	defer outfile.Close()
	writer := csv.NewWriter(outfile)

	for i, record := range records {
		time := record[0]
		value := record[1]

		// copy the header row through, adding the new column name
		if i == 0 {
			writer.Write([]string{time, value, "score"})
			continue
		}

		// get float values
		floatValue, err := strconv.ParseFloat(value, 64)
		if err != nil {
			log.Fatalf("Record: %v, Error: %v", value, err)
		}

		// calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
		score := calculateStuff(floatValue)

		valueString := strconv.FormatFloat(floatValue, 'f', 8, 64)
		scoreString := strconv.FormatFloat(score, 'f', 8, 64)
		//fmt.Printf("Result: %v\n", []string{time, valueString, scoreString})

		writer.Write([]string{time, valueString, scoreString})
	}

	writer.Flush()
}

I'm looking for help making this CSV read/write template code as fast as possible. For the scope of this question we need not worry about the calculateStuff method.

Answer 1

Score: 22

You're loading the whole file into memory first and then processing it, which can be slow with a big file.

Instead, loop, call .Read, and process one record at a time.

// Requires the "encoding/csv", "io", and "log" imports.
func processCSV(rc io.Reader) (ch chan []string) {
	ch = make(chan []string, 10)
	go func() {
		r := csv.NewReader(rc)
		if _, err := r.Read(); err != nil { // read header
			log.Fatal(err)
		}
		defer close(ch)
		for {
			rec, err := r.Read()
			if err != nil {
				if err == io.EOF {
					break
				}
				log.Fatal(err)
			}
			ch <- rec
		}
	}()
	return
}

Note: this is roughly based on DaveC's comment.
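One way the channel returned by processCSV might be consumed is sketched below. This is only an illustration under stated assumptions: calculateStuff here is a hypothetical stand-in for the real external function, scoreAll is a made-up helper name, and the data is read from a string rather than a file so the example runs on its own.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"strconv"
	"strings"
)

// calculateStuff is a hypothetical stand-in for the external scoring function.
func calculateStuff(v float64) float64 { return v * 2 }

// processCSV is the function from the answer: it discards the header and
// streams the remaining records over a buffered channel.
func processCSV(rc io.Reader) (ch chan []string) {
	ch = make(chan []string, 10)
	go func() {
		r := csv.NewReader(rc)
		if _, err := r.Read(); err != nil { // read header
			log.Fatal(err)
		}
		defer close(ch)
		for {
			rec, err := r.Read()
			if err != nil {
				if err == io.EOF {
					break
				}
				log.Fatal(err)
			}
			ch <- rec
		}
	}()
	return
}

// scoreAll (hypothetical helper) drains the channel, appends a score column
// to each record, and returns the resulting CSV text.
func scoreAll(input string) (string, error) {
	var out strings.Builder
	w := csv.NewWriter(&out)
	// processCSV drops the header, so the output header is written by hand.
	if err := w.Write([]string{"time", "value", "score"}); err != nil {
		return "", err
	}
	var firstErr error
	for rec := range processCSV(strings.NewReader(input)) {
		if firstErr != nil {
			continue // keep draining so the producer goroutine can exit
		}
		if len(rec) < 2 {
			firstErr = fmt.Errorf("record %v has fewer than 2 fields", rec)
			continue
		}
		v, err := strconv.ParseFloat(rec[1], 64)
		if err != nil {
			firstErr = err
			continue
		}
		rec = append(rec, strconv.FormatFloat(calculateStuff(v), 'f', 8, 64))
		firstErr = w.Write(rec)
	}
	if firstErr != nil {
		return "", firstErr
	}
	w.Flush() // flush once, after all records are written
	return out.String(), w.Error()
}

func main() {
	out, err := scoreAll("time,value\n2015-08-15T00:00:00Z,1.5\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

With a real file you would pass the *os.File to processCSV instead of a strings.Reader. The channel lets reading overlap with the slow scoring step, but it does not make the scoring itself any faster.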

Answer 2

Score: 7

This is essentially Dave C's answer from the comments section:

package main

import (
	"encoding/csv"
	"io"
	"log"
	"os"
	"strconv"
)

func main() {
	// setup reader
	csvIn, err := os.Open("./path/to/datafile.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer csvIn.Close()
	r := csv.NewReader(csvIn)

	// setup writer
	csvOut, err := os.Create("./where/to/write/resultsfile.csv")
	if err != nil {
		log.Fatal("Unable to open output")
	}
	defer csvOut.Close()
	w := csv.NewWriter(csvOut)

	// handle header
	rec, err := r.Read()
	if err != nil {
		log.Fatal(err)
	}
	rec = append(rec, "score")
	if err = w.Write(rec); err != nil {
		log.Fatal(err)
	}

	for {
		rec, err = r.Read()
		if err != nil {
			if err == io.EOF {
				break
			}
			log.Fatal(err)
		}

		// get float value
		value := rec[1]
		floatValue, err := strconv.ParseFloat(value, 64)
		if err != nil {
			log.Fatalf("Record, error: %v, %v", value, err)
		}

		// calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
		score := calculateStuff(floatValue)

		scoreString := strconv.FormatFloat(score, 'f', 8, 64)
		rec = append(rec, scoreString)

		if err = w.Write(rec); err != nil {
			log.Fatal(err)
		}
	}

	// flush once after the loop rather than after every record,
	// then check for any error from the buffered writes
	w.Flush()
	if err := w.Error(); err != nil {
		log.Fatal(err)
	}
}

Note that the logic is all jammed into main(); it would be better to split it into several functions, but that's beyond the scope of this question.
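For what it's worth, here is one possible way to pull the loop out of main(). It is a sketch rather than the answerer's code: scoreCSV and run are hypothetical names, and the scoring function is passed in as a parameter so that a dummy can stand in for calculateStuff.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// scoreCSV (hypothetical name) streams records from in to out, appending a
// score column computed from the second field of every data row.
func scoreCSV(in io.Reader, out io.Writer, score func(float64) float64) error {
	r := csv.NewReader(in)
	w := csv.NewWriter(out)

	// handle header
	header, err := r.Read()
	if err != nil {
		return fmt.Errorf("reading header: %w", err)
	}
	if err := w.Write(append(header, "score")); err != nil {
		return err
	}

	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		if len(rec) < 2 {
			return fmt.Errorf("record %v has fewer than 2 fields", rec)
		}
		v, err := strconv.ParseFloat(rec[1], 64)
		if err != nil {
			return fmt.Errorf("record %v: %w", rec, err)
		}
		rec = append(rec, strconv.FormatFloat(score(v), 'f', 8, 64))
		if err := w.Write(rec); err != nil {
			return err
		}
	}

	// flush the buffered writer once and report any deferred write error
	w.Flush()
	return w.Error()
}

// run is a small helper for trying scoreCSV on in-memory data; the scoring
// function used here (v + 1) is a placeholder, not the real calculateStuff.
func run(input string) (string, error) {
	var sb strings.Builder
	err := scoreCSV(strings.NewReader(input), &sb, func(v float64) float64 { return v + 1 })
	return sb.String(), err
}

func main() {
	out, err := run("time,value\nt1,2.5\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Taking io.Reader and io.Writer instead of file paths means main() only has to open the two files and call scoreCSV(csvIn, csvOut, calculateStuff), and the function can be exercised without touching the disk.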

Answer 3

Score: 1

encoding/csv is indeed very slow on big files, as it performs a lot of allocations. Since your format is so simple, I recommend using strings.Split instead, which is much faster.
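A rough sketch of the strings.Split approach, reading line by line with bufio.Scanner. It assumes the simple format from the question (no quoted fields, no commas or newlines inside a field), and splitScore, runSplit, and the doubling score function are placeholders made up for illustration. Whether it is actually faster for your data is something to measure.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// splitScore (hypothetical name) copies each "time,value" line to out with a
// score appended. It does NOT handle quoted CSV fields.
func splitScore(in io.Reader, out io.Writer, score func(float64) float64) error {
	sc := bufio.NewScanner(in)
	bw := bufio.NewWriter(out)
	first := true
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue // skip blank lines
		}
		if first {
			// header row: pass it through with the new column name
			first = false
			bw.WriteString(line)
			bw.WriteString(",score\n")
			continue
		}
		fields := strings.Split(line, ",")
		if len(fields) < 2 {
			return fmt.Errorf("malformed line %q", line)
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return fmt.Errorf("line %q: %w", line, err)
		}
		bw.WriteString(line)
		bw.WriteByte(',')
		bw.WriteString(strconv.FormatFloat(score(v), 'f', 8, 64))
		bw.WriteByte('\n')
	}
	if err := sc.Err(); err != nil {
		return err
	}
	// bufio.Writer remembers the first write error and returns it from Flush
	return bw.Flush()
}

// runSplit is a helper for trying splitScore on in-memory data with a
// placeholder score function (v * 2).
func runSplit(input string) (string, error) {
	var sb strings.Builder
	err := splitScore(strings.NewReader(input), &sb, func(v float64) float64 { return v * 2 })
	return sb.String(), err
}

func main() {
	out, err := runSplit("time,value\nt1,2.5\nt2,4\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```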

If even that is not fast enough you can consider implementing the parsing yourself using strings.IndexByte which is implemented in assembly: http://golang.org/src/strings/strings_decl.go?s=274:310#L1
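As an illustration of the strings.IndexByte idea, the sketch below extracts the second field of a line without building a slice of all fields. secondField is a hypothetical helper, and like the Split approach it assumes plain comma-separated lines with no quoting.

```go
package main

import (
	"fmt"
	"strings"
)

// secondField (hypothetical helper) returns the text between the first comma
// and the second comma (or the end of the line). The boolean is false when
// the line contains no comma at all.
func secondField(line string) (string, bool) {
	i := strings.IndexByte(line, ',')
	if i < 0 {
		return "", false
	}
	rest := line[i+1:]
	if j := strings.IndexByte(rest, ','); j >= 0 {
		rest = rest[:j]
	}
	return rest, true
}

func main() {
	for _, line := range []string{"t1,2.5", "t2,4.0,extra", "nocomma"} {
		field, ok := secondField(line)
		fmt.Printf("%q -> %q (found: %v)\n", line, field, ok)
	}
}
```

The returned string is a substring of the input, so no copy of the field is made; it can be handed straight to strconv.ParseFloat.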

Having said that, you should also reconsider using ReadAll if the file is larger than your memory.

huangapple
  • Posted on August 16, 2015 01:53:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/32027590.html