Using Golang to read a CSV, reorder its columns, then write the result to a new CSV with concurrency

Question

Here's my starting point.

It is a Golang script to read in a csv with 3 columns, re-order the columns and write the result to a new csv file.

package main

import (
	"encoding/csv"
	"fmt"
	"math/rand"
	"os"
	"time"
)

func main() {
	start_time := time.Now()

	// Loading csv file
	rFile, err := os.Open("data/small.csv") // 3 columns
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	defer rFile.Close()

	// Creating csv reader
	reader := csv.NewReader(rFile)
	lines, err := reader.ReadAll()
	if err != nil { // ReadAll returns a nil error at end of file, never io.EOF
		fmt.Println("Error:", err)
		return
	}

	// Creating csv writer
	wFile, err := os.Create("data/result.csv")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	defer wFile.Close()
	writer := csv.NewWriter(wFile)

	// Read data, randomize columns and write new lines to results.csv
	rand.Seed(int64(time.Now().Nanosecond()))
	var col_index []int
	for i, line := range lines {
		if i == 0 {
			// randomize column index based on the number of columns recorded in the 1st line
			col_index = rand.Perm(len(line))
		}
		writer.Write([]string{line[col_index[0]], line[col_index[1]], line[col_index[2]]}) // 3 columns
		writer.Flush()
	}

	// print report
	fmt.Println("No. of lines: ", len(lines))
	fmt.Println("Time taken: ", time.Since(start_time))
}

Question:

  1. Is my code idiomatic for Golang?

  2. How can I add concurrency to this code?

Answer 1

Score: 1

Your code is OK. There isn't much of a case for concurrency here. But you can at least reduce memory consumption by reordering on the fly: use Read() instead of ReadAll() to avoid allocating a slice for the whole input file.

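// col_index must already be set before this loop.
// The loop ends on any Read error, including io.EOF at the end of the input.
for line, err := reader.Read(); err == nil; line, err = reader.Read() {
	if err = writer.Write([]string{line[col_index[0]], line[col_index[1]], line[col_index[2]]}); err != nil {
		fmt.Println("Error:", err)
		break
	}
	writer.Flush()
}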

Answer 2

Score: 0

Move the col_index initialisation outside the write loop:

if len(lines) > 0 {
	// randomize column index based on the number of columns recorded in the 1st line
	col_index := rand.Perm(len(lines[0]))
	newLine := make([]string, len(col_index))
	// Note: lines[1:] skips the first record; range over lines instead to keep it,
	// as the question's code does.
	for _, line := range lines[1:] {
		for from, to := range col_index {
			newLine[to] = line[from]
		}
		writer.Write(newLine)
		writer.Flush()
	}
}

To use concurrency, you must not use reader.ReadAll. Instead make a goroutine that calls reader.Read and write the output on a channel that would replace the lines array. The main goroutine would read the channel and do the shuffle and the write.

huangapple
  • Posted on January 30, 2017, 22:15:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/41938068.html