Using Go to read a CSV file, reorder its columns, and write the result to a new CSV file with concurrency

Question

Here's my starting point.

It is a Go program that reads a CSV file with 3 columns, reorders the columns, and writes the result to a new CSV file.

    package main

    import (
        "encoding/csv"
        "fmt"
        "math/rand"
        "os"
        "time"
    )

    func main() {
        start_time := time.Now()

        // Loading csv file
        rFile, err := os.Open("data/small.csv") // 3 columns
        if err != nil {
            fmt.Println("Error:", err)
            return
        }
        defer rFile.Close()

        // Creating csv reader
        reader := csv.NewReader(rFile)
        lines, err := reader.ReadAll()
        if err != nil { // ReadAll returns a nil error at EOF, so check for any error
            fmt.Println("Error:", err)
            return
        }

        // Creating csv writer
        wFile, err := os.Create("data/result.csv")
        if err != nil {
            fmt.Println("Error:", err)
            return
        }
        defer wFile.Close()
        writer := csv.NewWriter(wFile)

        // Read data, randomize columns and write new lines to result.csv
        rand.Seed(int64(time.Now().Nanosecond()))
        var col_index []int
        for i, line := range lines {
            if i == 0 {
                // randomize column index based on the number of columns in the 1st line
                col_index = rand.Perm(len(line))
            }
            writer.Write([]string{line[col_index[0]], line[col_index[1]], line[col_index[2]]}) // 3 columns
            writer.Flush()
        }

        // print report
        fmt.Println("No. of lines: ", len(lines))
        fmt.Println("Time taken: ", time.Since(start_time))
    }

Questions:

  1. Is my code idiomatic Go?

  2. How can I add concurrency to this code?

Answer 1

Score: 1

Your code is OK. There is not much of a case for concurrency here. But you can at least reduce memory consumption by reordering on the fly: use Read() instead of ReadAll() to avoid allocating a slice for the whole input file.

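    // col_index must already hold the column order (for example, a rand.Perm result).
    // The loop stops on the first read error; io.EOF marks the normal end of input.
    for line, err := reader.Read(); err == nil; line, err = reader.Read() {
        if err = writer.Write([]string{line[col_index[0]], line[col_index[1]], line[col_index[2]]}); err != nil {
            fmt.Println("Error:", err)
            break
        }
        writer.Flush()
    }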
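
To make the Read()-based approach concrete, here is a minimal, self-contained sketch of a streaming version. The helper names reorder and reorderString, the newPerm parameter, and the in-memory sample data are assumptions made for illustration; none of them come from the original post. The sketch handles any number of columns, treats io.EOF as the normal end of input, returns other read errors, and flushes once at the end instead of after every record.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"math/rand"
	"os"
	"strings"
)

// reorder streams CSV records from r to w one at a time.
// newPerm is called once with the column count of the first record and
// must return a permutation of 0..n-1; output column k is input column perm[k].
// It returns the number of records written.
func reorder(r io.Reader, w io.Writer, newPerm func(n int) []int) (int, error) {
	reader := csv.NewReader(r)
	writer := csv.NewWriter(w)
	var perm []int
	var out []string
	count := 0
	for {
		rec, err := reader.Read()
		if err == io.EOF {
			break // normal end of input
		}
		if err != nil {
			return count, err
		}
		if perm == nil {
			// Decide the column order from the first record only.
			perm = newPerm(len(rec))
			out = make([]string, len(perm))
		}
		for k, src := range perm {
			out[k] = rec[src]
		}
		if err := writer.Write(out); err != nil {
			return count, err
		}
		count++
	}
	writer.Flush() // flush once, after all records are written
	return count, writer.Error()
}

// reorderString is a hypothetical convenience wrapper for in-memory data.
func reorderString(in string, newPerm func(n int) []int) (string, error) {
	var sb strings.Builder
	_, err := reorder(strings.NewReader(in), &sb, newPerm)
	return sb.String(), err
}

func main() {
	in := "id,name,city\n1,Ann,Oslo\n2,Bob,Rome\n" // made-up sample data
	out, err := reorderString(in, rand.Perm)
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

For the files in the question, reorder could be called with the values returned by os.Open("data/small.csv") and os.Create("data/result.csv") in place of the in-memory reader and builder.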

Answer 2

Score: 0

Move the col_index initialisation outside the write loop:

    if len(lines) > 0 {
        // randomize column index based on the number of columns in the 1st line
        col_index := rand.Perm(len(lines[0]))
        newLine := make([]string, len(col_index))
        // range over every line, including the first, as the question's code does
        for _, line := range lines {
            for from, to := range col_index {
                newLine[to] = line[from]
            }
            writer.Write(newLine)
            writer.Flush()
        }
    }
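
One detail worth noting, shown here as a small sketch (the function names gather and scatter are made up for illustration): the question's code reads output column k from input column col_index[k], while the loop above writes input column from into output column col_index[from]. Both give a valid shuffle, but for the same permutation they produce different orders, because one applies the inverse of the other.

```go
package main

import "fmt"

// gather builds the output the way the question's code does:
// output column k takes input column perm[k].
func gather(line []string, perm []int) []string {
	out := make([]string, len(perm))
	for k, src := range perm {
		out[k] = line[src]
	}
	return out
}

// scatter builds the output the way the loop in this answer does:
// input column from is placed at output column perm[from].
func scatter(line []string, perm []int) []string {
	out := make([]string, len(perm))
	for from, to := range perm {
		out[to] = line[from]
	}
	return out
}

func main() {
	line := []string{"a", "b", "c"}
	perm := []int{2, 0, 1}
	fmt.Println(gather(line, perm))  // prints [c a b]
	fmt.Println(scatter(line, perm)) // prints [b c a]
}
```

Since the permutation is random anyway, either convention works for shuffling; the difference only matters if a specific column order is required.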

To use concurrency, you must not use reader.ReadAll. Instead, start a goroutine that calls reader.Read and sends each record on a channel, which replaces the lines slice. The main goroutine then receives from the channel, shuffles the columns, and writes the output.
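
The goroutine-and-channel layout described above could look roughly like the following sketch. The names readRecords, reorderConcurrent, and reorderStringConcurrent, the newPerm parameter, the channel buffer size of 64, and the sample data are all assumptions for illustration. Because there is one reader and one writer, record order is preserved. Since the work is mostly I/O, this may well not run faster than the sequential version.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"math/rand"
	"os"
	"strings"
)

// readRecords reads CSV records from r in its own goroutine and sends them
// on the returned channel. A read error other than io.EOF is sent on errc.
// Both channels are closed when the goroutine finishes.
func readRecords(r io.Reader) (<-chan []string, <-chan error) {
	records := make(chan []string, 64) // buffer size is an arbitrary choice
	errc := make(chan error, 1)
	go func() {
		defer close(records)
		defer close(errc)
		reader := csv.NewReader(r)
		for {
			// Read returns a fresh slice per call (ReuseRecord is false),
			// so it is safe to hand each record to another goroutine.
			rec, err := reader.Read()
			if err == io.EOF {
				return
			}
			if err != nil {
				errc <- err
				return
			}
			records <- rec
		}
	}()
	return records, errc
}

// reorderConcurrent receives records from the reader goroutine, reorders the
// columns (output column k is input column perm[k]), and writes them to w.
func reorderConcurrent(r io.Reader, w io.Writer, newPerm func(n int) []int) error {
	records, errc := readRecords(r)
	writer := csv.NewWriter(w)
	var perm []int
	var out []string
	for rec := range records {
		if perm == nil {
			perm = newPerm(len(rec)) // column order decided by the first record
			out = make([]string, len(perm))
		}
		for k, src := range perm {
			out[k] = rec[src]
		}
		if err := writer.Write(out); err != nil {
			for range records {
				// Drain so the reader goroutine can finish and exit.
			}
			return err
		}
	}
	writer.Flush()
	if err := <-errc; err != nil {
		return err // read error reported by the reader goroutine
	}
	return writer.Error()
}

// reorderStringConcurrent is a hypothetical wrapper for in-memory data.
func reorderStringConcurrent(in string, newPerm func(n int) []int) (string, error) {
	var sb strings.Builder
	err := reorderConcurrent(strings.NewReader(in), &sb, newPerm)
	return sb.String(), err
}

func main() {
	in := "id,name,city\n1,Ann,Oslo\n2,Bob,Rome\n" // made-up sample data
	out, err := reorderStringConcurrent(in, rand.Perm)
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

The reader goroutine closes the records channel when it is done, which ends the range loop in the main goroutine; the error channel is checked only after that, so a failed read is not lost.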

huangapple
  • Posted on January 30, 2017 at 22:15:50
  • When reposting, please keep this link: https://go.coder-hub.com/41938068.html