Go line-by-line processing taking way too long

Question

I'm processing some very big files, and my simple Go program takes 2 minutes to run, versus the 15 seconds it takes for the equivalent C program (https://gist.github.com/g2boojum/5729bf75a41f537b8251af25a816c2fc). Clearly I'm missing something important. (It's also my first Go program, so I'm sure the code is unidiomatic, too.)

The files I'm processing are csv files that look like the following; the only problem is that they're gigabytes in size.

BOARD;CHANNEL;TIMETAG;ENERGY;ENERGYSHORT;FLAGS
0;0;179096000;316;105;0x0
0;0;682168000;449;146;0x0
0;0;905440000;92;35;0x0

I could have used the encoding/csv package, but Scanf converts the fields to the correct types for me, which seemed simpler. All the code does is skip the header line, read the rest of the file line by line to build a histogram of the 4th field, and write out the histogram at the end.

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

const num_channels int = 4096

func main() {
    if len(os.Args) != 3 {
        fmt.Printf("Usage: %s infile outfile\n", os.Args[0])
        os.Exit(1)
    }
    fin, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    scanner := bufio.NewScanner(fin)
    scanner.Scan() // skip the header line
    fout, err := os.Create(os.Args[2])
    if err != nil {
        log.Fatal(err)
    }
    fmt.Fprintf(fout, "channel,total\n")
    var total [num_channels]uint64
    var tmax int64
    var board, channel, energy, energyshort, flags int
    var timetag int64
    for scanner.Scan() {
        fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
        total[energy] += 1
        tmax = timetag
    }
    tmax_s := float64(tmax) / 1e12
    fmt.Println("tmax = ", tmax_s)
    for i, val := range total {
        fmt.Fprintf(fout, "%v,%v\n", i, float64(val)/tmax_s)
    }
}

Help? Thanks!

[Update, and a solution, and some weirdness]

I simplified things so I could get a better feel for what's going on with even simpler code. I removed the header line from the csv file I was using for testing, and I also created a much shorter version of the csv file in case anybody wants a test case (https://grantgoodyear.org/files/sample60.csv).

Here's a streamlined C program:

#include <stdio.h>

int main(int argc, char* argv[]) {
  FILE* fp = fopen(argv[1], "r");
  int board, channel, energy, energyshort, flags;
  long timetag;
  double tmax = 0;
  while ((fscanf(fp, "%d;%d;%ld;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)) != EOF) {
    tmax = timetag / 1.0e12;
  }
  printf("tmax: %f\n", tmax);
  fclose(fp);
}

It processes the short file and the 1.5 GB file in 0.16 s and 15 s, respectively.

$ time cmaxt sample60.csv 
tmax: 59.999983

real	0m0.160s
user	0m0.152s
sys	0m0.008s
$ time cmaxt long.csv 
tmax: 7200.265511

real	0m14.730s
user	0m14.451s
sys	0m0.255s

For comparison, here's a nearly identical program in Go:

package main

import (
    "fmt"
    "io"
    "os"
)


func main () {
    fin, _ := os.Open(os.Args[1])
    var tmax float64
    var board, channel, energy, energyshort, flags int
    var timetag int64
    for {
        _, err := fmt.Fscanf(fin,"%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
        if err == io.EOF {
            break
        }
        tmax = float64(timetag)/1e12
    }
    fmt.Println("tmax = ", tmax)
}

It takes an astonishingly long time to run:

$ time gomaxt sample60.csv 
tmax =  59.999983344

real	0m8.044s
user	0m4.677s
sys	0m3.555s
$ time gomaxt long.csv 
tmax =  7200.265510652

real	18m37.472s
user	10m58.221s
sys	8m28.282s

I have no idea what's going on here, but it takes 50-75 times longer than the C version. It's especially odd that the sys time is so large.
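
A plausible explanation for the huge sys time (this is my reading of how fmt scanning works, not something profiled here): fmt.Fscanf can only read efficiently from an io.RuneScanner, and a bare *os.File isn't one, so the scanner falls back to very small reads against the file descriptor, roughly a syscall per byte or two. Wrapping the file in a bufio.Reader, which does implement io.RuneScanner, should avoid that. A sketch, untimed:

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

func main() {
    fin, _ := os.Open(os.Args[1])
    defer fin.Close()
    // bufio.Reader implements io.RuneScanner, so fmt can pull runes from the
    // buffer instead of issuing tiny reads against the file itself.
    br := bufio.NewReader(fin)
    var tmax float64
    var board, channel, energy, energyshort, flags int
    var timetag int64
    for {
        _, err := fmt.Fscanf(br, "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
        if err == io.EOF {
            break
        }
        tmax = float64(timetag) / 1e12
    }
    fmt.Println("tmax = ", tmax)
}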

I changed the Go version so that it more closely resembles my original post, using a bufio.NewScanner and fmt.Sscanf to do the splitting:

package main

import (
    "bufio"
    "fmt"
    "os"
)


func main () {
    fin, _ := os.Open(os.Args[1])
    scanner := bufio.NewScanner(fin)
    var tmax float64
    var timetag int64
    var board, channel, energy, energyshort, flags int
    for scanner.Scan() {
        fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
        tmax = float64(timetag)/1e12
    }
    fmt.Println("tmax = ", tmax)
}

This version only(?!) takes about 6 times as long as the C version:

$ time gomaxtorig sample60.csv 
tmax =  59.999983344

real	0m0.895s
user	0m0.905s
sys	0m0.038s
$ time gomaxtorig long.csv 
tmax =  7200.265510652

real	1m53.030s
user	2m1.039s
sys	0m3.021s

So now let's replace fmt.Sscanf with string splitting:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)


func main () {
    fin, _ := os.Open(os.Args[1])
    scanner := bufio.NewScanner(fin)
    var tmax float64
    var timetag int64
    for scanner.Scan() {
        ss := strings.Split(scanner.Text(), ";")
        timetag, _ = strconv.ParseInt(ss[2], 10, 64)
        tmax = float64(timetag)/1e12
    }
    fmt.Println("tmax = ", tmax)
}

As suggested, the majority of the time really was being spent in fmt.Sscanf. This version takes roughly twice as long as the C version:

$ time ../gammaprocess_go/maxtsplit sample60.csv 
tmax =  59.999983344

real	0m0.226s
user	0m0.243s
sys	0m0.022s
$ time ../gammaprocess_go/maxtsplit long.csv 
tmax =  7200.265510652

real	0m26.434s
user	0m28.834s
sys	0m1.683s

I did write a slightly hacky version that also converted the other fields of each csv row, just to see whether that made any difference, and the timings were essentially identical to the version above.

Because I'm using Scanner.Text(), a lot of strings are created and then thrown away, and it was suggested that I use bytes instead of strings. That looks to me like what the csv package does, so I just used it:

package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "os"
    "strconv"
)


func main () {
    fin, _ := os.Open(os.Args[1])
    r := csv.NewReader(fin)
    r.Comma = ';'

    var tmax float64
    var timetag int64
    for {
        rec, err := r.Read()
        if err == io.EOF {
            break
        }
        timetag, _ = strconv.ParseInt(rec[2], 10, 64)
        tmax = float64(timetag)/1e12
    }
    fmt.Println("tmax = ", tmax)
}

The timing is just slightly longer than the Scanner.Text() string-splitting version:

$ time gomaxtcsv sample60.csv 
tmax =  59.999983344

real	0m0.281s
user	0m0.300s
sys	0m0.019s
$ time gomaxtcsv long.csv 
tmax =  7200.265510652

real	0m32.736s
user	0m35.619s
sys	0m1.702s

but presumably there's more overhead because the csv package does a lot more than my simple example, so I'd say that's inconclusive.
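
For reference, a version that follows the "bytes instead of strings" suggestion directly, rather than going through encoding/csv, might look like the sketch below (untimed); Scanner.Bytes() hands back a slice into the scanner's own buffer, so the per-line string copy that Scanner.Text() makes goes away:

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "os"
    "strconv"
)

func main() {
    fin, _ := os.Open(os.Args[1])
    defer fin.Close()
    scanner := bufio.NewScanner(fin)
    var tmax float64
    for scanner.Scan() {
        // Scanner.Bytes() avoids the per-line copy that Scanner.Text() makes;
        // the returned slice is only valid until the next Scan call.
        fields := bytes.Split(scanner.Bytes(), []byte{';'})
        timetag, _ := strconv.ParseInt(string(fields[2]), 10, 64)
        tmax = float64(timetag) / 1e12
    }
    fmt.Println("tmax = ", tmax)
}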

Regardless, I'm fine with a factor-of-2 inefficiency, so I'm not going to keep optimizing this. Many thanks to the folks who answered this question.

[Another update]

Or just see https://groups.google.com/g/golang-nuts/c/W08rFBcHKbc/m/oIrdQcBxKa4J, from 2014.

Answer 1

Score: 1

fmt.Sscanf takes most of the time. Do:

        ss := strings.Split(scanner.Text(), ";")
        board, _ = strconv.Atoi(ss[0])
        channel, _ = strconv.Atoi(ss[1])
        timetag, _ = strconv.ParseInt(ss[2], 10, 64)
        energy, _ = strconv.Atoi(ss[3])
        energyshort, _ = strconv.Atoi(ss[4])
        flags, _ = strconv.ParseUint(ss[5], 0, 64) // FLAGS is the 6th column; base 0 accepts the 0x prefix

Error checking omitted.
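
For completeness, here is a sketch of that split-and-convert approach folded back into the original histogram program (my own assembly, not code from the answer); it parses only the two columns the program actually uses, TIMETAG and ENERGY, and the other fields can be added back with Atoi/ParseUint as above:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strconv"
    "strings"
)

const num_channels int = 4096

func main() {
    if len(os.Args) != 3 {
        fmt.Printf("Usage: %s infile outfile\n", os.Args[0])
        os.Exit(1)
    }
    fin, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer fin.Close()
    scanner := bufio.NewScanner(fin)
    scanner.Scan() // skip the header line
    fout, err := os.Create(os.Args[2])
    if err != nil {
        log.Fatal(err)
    }
    defer fout.Close()
    fmt.Fprintf(fout, "channel,total\n")
    var total [num_channels]uint64
    var tmax int64
    for scanner.Scan() {
        // Columns: BOARD;CHANNEL;TIMETAG;ENERGY;ENERGYSHORT;FLAGS
        ss := strings.Split(scanner.Text(), ";")
        timetag, _ := strconv.ParseInt(ss[2], 10, 64)
        energy, _ := strconv.Atoi(ss[3])
        total[energy] += 1
        tmax = timetag
    }
    tmax_s := float64(tmax) / 1e12
    fmt.Println("tmax = ", tmax_s)
    for i, val := range total {
        fmt.Fprintf(fout, "%v,%v\n", i, float64(val)/tmax_s)
    }
}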
