Go line-by-line processing taking way too long
I'm processing some very big files, and my simple Go program to do this is taking 2 minutes to run instead of the 15 seconds it takes for the equivalent C program (https://gist.github.com/g2boojum/5729bf75a41f537b8251af25a816c2fc). Clearly I'm missing something important. (It's also my first Go program, so I'm sure the code is idiomatically poor, too.)
The files I'm processing are csv files, which look like the following, and the only issue is that they're GB in size.
BOARD;CHANNEL;TIMETAG;ENERGY;ENERGYSHORT;FLAGS
0;0;179096000;316;105;0x0
0;0;682168000;449;146;0x0
0;0;905440000;92;35;0x0
I could have used the csv module, but Scanf converts the fields to the correct types for me, and that seemed simpler. All the code does is skip the header line, and then read the rest of the file line-by-line to create a histogram of the 4th field, and then write the histogram at the end.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

const num_channels int = 4096

func main() {
	if len(os.Args) != 3 {
		fmt.Printf("Usage: %s infile outfile\n", os.Args[0])
		os.Exit(1)
	}
	fin, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	scanner := bufio.NewScanner(fin)
	scanner.Scan() // skip the header line
	fout, err := os.Create(os.Args[2])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Fprintf(fout, "channel,total\n")
	var total [num_channels]uint64
	var tmax int64
	var board, channel, energy, energyshort, flags int
	var timetag int64
	for scanner.Scan() {
		fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
		total[energy] += 1
		tmax = timetag
	}
	tmax_s := float64(tmax) / 1e12
	fmt.Println("tmax = ", tmax_s)
	for i, val := range total {
		fmt.Fprintf(fout, "%v,%v\n", i, float64(val)/tmax_s)
	}
}
Help? Thanks!
[Update, and a solution, and some weirdness]
I simplified things so I could get a better feel for what is going on with even simpler code. I removed the header line of the csv file I was using for testing, and I also created an even shorter version of the csv file that I could share, in case anybody wants a test case (https://grantgoodyear.org/files/sample60.csv).
Here's a streamlined C code:
#include <stdio.h>

int main(int argc, char* argv[]) {
	FILE* fp = fopen(argv[1], "r");
	int board, channel, energy, energyshort, flags;
	long timetag;
	double tmax = 0;
	while ((fscanf(fp, "%d;%d;%ld;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)) != EOF) {
		tmax = timetag / 1.0e12;
	}
	printf("tmax: %f\n", tmax);
	fclose(fp);
}
It processes the short and the 1.5GB files in 0.16s and 15s, respectively.
$ time cmaxt sample60.csv
tmax: 59.999983
real 0m0.160s
user 0m0.152s
sys 0m0.008s
$ time cmaxt long.csv
tmax: 7200.265511
real 0m14.730s
user 0m14.451s
sys 0m0.255s
In comparison, here's a nearly identical program in Go:
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	fin, _ := os.Open(os.Args[1])
	var tmax float64
	var board, channel, energy, energyshort, flags int
	var timetag int64
	for {
		_, err := fmt.Fscanf(fin, "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
		if err == io.EOF {
			break
		}
		tmax = float64(timetag) / 1e12
	}
	fmt.Println("tmax = ", tmax)
}
It takes an amazingly long time to run:
$ time gomaxt sample60.csv
tmax = 59.999983344
real 0m8.044s
user 0m4.677s
sys 0m3.555s
$ time gomaxt long.csv
tmax = 7200.265510652
real 18m37.472s
user 10m58.221s
sys 8m28.282s
I have no idea what's happening here, but it's taking 50-75 times longer than the C version. It's especially weird that the sys time is so long.
I changed the Go version so that it more closely resembles my original post, using a bufio.NewScanner and fmt.Sscanf to do the splitting:
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	fin, _ := os.Open(os.Args[1])
	scanner := bufio.NewScanner(fin)
	var tmax float64
	var timetag int64
	var board, channel, energy, energyshort, flags int
	for scanner.Scan() {
		fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
		tmax = float64(timetag) / 1e12
	}
	fmt.Println("tmax = ", tmax)
}
This version only(?!) takes 6 times longer than the C version:
$ time gomaxtorig sample60.csv
tmax = 59.999983344
real 0m0.895s
user 0m0.905s
sys 0m0.038s
$ time gomaxtorig long.csv
tmax = 7200.265510652
real 1m53.030s
user 2m1.039s
sys 0m3.021s
So now let's replace fmt.Sscanf with string splitting:
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	fin, _ := os.Open(os.Args[1])
	scanner := bufio.NewScanner(fin)
	var tmax float64
	var timetag int64
	for scanner.Scan() {
		ss := strings.Split(scanner.Text(), ";")
		timetag, _ = strconv.ParseInt(ss[2], 10, 64)
		tmax = float64(timetag) / 1e12
	}
	fmt.Println("tmax = ", tmax)
}
As suggested, the majority of the time was indeed being spent in fmt.Sscanf. This version takes roughly twice as long as the C version:
$ time ../gammaprocess_go/maxtsplit sample60.csv
tmax = 59.999983344
real 0m0.226s
user 0m0.243s
sys 0m0.022s
$ time ../gammaprocess_go/maxtsplit long.csv
tmax = 7200.265510652
real 0m26.434s
user 0m28.834s
sys 0m1.683s
I did write a slightly hacky version that forced string conversion of the other fields in each row of the csv file, just to see if that made any difference; the times were essentially identical to the version above.
Because I'm using Scanner.Text(), a lot of strings are created and then thrown away, and it was suggested that I use bytes instead of strings. It looks to me like that's what the csv package does, so I just used it:
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"strconv"
)

func main() {
	fin, _ := os.Open(os.Args[1])
	r := csv.NewReader(fin)
	r.Comma = ';'
	var tmax float64
	var timetag int64
	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		timetag, _ = strconv.ParseInt(rec[2], 10, 64)
		tmax = float64(timetag) / 1e12
	}
	fmt.Println("tmax = ", tmax)
}
The timing is just slightly longer than the string-splitting version with Scanner.Text():
$ time gomaxtcsv sample60.csv
tmax = 59.999983344
real 0m0.281s
user 0m0.300s
sys 0m0.019s
$ time gomaxtcsv long.csv
tmax = 7200.265510652
real 0m32.736s
user 0m35.619s
sys 0m1.702s
but presumably there's more overhead, since the csv package does a lot more than my simple example, so I'd say that comparison is inconclusive.
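For what it's worth, the bytes suggestion can also be sketched without the csv package: parse the TIMETAG field directly out of scanner.Bytes(), so no per-line string is allocated at all. The parseTimetag helper below is hypothetical (my naming, not from the thread) and assumes well-formed records:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// parseTimetag extracts the third ';'-separated field of a record and
// accumulates it as a base-10 integer directly from the byte slice, so
// no intermediate string is allocated. (Hypothetical helper for
// illustration; it assumes a well-formed record with >= 3 fields.)
func parseTimetag(b []byte) int64 {
	i := 0
	for n := 0; n < 2; i++ { // skip the BOARD and CHANNEL fields
		if b[i] == ';' {
			n++
		}
	}
	var t int64
	for ; i < len(b) && b[i] != ';'; i++ { // accumulate TIMETAG digits
		t = t*10 + int64(b[i]-'0')
	}
	return t
}

func main() {
	if len(os.Args) < 2 {
		// no input file given: demonstrate on one record from the question
		fmt.Println(parseTimetag([]byte("0;0;179096000;316;105;0x0"))) // prints 179096000
		return
	}
	fin, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer fin.Close()
	scanner := bufio.NewScanner(fin)
	var tmax float64
	for scanner.Scan() {
		tmax = float64(parseTimetag(scanner.Bytes())) / 1e12
	}
	fmt.Println("tmax = ", tmax)
}
```

Relatedly, if one stays with encoding/csv, setting r.ReuseRecord = true lets Read reuse its record slice across calls, which trims per-row allocations.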
Regardless, I'm fine with a factor of 2 inefficiency, so I'm not going to try to keep optimizing this. Many thanks to the folks who answered this question.
[Another update]
Or just look at https://groups.google.com/g/golang-nuts/c/W08rFBcHKbc/m/oIrdQcBxKa4J, from 2014.
Answer 1
Score: 1
fmt.Sscanf takes most of the time. Do:
ss := strings.Split(scanner.Text(), ";")
board, _ = strconv.Atoi(ss[0])
channel, _ = strconv.Atoi(ss[1])
timetag, _ = strconv.ParseInt(ss[2], 10, 64)
energy, _ = strconv.Atoi(ss[3])
energyshort, _ = strconv.Atoi(ss[4])
flags, _ = strconv.ParseUint(ss[5], 0, 64) // FLAGS is the sixth field; base 0 accepts the "0x" prefix
Error checking is omitted for brevity. (timetag is an int64, hence ParseInt; flags needs a uint64 variable to hold ParseUint's result.)