优化Go文件读取程序

huangapple go评论82阅读模式
英文:

Optimize Go file reading program

问题

我正在尝试处理一个日志文件,每一行看起来都像这样:

flow_stats: 0.30062869162666672 gid 0 fid 1 pkts 5.0 fldur 0.30001386666666674 avgfldur 0.30001386666666674 actfl 3142 avgpps 16.665896331902879 finfl 1

我对pkts字段和fldur字段感兴趣。我有一个Python脚本,可以读取一个百万行的日志文件,为所有不同持续时间的数据包创建一个列表,对这些列表进行排序,并在大约3秒钟内计算出中位数。

我正在尝试使用Go编程语言重写这个脚本,希望它能运行得更快。

到目前为止,我感到失望。仅仅将文件读入数据结构就需要大约5.5秒的时间。所以我想知道你们中的一些了不起的人是否可以帮助我使它运行得更快。

这是我的循环:

data := make(map[int][]float32)
infile, err := os.Open("tmp/flow.tr")
defer infile.Close()
if err != nil {
  panic(err)
}
reader := bufio.NewReader(infile)

line, err := reader.ReadString('\n')
for {
  if len(line) == 0 {
    break
  }
  if err != nil && err != io.EOF {
    panic(err)
  }
  split_line := strings.Fields(line)
  num_packets, err := strconv.ParseFloat(split_line[7], 32)
  duration, err := strconv.ParseFloat(split_line[9], 32)
  data[int(num_packets)] = append(data[int(num_packets)], float32(duration))

  line, err = reader.ReadString('\n')
}

请注意,我实际上在循环中检查了err,为了简洁起见,我省略了它。google-pprof表明大部分时间都花在了strings.Fieldsstrings.FieldsFuncunicode.IsSpaceruntime.stringiter2上。

我该如何使其运行得更快?

英文:

I'm trying to process a log file, each line of which looks something like this:

flow_stats: 0.30062869162666672 gid 0 fid 1 pkts 5.0 fldur 0.30001386666666674 avgfldur 0.30001386666666674 actfl 3142 avgpps 16.665896331902879 finfl 1

I'm interested in the pkts field and the fldur field. I've got a Python script that can read in a million-line log file, create a list for each number of packets of all the different durations, sort those lists and figure out the median in about 3 seconds.

I'm playing around with the Go programming language and thought I'd rewrite this, in the hope that it would run faster.

So far, I've been disappointed. Just reading the file in to the data structure takes about 5.5 seconds. So I'm wondering if some of you wonderful people can help me make this go (hehe) faster.

Here's my loop:

data := make(map[int][]float32)
infile, err := os.Open("tmp/flow.tr")
defer infile.Close()
if err != nil {
  panic(err)
}
reader := bufio.NewReader(infile)

line, err := reader.ReadString('\n')
for {
  if len(line) == 0 {
    break
  }
  if err != nil && err != io.EOF {
    panic(err)
  }
  split_line := strings.Fields(line)
  num_packets, err := strconv.ParseFloat(split_line[7], 32)
  duration, err := strconv.ParseFloat(split_line[9], 32)
  data[int(num_packets)] = append(data[int(num_packets)], float32(duration))

  line, err = reader.ReadString('\n')
}

Note that I do actually check the errs in the loop -- I've omitted that for brevity. google-pprof indicates that a majority of the time is being spent in strings.Fields by strings.FieldsFunc, unicode.IsSpace, and runtime.stringiter2.

How can I make this run faster?

答案1

得分: 7

split_line := strings.Fields(line)

替换为

split_line := strings.SplitN(line, " ", 11)

在一个模拟你提供的格式的1M行随机生成文件上,提高了大约4倍的速度:

使用strings.Fields版本:完成时间为4.232525975秒

使用strings.SplitN版本:完成时间为1.111450755秒

一部分效率来自于在分割持续时间后能够避免解析和分割输入行,但大部分效率来自于SplitN中更简单的分割逻辑。即使分割所有字符串,所花费的时间也不比在持续时间后停止要长。使用:

split_line := strings.SplitN(line, " ", -1)

完成时间为1.554971313秒

SplitN和Fields不是相同的。Fields假设令牌由一个或多个空白字符界定,而SplitN将令牌视为由分隔符字符串界定的任何内容。如果输入的令牌之间有多个空格,split_line将包含每对空格的空令牌。

排序和计算中位数并不会增加太多时间。我将代码更改为在排序时使用float64而不是float32,这只是为了方便。以下是完整的程序:

package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
	"time"
)

// SortKeys从map[int][]float64返回一个排序后的键值列表。
func sortKeys(items map[int][]float64) []int {
	keys := make([]int, len(items))
	i := 0
	for k, _ := range items {
		keys[i] = k
		i++
	}
	sort.Ints(keys)
	return keys
}

// Median计算未排序的float64切片的中位数值。
func median(d []float64) (m float64) {
	sort.Float64s(d)
	length := len(d)
	if length%2 == 1 {
		m = d[length/2]
	} else {
		m = (d[length/2] + d[length/2-1]) / 2
	}
	return m
}

func main() {
	data := make(map[int][]float64)
	infile, err := os.Open("sample.log")
	defer infile.Close()
	if err != nil {
		panic(err)
	}
	reader := bufio.NewReaderSize(infile, 256*1024)

	s := time.Now()
	for {
		line, err := reader.ReadString('\n')
		if len(line) == 0 {
			break
		}
		if err != nil {
			panic(err)
		}
		split_line := strings.SplitN(line, " ", 11)
		num_packets, err := strconv.ParseFloat(split_line[7], 32)
		if err != nil {
			panic(err)
		}
		duration, err := strconv.ParseFloat(split_line[9], 32)
		if err != nil {
			panic(err)
		}
		pkts := int(num_packets)
		data[pkts] = append(data[pkts], duration)
	}

	for _, k := range sortKeys(data) {
		fmt.Printf("pkts: %d, median: %f\n", k, median(data[k]))
	}
	fmt.Println("\nCompleted in ", time.Since(s))
}

输出结果为:

pkts: 0, median: 0.498146
pkts: 1, median: 0.511023
pkts: 2, median: 0.501408
...
pkts: 99, median: 0.501517
pkts: 100, median: 0.491499
Completed in  1.497052072s
英文:

Replacing

split_line := strings.Fields(line)

with

split_line := strings.SplitN(line, " ", 11)

Yielded ~4x speed improvement on a 1M line randomly generated file that mimicked the format you provided above:

strings.Fields version: Completed in 4.232525975s

strings.SplitN version: Completed in 1.111450755s

Some of the efficiency comes from being able to avoid parsing and splitting the input line after the duration has be split, but most of it comes from the simpler splitting logic in SplitN. Even splitting all of the strings doesn't take much longer than stopping after the duration. Using:

split_line := strings.SplitN(line, " ", -1)

Completed in 1.554971313s

SplitN and Fields are not the same. Fields assumes tokens are bounded by 1 or more whitespace characters, where SplitN treats tokens as anything bounded by the separator string. If your input had multiple spaces between tokens, split_line would contain empty tokens for each pair of spaces.

Sorting and calculating the median does not add much time. I changed the code to use a float64 rather than a float32 as a matter of convenience when sorting. Here's the complete program:

package main
import (
"bufio"
"fmt"
"os"
"sort"
"strconv"
"strings"
"time"
)
// SortKeys returns a sorted list of key values from a map[int][]float64.
func sortKeys(items map[int][]float64) []int {
keys := make([]int, len(items))
i := 0
for k, _ := range items {
keys[i] = k
i++
}
sort.Ints(keys)
return keys
}
// Median calculates the median value of an unsorted slice of float64.
func median(d []float64) (m float64) {
sort.Float64s(d)
length := len(d)
if length%2 == 1 {
m = d[length/2]
} else {
m = (d[length/2] + d[length/2-1]) / 2
}
return m
}
func main() {
data := make(map[int][]float64)
infile, err := os.Open("sample.log")
defer infile.Close()
if err != nil {
panic(err)
}
reader := bufio.NewReaderSize(infile, 256*1024)
s := time.Now()
for {
line, err := reader.ReadString('\n')
if len(line) == 0 {
break
}
if err != nil {
panic(err)
}
split_line := strings.SplitN(line, " ", 11)
num_packets, err := strconv.ParseFloat(split_line[7], 32)
if err != nil {
panic(err)
}
duration, err := strconv.ParseFloat(split_line[9], 32)
if err != nil {
panic(err)
}
pkts := int(num_packets)
data[pkts] = append(data[pkts], duration)
}
for _, k := range sortKeys(data) {
fmt.Printf("pkts: %d, median: %f\n", k, median(data[k]))
}
fmt.Println("\nCompleted in ", time.Since(s))
}

And the output:

pkts: 0, median: 0.498146
pkts: 1, median: 0.511023
pkts: 2, median: 0.501408
...
pkts: 99, median: 0.501517
pkts: 100, median: 0.491499
Completed in  1.497052072s

huangapple
  • 本文由 发表于 2012年10月6日 08:16:11
  • 转载请务必保留本文链接:https://go.coder-hub.com/12755575.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定