what can create huge overhead of goroutines?

Question

For an assignment we are using Go, and one of the tasks is to parse a UniProt database file line by line to collect UniProt records.

I prefer not to share too much code, but I have a working snippet that parses such a file (2.5 GB) correctly in 48 s (measured with the time package). It parses the file iteratively, appending lines to a record until a record-end signal is reached (a full record), at which point the record's metadata is created. The record string is then cleared and a new record is collected line by line. After that I wanted to try goroutines.

I had already gotten some tips on Stack Overflow, so to the original code I simply added a function that handles the metadata creation.

So the code does the following:

  1. create an empty record,
  2. iterate over the file and append lines to the record,
  3. if a record stop signal is found (we now have a full record), hand it to a goroutine that creates the metadata,
  4. clear the record string and continue from step 2.

I also added a sync.WaitGroup() to make sure I wait at the end for every goroutine to finish. I thought this would actually lower the time spent parsing the database file, since it keeps parsing while the goroutines work on each record. However, the code seems to run for more than 20 minutes, which suggests that something is wrong or that the overhead has gone through the roof. Any suggestions?

package main

import (
	"bufio"
	"crypto/sha1"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
	"sync"
	"time"
)

type producer struct {
	parser uniprot
}

type unit struct {
	tag string
}

type uniprot struct {
	filenames     []string
	recordUnits   chan unit
	recordStrings map[string]string
}

func main() {
	p := producer{parser: uniprot{}}
	p.parser.recordUnits = make(chan unit, 1000000)
	p.parser.recordStrings = make(map[string]string)
	p.parser.collectRecords(os.Args[1])
}

func (u *uniprot) collectRecords(name string) {
	fmt.Println("file to open ", name)
	t0 := time.Now()
	wg := new(sync.WaitGroup)
	record := []string{}
	file, err := os.Open(name)
	errorCheck(err)
	scanner := bufio.NewScanner(file)
	for scanner.Scan() { //Scan the file
		retText := scanner.Text()
		if strings.HasPrefix(retText, "//") {
			wg.Add(1)
			go u.handleRecord(record, wg)
			record = []string{}
		} else {
			record = append(record, retText)
		}
	}
	file.Close()
	wg.Wait()
	t1 := time.Now()
	fmt.Println(t1.Sub(t0))
}

func (u *uniprot) handleRecord(record []string, wg *sync.WaitGroup) {
	defer wg.Done()
	recString := strings.Join(record, "\n")
	t := hashfunc(recString)
	u.recordUnits <- unit{tag: t}
	u.recordStrings[t] = recString
}

func hashfunc(record string) (hashtag string) {
	hash := sha1.New()
	io.WriteString(hash, record)
	hashtag = string(hash.Sum(nil))
	return
}

func errorCheck(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

Answer 1

Score: 3

First of all, your code is not thread-safe, mainly because you access a hash map concurrently. Maps in Go are not safe for concurrent use and need to be locked. The faulty line in your code is:

u.recordStrings[t] = recString

That line will blow up when you run Go with GOMAXPROCS > 1, so I assume you are not doing that. Make sure you run your application with GOMAXPROCS=2 or higher to get any parallelism: the default is 1, so your code runs on a single OS thread, which of course cannot be scheduled onto two CPUs or CPU cores at the same time. Example:

$ GOMAXPROCS=2 go run udb.go uniprot_sprot_viruses.dat
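
To make the map write safe, one option is to guard it with a sync.Mutex; below is a minimal sketch against the uniprot type from the question (having a single goroutine own the map and receive the records over a channel would work just as well):

type uniprot struct {
	filenames     []string
	recordUnits   chan unit
	recordStrings map[string]string
	mu            sync.Mutex // protects recordStrings
}

func (u *uniprot) handleRecord(record []string, wg *sync.WaitGroup) {
	defer wg.Done()
	recString := strings.Join(record, "\n")
	t := hashfunc(recString)
	u.recordUnits <- unit{tag: t}
	u.mu.Lock()
	u.recordStrings[t] = recString // only one goroutine writes to the map at a time
	u.mu.Unlock()
}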

Finally, pull the values from the channel, otherwise your program will not terminate: you create a deadlock as soon as the number of pending records exceeds the channel's capacity. I tested with a 76 MiB data file and got 16347 entries; you said your file is about 2.5 GB. Assuming linear growth, your file will exceed the 1e6 slots in the channel, so the sends block, the waiting goroutines pile up without ever finishing, wg.Wait() never returns, and the program deadlocks with no result.

So the solution is to add a goroutine that pulls the values from the channel and does something with them, as sketched below.
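
For example, a minimal sketch of what main could look like with such a consumer, reusing the producer/uniprot/unit types from the question (the smaller buffer size and the body of the consumer loop are placeholders):

func main() {
	p := producer{parser: uniprot{}}
	p.parser.recordUnits = make(chan unit, 1024) // a modest buffer is enough once someone receives
	p.parser.recordStrings = make(map[string]string)

	done := make(chan struct{})
	go func() {
		for u := range p.parser.recordUnits {
			_ = u.tag // consume each record tag here (print it, count it, store it, ...)
		}
		close(done)
	}()

	p.parser.collectRecords(os.Args[1]) // returns after wg.Wait(), so all sends have happened
	close(p.parser.recordUnits)         // no more sends; ends the consumer's range loop
	<-done                              // wait for the consumer to drain the channel
}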

As a side note: if you care about performance, do not use strings here, as they are always copied. Use []byte instead.
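
For the hashing step, for instance, sha1.Sum works directly on a byte slice, so hashfunc could look like this (a sketch, assuming the record is collected as bytes in the first place):

func hashfunc(record []byte) string {
	sum := sha1.Sum(record) // [20]byte; no intermediate hash object or string copy
	return string(sum[:])
}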

