Should we do nested goroutines?
Question
I'm trying to build a parser for a large number of files, and I can't find information about what might possibly be called "nested goroutines" (maybe this is not the right name?).
Given a lot of files, each of them having a lot of lines, should I do:
for file in folder:
    go do1

def do1:
    for line in file:
        go do2

def do2:
    do_something
Or should I use only "one level" of goroutines, and do the following:
for file in folder:
    for line in file:
        go do_something
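For concreteness, the nested version might look roughly like this in Go; the in-memory folder slice and doSomething are placeholders standing in for the real files and parsing, not part of the original question:

package main

import (
	"fmt"
	"sync"
)

// doSomething stands in for the per-line parsing work.
func doSomething(line string) {
	_ = fmt.Sprintf("parsed: %s", line)
}

func main() {
	// folder is a placeholder for the files and their lines.
	folder := [][]string{
		{"line 1", "line 2"},
		{"line 3", "line 4"},
	}

	var wg sync.WaitGroup
	for _, file := range folder {
		wg.Add(1)
		go func(file []string) { // one goroutine per file ("do1")
			defer wg.Done()
			for _, line := range file {
				wg.Add(1)
				go func(line string) { // one goroutine per line ("do2")
					defer wg.Done()
					doSomething(line)
				}(line)
			}
		}(file)
	}
	wg.Wait()
}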
My question primarily targets performance issues.
Thanks for reading this far!
Answer 1
Score: 7
If you go through with the architecture you've specified, you have a good chance of running out of CPU, memory, etc., because you're going to be creating an arbitrary number of workers. I suggest instead going with an architecture that lets you throttle via channels. For example:
In your main process feed the files into a channel:
for _, file := range folder {
    fileChan <- file
}
Then, in another goroutine, break the files into lines and feed those into a channel:
for {
    select {
    case file := <-fileChan:
        // split the file into lines and feed them downstream
        for _, line := range file {
            lineChan <- line
        }
    }
}
Then, in a third goroutine, pop the lines off and do what you will with them:
for {
    select {
    case line := <-lineChan:
        // process the line
    }
}
The main advantage of this approach is that you can create as many or as few goroutines as your system can handle and pass them all the same channels; whichever goroutine reaches a channel first handles that item, so you're able to throttle the amount of resources you're using.
Here is a working example: http://play.golang.org/p/-Qjd0sTtyP
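For reference, here is a minimal, self-contained sketch of the same pipeline with a bounded worker pool; the worker count, file names, and per-line work are illustrative assumptions, not part of the original answer:

package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
)

func main() {
	fileChan := make(chan string)
	lineChan := make(chan string)

	// A fixed pool of workers drains lineChan, which bounds resource usage.
	const numWorkers = 4
	var workers sync.WaitGroup
	workers.Add(numWorkers)
	for i := 0; i < numWorkers; i++ {
		go func() {
			defer workers.Done()
			for line := range lineChan {
				// Process the line; printing its length stands in for real work.
				fmt.Println(len(line))
			}
		}()
	}

	// One goroutine splits each file into lines and feeds lineChan.
	var splitter sync.WaitGroup
	splitter.Add(1)
	go func() {
		defer splitter.Done()
		for name := range fileChan {
			f, err := os.Open(name)
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				continue
			}
			scanner := bufio.NewScanner(f)
			for scanner.Scan() {
				lineChan <- scanner.Text()
			}
			f.Close()
		}
	}()

	// The main goroutine feeds file names into fileChan.
	for _, name := range []string{"a.txt", "b.txt"} { // hypothetical file names
		fileChan <- name
	}
	close(fileChan)

	splitter.Wait() // no more lines will be produced
	close(lineChan)
	workers.Wait() // wait for every line to be processed
}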
Answer 2
Score: 2
The answer depends on how processor-intensive the operation on each line is.
If the line operation is short-lived, definitely don't bother to spawn a goroutine for each line.
If it's expensive (think ~5 seconds or more), proceed with caution. You may run out of memory. As of Go 1.4, spawning a goroutine allocates a 2048-byte stack. For 2 million lines, you could allocate over 2 GB of RAM for the goroutine stacks alone. Consider whether it's worth allocating this memory.
In short, you will probably get the best results with the following setup:
for file in folder:
    go process_file(file)
If the number of files exceeds the number of CPUs, you're likely to have enough concurrency to mask the disk I/O latency involved in reading the files from disk.
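A rough sketch of that setup, assuming a hypothetical processFile helper and placeholder file names:

package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
)

// processFile is a hypothetical helper that reads and parses a single file.
func processFile(name string) {
	f, err := os.Open(name)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Parse the line here; taking its length stands in for real work.
		_ = len(scanner.Text())
	}
}

func main() {
	files := []string{"a.txt", "b.txt", "c.txt"} // placeholder folder contents

	var wg sync.WaitGroup
	for _, name := range files {
		wg.Add(1)
		go func(name string) { // one goroutine per file
			defer wg.Done()
			processFile(name)
		}(name)
	}
	wg.Wait()
}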