Golang: why using goroutines to parallelize calls ends up being slower?

Question


I have two versions of merge sort implementation. The first is a "normal" version and the second uses goroutines that parallelize the work being done on each subset of the slice in each step of the recursion.

One would assume that being able to parallelize this work would make the concurrent implementation faster: if I need to work on slice A and slice B, then working on them concurrently should be faster than doing so synchronously.

Now I'm assuming something is wrong with either my implementation or my understanding, because my concurrent version ends up being 13-14x slower than the sync version.

Can anyone point me in the right direction as to what I'm missing?

"Normal" (synchronous implementation):

// MergeSort sorts the slice s using Merge Sort Algorithm
func MergeSort(s []int) []int {
	if len(s) <= 1 {
		return s
	}

	n := len(s) / 2

	var l []int
	var r []int

	l = MergeSort(s[:n])
	r = MergeSort(s[n:])

	return merge(l, r)
}

"Concurrent" version:

// MergeSortMulti sorts the slice s using Merge Sort Algorithm
func MergeSortMulti(s []int) []int {
	if len(s) <= 1 {
		return s
	}

	n := len(s) / 2

	wg := sync.WaitGroup{}
	wg.Add(2)

	var l []int
	var r []int

	go func() {
		l = MergeSortMulti(s[:n])
		wg.Done()
	}()

	go func() {
		r = MergeSortMulti(s[n:])
		wg.Done()
	}()

	wg.Wait()
	return merge(l, r)
}

Both use the same merge function:

func merge(l, r []int) []int {
	ret := make([]int, 0, len(l)+len(r))
	for len(l) > 0 || len(r) > 0 {
		if len(l) == 0 {
			return append(ret, r...)
		}
		if len(r) == 0 {
			return append(ret, l...)
		}
		if l[0] <= r[0] {
			ret = append(ret, l[0])
			l = l[1:]
		} else {
			ret = append(ret, r[0])
			r = r[1:]
		}
	}
	return ret
}

This is my benchmarking code:

package msort

import "testing"

var a []int

func init() {
	for i := 0; i < 1000000; i++ {
		a = append(a, i)
	}
}
func BenchmarkMergeSortMulti(b *testing.B) {
	for n := 0; n < b.N; n++ {
		MergeSortMulti(a)
	}
}

func BenchmarkMergeSort(b *testing.B) {
	for n := 0; n < b.N; n++ {
		MergeSort(a)
	}
}

It reveals that the concurrent version is a lot slower than the normal synchronous version:

BenchmarkMergeSortMulti-8   	       1	1711428093 ns/op
BenchmarkMergeSort-8        	      10	 131232885 ns/op
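
As an aside, the benchmark above sorts the same, already-sorted slice on every iteration, which makes the `l[0] <= r[0]` comparison highly predictable and may hide average-case behavior. A sketch of a randomized setup (the helper name `shuffledInts` is my own, not part of the original benchmark):

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffledInts returns the integers 0..n-1 in random order, so a sort
// benchmark does not run on already-sorted input every iteration.
func shuffledInts(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = i
	}
	rand.Shuffle(len(s), func(i, j int) { s[i], s[j] = s[j], s[i] })
	return s
}

func main() {
	fmt.Println(len(shuffledInts(1000000)))
}
```

In the benchmark file, building `a` with `shuffledInts(1000000)` in `init` would exercise the average case instead of the best case.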

Answer 1

Score: 3


This is because you spawn tons of goroutines, all of which block at wg.Wait(). The scheduler has no idea which one to pick; it can keep choosing blocked ones at random until it meets one that can finally return and unblock another one. When I limited the number of concurrent calls to MergeSortMulti, it became roughly 3x faster than the synchronous version.

This code isn't beautiful, but it is a proof.

// MergeSortMulti sorts the slice s using Merge Sort Algorithm
func MergeSortMulti(s []int) []int {
	if len(s) <= 1 {
		return s
	}

	n := len(s) / 2

	wg := sync.WaitGroup{}
	wg.Add(2)

	var l []int
	var r []int

	N := len(s) // len(s) is not a constant expression in Go
	const FACTOR = 8 // ugly but easy way to limit number of goroutines

	go func() {
		if n < N/FACTOR {
			l = MergeSort(s[:n])
		} else {
			l = MergeSortMulti(s[:n])
		}
		wg.Done()
	}()

	go func() {
		if n < N/FACTOR {
			r = MergeSort(s[n:])
		} else {
			r = MergeSortMulti(s[n:])
		}
		wg.Done()
	}()

	wg.Wait()
	return merge(l, r)
}

Results will be different on your machine, but:

FACTOR = 4:

BenchmarkMergeSortMulti-8             50          33268370 ns/op
BenchmarkMergeSort-8                  20          91479573 ns/op

FACTOR = 10000:

BenchmarkMergeSortMulti-8             20          84822824 ns/op
BenchmarkMergeSort-8                  20         103068609 ns/op

FACTOR = N/4:

BenchmarkMergeSortMulti-8              3         352870855 ns/op
BenchmarkMergeSort-8                  10         129107177 ns/op

Bonus: You can also use a semaphore to limit the number of goroutines, which is a bit slower on my machine (select is used to avoid deadlock):

var sem = make(chan struct{}, 100)

// MergeSortMulti sorts the slice s using Merge Sort Algorithm
func MergeSortMulti(s []int) []int {
	if len(s) <= 1 {
		return s
	}

	n := len(s) / 2

	wg := sync.WaitGroup{}
	wg.Add(2)

	var l []int
	var r []int

	select {
	case sem <- struct{}{}:
		go func() {
			l = MergeSortMulti(s[:n])
			<-sem
			wg.Done()
		}()
	default:
		l = MergeSort(s[:n])
		wg.Done()
	}

	select {
	case sem <- struct{}{}:
		go func() {
			r = MergeSortMulti(s[n:])
			<-sem
			wg.Done()
		}()
	default:
		r = MergeSort(s[n:])
		wg.Done()
	}

	wg.Wait()
	return merge(l, r)
}

It yields:

BenchmarkMergeSortMulti-8             30          36741152 ns/op
BenchmarkMergeSort-8                  20          90843812 ns/op
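
A common variant of the same idea (a sketch under my own assumptions: the cutoff value 4096 and the function names are mine, not from the answer above) is to fall back to the sequential sort below a fixed slice length, and to sort one half on the current goroutine so each split spawns one goroutine instead of two:

```go
package main

import (
	"fmt"
	"sync"
)

// Below this length, goroutine overhead outweighs any parallel gain.
// The value is a guess to be tuned per machine, not a measured optimum.
const cutoff = 4096

// merge and mergeSortSeq implement the same algorithm as the
// question's merge and MergeSort.
func merge(l, r []int) []int {
	ret := make([]int, 0, len(l)+len(r))
	for len(l) > 0 && len(r) > 0 {
		if l[0] <= r[0] {
			ret = append(ret, l[0])
			l = l[1:]
		} else {
			ret = append(ret, r[0])
			r = r[1:]
		}
	}
	ret = append(ret, l...)
	return append(ret, r...)
}

func mergeSortSeq(s []int) []int {
	if len(s) <= 1 {
		return s
	}
	n := len(s) / 2
	return merge(mergeSortSeq(s[:n]), mergeSortSeq(s[n:]))
}

// mergeSortCutoff parallelizes only the large subproblems and sorts
// the right half on the current goroutine.
func mergeSortCutoff(s []int) []int {
	if len(s) <= cutoff {
		return mergeSortSeq(s)
	}
	n := len(s) / 2
	var l []int
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		l = mergeSortCutoff(s[:n])
	}()
	r := mergeSortCutoff(s[n:])
	wg.Wait() // wg.Wait happens-before reading l, so this is race-free
	return merge(l, r)
}

func main() {
	fmt.Println(mergeSortCutoff([]int{5, 2, 9, 1, 3})) // [1 2 3 5 9]
}
```

Reusing the calling goroutine for one half is essentially what the semaphore version's default branch does, without the channel traffic.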

Answer 2

Score: 1


Your assumption is not correct:

> One would assume that being able to parallelize this work would make
> the concurrent implementation faster: if I need to work on slice A and
> slice B, then working on them concurrently should be faster than doing
> so synchronously.

All parallel software falls under Amdahl's Law (on Wikipedia), which I could paraphrase as 'the sequential setup is not free'.

This is of course especially so when only using a single CPU core. However, it still matters even with multiple cores, for which the choreography and distribution of units of work across cores needs to be thought through if high performance is the aim. Fortunately, kopiczko's answer provides some good tips in the specific case in question.

This has been a research topic for decades: see e.g. an old (but still relevant) summary of tricks of the trade in "Practical Parallel Processing: An Introduction to Problem Solving in Parallel" by Tidmus and Chalmers.
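
To make Amdahl's Law concrete: with a parallelizable fraction p of the runtime and n cores, the best possible speedup is 1 / ((1 - p) + p/n). A quick sketch (the fractions below are illustrative, not measured from this program):

```go
package main

import "fmt"

// amdahlSpeedup returns the theoretical best-case speedup for a
// program whose parallelizable fraction is p, run on n cores.
func amdahlSpeedup(p float64, n int) float64 {
	return 1 / ((1 - p) + p/float64(n))
}

func main() {
	// Even if 90% of the work parallelizes perfectly, 8 cores give
	// well under an 8x speedup: the serial 10% dominates.
	fmt.Printf("p=0.90, n=8: %.2fx\n", amdahlSpeedup(0.90, 8)) // 4.71x
	// And no core count can ever beat 1/(1-p) = 10x.
	fmt.Printf("p=0.90, n=1000000: %.2fx\n", amdahlSpeedup(0.90, 1000000))
}
```

In the goroutine-per-split version above, scheduling overhead effectively grows the serial fraction, which is how the measured "speedup" can fall below 1.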

huangapple
  • Posted on 2017-01-02 03:13:18
  • Please keep this link when reposting: https://go.coder-hub.com/41418210.html
  • concurrency
  • go
  • multithreading
  • parallel-processing
  • performance