Parallel execution of prime finding algorithm slows runtime
Question
So I implemented the following prime finding algorithm in Go.

1. primes = []
2. Assume all numbers are prime (vacuously true)
3. check = 2
4. If check is still assumed to be prime, append it to primes
5. Multiply check by each prime less than or equal to its minimum factor and eliminate the results from the assumed primes
6. Increment check by 1 and repeat steps 4 through 6 until check > limit
Here is my serial implementation:
```go
package main

import (
	"fmt"
	"time"
)

type numWithMinFactor struct {
	number    int
	minfactor int
}

func pow(base int, power int) int {
	result := 1
	for i := 0; i < power; i++ {
		result *= base
	}
	return result
}

func process(check numWithMinFactor, primes []int, top int, minFactors []numWithMinFactor) {
	var n int
	for i := 0; primes[i] <= check.minfactor; i++ {
		n = check.number * primes[i]
		if n > top {
			break
		}
		minFactors[n] = numWithMinFactor{n, primes[i]}
		if i+1 == len(primes) {
			break
		}
	}
}

func findPrimes(top int) []int {
	primes := []int{}
	minFactors := make([]numWithMinFactor, top+2)
	check := 2
	for power := 1; check <= top; power++ {
		if minFactors[check].number == 0 {
			primes = append(primes, check)
			minFactors[check] = numWithMinFactor{check, check}
		}
		process(minFactors[check], primes, top, minFactors)
		check++
	}
	return primes
}

func main() {
	fmt.Println("Welcome to prime finder!")
	start := time.Now()
	fmt.Println(findPrimes(1000000))
	elapsed := time.Since(start)
	fmt.Printf("Finding primes took %s\n", elapsed)
}
```
This runs great, producing all the primes < 1,000,000 in about 63ms (mostly spent printing) and the primes < 10,000,000 in 600ms on my PC. Now, I figure that no number check with 2^n < check <= 2^(n+1) has a factor greater than 2^n, so once I have the primes up to 2^n I can do all the multiplications and eliminations for every check in that range in parallel. My parallel implementation is as follows:
```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type numWithMinFactor struct {
	number    int
	minfactor int
}

func pow(base int, power int) int {
	result := 1
	for i := 0; i < power; i++ {
		result *= base
	}
	return result
}

func process(check numWithMinFactor, primes []int, top int, minFactors []numWithMinFactor, wg *sync.WaitGroup) {
	defer wg.Done()
	var n int
	for i := 0; primes[i] <= check.minfactor; i++ {
		n = check.number * primes[i]
		if n > top {
			break
		}
		minFactors[n] = numWithMinFactor{n, primes[i]}
		if i+1 == len(primes) {
			break
		}
	}
}

func findPrimes(top int) []int {
	primes := []int{}
	minFactors := make([]numWithMinFactor, top+2)
	check := 2
	var wg sync.WaitGroup
	for power := 1; check <= top; power++ {
		for check <= pow(2, power) {
			if minFactors[check].number == 0 {
				primes = append(primes, check)
				minFactors[check] = numWithMinFactor{check, check}
			}
			wg.Add(1)
			go process(minFactors[check], primes, top, minFactors, &wg)
			check++
			if check > top {
				break
			}
		}
		wg.Wait()
	}
	return primes
}

func main() {
	fmt.Println("Welcome to prime finder!")
	start := time.Now()
	fmt.Println(findPrimes(1000000))
	elapsed := time.Since(start)
	fmt.Printf("Finding primes took %s\n", elapsed)
}
```
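The claim that drives this parallel split, that no number in (2^n, 2^(n+1)] has a smallest factor greater than 2^n, follows from the fact that a composite's smallest factor is at most its square root. It can be spot-checked with a small standalone program (`smallestFactor` here is a hypothetical trial-division helper, not part of the code above):

```go
package main

import "fmt"

// smallestFactor returns the least factor (> 1) of n, for n >= 2,
// by trial division; returns n itself when n is prime.
func smallestFactor(n int) int {
	for f := 2; f*f <= n; f++ {
		if n%f == 0 {
			return f
		}
	}
	return n
}

func main() {
	// Verify that every composite in (2^p, 2^(p+1)] has its
	// smallest factor <= 2^p, for a few small powers.
	for p := 1; p <= 10; p++ {
		lo, hi := 1<<p, 1<<(p+1)
		for n := lo + 1; n <= hi; n++ {
			if f := smallestFactor(n); f != n && f > lo {
				fmt.Println("counterexample:", n)
				return
			}
		}
	}
	fmt.Println("invariant holds up to 2^11")
}
```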
Unfortunately, this implementation is slower, running up to 1,000,000 in 600ms and up to 10,000,000 in 6 seconds. My intuition tells me that there is potential for parallelism to improve performance, but I clearly haven't been able to achieve that. I would greatly appreciate any input on how to improve the runtime here, or more specifically any insight as to why the parallel solution is slower.
Additionally, the parallel solution consumes more memory than the serial solution, but that is to be expected; the serial solution can grind up to 1,000,000,000 in about 22 seconds, whereas the parallel solution runs out of memory on my system (32GB RAM) going for the same target. But I'm asking about runtime here, not memory use. I could, for example, rely on the zero-value state of the minFactors array rather than a separate isPrime []bool, but I think the code is more readable as is.
I've tried passing a pointer for primes []int, but that didn't seem to make a difference. Using a channel instead of passing the minFactors array to the process function resulted in heavy memory use and much slower (roughly 10x) performance. I've rewritten this algorithm a couple of times to see if I could iron anything out, but no luck. Any insights or suggestions would be much appreciated, because I think parallelism should make this faster, not 10x slower!
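One likely contributor is granularity: each process call performs only a handful of multiplications, so the cost of creating and scheduling a goroutine dwarfs the useful work it does. The usual remedy is to hand each goroutine one large contiguous chunk. A generic sketch of that pattern, using hypothetical sumRange/parallelSum helpers rather than the sieve itself:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumRange sums f(i) for i in [lo, hi) serially.
func sumRange(lo, hi int, f func(int) int) int {
	total := 0
	for i := lo; i < hi; i++ {
		total += f(i)
	}
	return total
}

// parallelSum splits [0, n) into one chunk per CPU, so each
// goroutine does thousands of operations rather than one; the
// scheduling cost is amortized over the whole chunk.
func parallelSum(n int, f func(int) int) int {
	workers := runtime.NumCPU()
	partial := make([]int, workers) // one slot per worker: no shared writes
	chunk := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > n {
			hi = n
		}
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			partial[w] = sumRange(lo, hi, f)
		}(w, lo, hi)
	}
	wg.Wait()
	return sumRange(0, workers, func(i int) int { return partial[i] })
}

func main() {
	square := func(i int) int { return i * i }
	fmt.Println(parallelSum(1000, square) == sumRange(0, 1000, square)) // true
}
```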
Per @Volker's suggestion, I limited the number of concurrent goroutines to something less than my PC's available logical processors with the following revision; however, I am still getting runtimes that are 10x slower than the serial implementation.
```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type numWithMinFactor struct {
	number    int
	minfactor int
}

func pow(base int, power int) int {
	result := 1
	for i := 0; i < power; i++ {
		result *= base
	}
	return result
}

func process(check numWithMinFactor, primes []int, top int, minFactors []numWithMinFactor, wg *sync.WaitGroup) {
	defer wg.Done()
	var n int
	for i := 0; primes[i] <= check.minfactor; i++ {
		n = check.number * primes[i]
		if n > top {
			break
		}
		minFactors[n] = numWithMinFactor{n, primes[i]}
		if i+1 == len(primes) {
			break
		}
	}
}

func findPrimes(top int) []int {
	primes := []int{}
	minFactors := make([]numWithMinFactor, top+2)
	check := 2
	nlogicalProcessors := 20
	var wg sync.WaitGroup
	var twoPow int
	for power := 1; check <= top; power++ {
		twoPow = pow(2, power)
		for check <= twoPow {
			for nLogicalProcessorsInUse := 0; nLogicalProcessorsInUse < nlogicalProcessors; nLogicalProcessorsInUse++ {
				if minFactors[check].number == 0 {
					primes = append(primes, check)
					minFactors[check] = numWithMinFactor{check, check}
				}
				wg.Add(1)
				go process(minFactors[check], primes, top, minFactors, &wg)
				check++
				if check > top {
					break
				}
				if check > twoPow {
					break
				}
			}
			wg.Wait()
			if check > top {
				break
			}
		}
	}
	return primes
}

func main() {
	fmt.Println("Welcome to prime finder!")
	start := time.Now()
	fmt.Println(findPrimes(10000000))
	elapsed := time.Since(start)
	fmt.Printf("Finding primes took %s\n", elapsed)
}
```
tl;dr: Why is my parallel implementation slower than the serial implementation, and how do I make it faster?
Per @mh-cbon's suggestion, I created larger jobs for parallel processing, resulting in the following code.
```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func pow(base int, power int) int {
	result := 1
	for i := 0; i < power; i++ {
		result *= base
	}
	return result
}

func process(check int, primes []int, top int, minFactors []int) {
	var n int
	for i := 0; primes[i] <= minFactors[check]; i++ {
		n = check * primes[i]
		if n > top {
			break
		}
		minFactors[n] = primes[i]
		if i+1 == len(primes) {
			break
		}
	}
}

func processRange(start int, end int, primes []int, top int, minFactors []int, wg *sync.WaitGroup) {
	defer wg.Done()
	for start <= end {
		process(start, primes, top, minFactors)
		start++
	}
}

func findPrimes(top int) []int {
	primes := []int{}
	minFactors := make([]int, top+2)
	check := 2
	nlogicalProcessors := 10
	var wg sync.WaitGroup
	var twoPow int
	var start int
	var end int
	var stepSize int
	var stepsTaken int
	for power := 1; check <= top; power++ {
		twoPow = pow(2, power)
		stepsTaken = 0
		stepSize = (twoPow / 2) / nlogicalProcessors
		for check <= twoPow {
			start = check
			end = check + stepSize
			if stepSize == 0 {
				end = twoPow
			}
			if stepsTaken == nlogicalProcessors-1 {
				end = twoPow
			}
			if end > top {
				end = top
			}
			for check <= end {
				if minFactors[check] == 0 {
					primes = append(primes, check)
					minFactors[check] = check
				}
				check++
			}
			wg.Add(1)
			go processRange(start, end, primes, top, minFactors, &wg)
			if check > top {
				break
			}
			if check > twoPow {
				break
			}
			stepsTaken++
		}
		wg.Wait()
		if check > top {
			break
		}
	}
	return primes
}

func main() {
	fmt.Println("Welcome to prime finder!")
	start := time.Now()
	fmt.Println(findPrimes(1000000))
	elapsed := time.Since(start)
	fmt.Printf("Finding primes took %s\n", elapsed)
}
```
This runs at a similar speed to the serial implementation.
Answer 1
Score: -2
So I did eventually get a parallel version of the code to run slightly faster than the serial version, following the suggestions from @mh-cbon (see above). However, this implementation did not yield vast improvements over the serial implementation (50ms to 10 million, compared to 75ms serially). Considering that merely allocating and writing a []int of 10,000,000 zeros takes 25ms, I'm not disappointed by these results. As @Volker stated, "such stuff often is not limited by CPU but by memory bandwidth," which I believe is the case here.
I would still love to see any additional improvements, but I am somewhat satisfied with what I've gained here.
- Serial code running up to 2 billion: 19.4 seconds
- Parallel code running up to 2 billion: 11.1 seconds
- Initializing a []int of 2 billion zeros: 4.5 seconds
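As a further point of comparison for the memory-bandwidth observation: a standard segmented sieve of Eratosthenes (a different algorithm from the min-factor sieve above, shown only as a sketch) keeps the hot working set at a cache-friendly size by crossing off composites one segment at a time, which reduces main-memory traffic for large targets.

```go
package main

import "fmt"

// countPrimes counts primes <= top using a segmented sieve of
// Eratosthenes: composites are marked one `segment`-sized block
// at a time, so the mark array stays resident in cache.
func countPrimes(top, segment int) int {
	if top < 2 {
		return 0
	}
	// Base primes up to sqrt(top), via a plain sieve.
	root := 1
	for root*root <= top {
		root++
	}
	isComposite := make([]bool, root+1)
	base := []int{}
	for p := 2; p <= root; p++ {
		if !isComposite[p] {
			base = append(base, p)
			for m := p * p; m <= root; m += p {
				isComposite[m] = true
			}
		}
	}
	count := 0
	mark := make([]bool, segment)
	for lo := 2; lo <= top; lo += segment {
		hi := lo + segment - 1
		if hi > top {
			hi = top
		}
		for i := range mark {
			mark[i] = false
		}
		for _, p := range base {
			// First multiple of p in [lo, hi], never below p*p
			// (smaller multiples have a smaller prime factor).
			start := p * p
			if start < lo {
				start = ((lo + p - 1) / p) * p
			}
			for m := start; m <= hi; m += p {
				mark[m-lo] = true
			}
		}
		for n := lo; n <= hi; n++ {
			if !mark[n-lo] {
				count++
			}
		}
	}
	return count
}

func main() {
	fmt.Println(countPrimes(1000000, 1<<15)) // 78498
}
```

The segment size (here 2^15) is a tunable assumption; anything on the order of the L1/L2 cache tends to work well.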