2023年5月18日 08:08:02go评论60阅读模式

英文:

Looking for a good way to use openMP with recursive code?

问题

I wanted to optimize recursion with openMP. So, I started from this question:
[best-way-to-parallelize-this-recursion-using-openmp][1]

While looking to optimize recursive functions I was first interested in openMP. Starting from the code below that I found there (https://en.cppreference.com/w/cpp/chrono) :

#include &lt;iostream&gt;
#include &lt;chrono&gt;
 
long fibonacci(unsigned n)
{
    if (n &lt; 2) return n;
    return fibonacci(n-1) + fibonacci(n-2);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout &lt;&lt; &quot;f(42) = &quot; &lt;&lt; fibonacci(42) &lt;&lt; &#39;\n&#39;;
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration&lt;double&gt; elapsed_seconds = end-start;
    std::cout &lt;&lt; &quot;elapsed time: &quot; &lt;&lt; elapsed_seconds.count() &lt;&lt; &quot;s\n&quot;;
}

Compiled with g++ without any optimization it gives the result:

f(42) = 267914296
elapsed time: 1.88232s

Then I used openMP similarly to the answer upside... but I had to stop the process because it took a very very long time. I do not know a lot about openMP recursion parallelism since I just use it to optimize for loops in my code... Also, I found something in gcc documentation :

attributes((const))

Then I added it to my openMP "optimized" version but all I obtained is a Core Dump !!!

So I removed my omp pragmas to check that the core dump is due to my code or something else...

Then the code became:

#include &lt;iostream&gt;
#include &lt;chrono&gt;

__attribute__ ((const))
long fibonacci(unsigned n)
{
   
   int long r[2];

   if (n &lt; 2) return n;
   r[0]=fibonacci(n-1);
   r[1]=fibonacci(n-2);
   return (r[0]+r[1]);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout &lt;&lt; &quot;f(42) = &quot; &lt;&lt; fibonacci(42) &lt;&lt; &#39;\n&#39;;
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration&lt;double&gt; elapsed_seconds = end-start;
    std::cout &lt;&lt; &quot;elapsed time: &quot; &lt;&lt; elapsed_seconds.count() &lt;&lt; &quot;s\n&quot;;
}

I compiled it with the following command:

g++ fibonacci.cpp -o fibonacci -Wall -O2 -march=native

Now it takes a lot less time to perform the Fibonacci computation of 42:

f(42) = 267914296
elapsed time: 0.00106504s

in power save mode
and in performance mode it gives

f(42) = 267914296
elapsed time: 0.000187806s

Here is the lscpu of my computer (if somebody wants to compare results):

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            9
    CPU(s) scaling MHz:  95%
    CPU max MHz:         4500,0000
    CPU min MHz:         800,0000
    BogoMIPS:            8403,00

Now, I do not know how much time it could take with openMP because I was not able to obtain a satisfactory result... I hope somebody will find a way to add omp pragmas to this kind of recursive cases.

As required, here is the omp version 1 of this case:

#include &lt;iostream&gt;
#include &lt;chrono&gt;
#include &lt;omp.h&gt;

__attribute__ ((const))
long fibonacci(unsigned n)
{
   
	int long a, b;

	if (n &lt; 2) return n;
	#pragma omp parallel
	#pragma omp single nowait
	{         
       #pragma omp task shared(a)
	   a=fibonacci(n-1);
       #pragma omp task shared(b)
       b=fibonacci(n-2);
	   #pragma omp taskwait
	}
    return (a+b);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout &lt;&lt; &quot;f(42) = &quot; &lt;&lt; fibonacci(42) &lt;&lt; &#39;\n&#39;;
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration&lt;double&gt; elapsed_seconds = end-start;
    std::cout &lt;&lt; &quot;elapsed time: &quot; &lt;&lt; elapsed_seconds.count() &lt;&lt; &quot;s\n&quot;;
}

And the version that drove me to a "silly code":

#include &lt;iostream&gt;
#include &lt;chrono&gt;
#include &lt;omp.h&gt;

__attribute__ ((const))
long fibonacci(unsigned n)
{
   
   int long r[2];
   int i;
   if (n &lt; 2) return n;
   #pragma omp parallel
   #pragma omp single
   for(i=0;i&lt;2;++i) 
   {
       #pragma omp task shared(r)   
       r[i]=fibonacci(n-i-1);
   }      
    
    return (r[0]+r[1]);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout &lt;&lt; &quot;f(42) = &quot; &lt;&lt; fibonacci(42) &lt;&lt; &#39;\n&#39;;
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration&lt;double&gt; elapsed_seconds = end-start;
    std::cout &lt;&lt; &quot;elapsed time: &quot; &lt;&lt; elapsed_seconds.count() &lt;&lt; &quot;s\n&quot;;
}

The build command for these omp versions is :

g

<details>
<summary>英文:</summary>

I wanted to optimize recursion with openMP. So, I started from this question: 
[best-way-to-parallelize-this-recursion-using-openmp][1]

While looking to optimize recursive functions I was first interested in openMP. Starting from the code below that I found there (https://en.cppreference.com/w/cpp/chrono) :

#include <iostream>
#include <chrono>

long fibonacci(unsigned n)
{
if (n < 2) return n;
return fibonacci(n-1) + fibonacci(n-2);
}

int main()
{
auto start = std::chrono::steady_clock::now();
std::cout << "f(42) = " << fibonacci(42) << '\n';
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

Compiled with g++ without any optimization it gives the result:

f(42) = 267914296
elapsed time: 1.88232s


Then I used openMP similarly to the answer upside... but I had to stop the process because it took a very very long time. I do not know a lot about openMP recursion parallelism since I just use it to optimize for loops in my code... Also, I found something in gcc documentation :

    __attributes__((const)) 

Then I added it to my openMP &quot;optimized&quot; version but all I obtained is a Core Dump !!!

So I removed my omp pragmas to check that the core dump is due to my code or something else...

Then the code became:

#include <iostream>
#include <chrono>

attribute ((const))
long fibonacci(unsigned n)
{

int long r[2];

if (n < 2) return n;
r[0]=fibonacci(n-1);
r[1]=fibonacci(n-2);
return (r[0]+r[1]);
}

I compiled it with the following command:

g++ fibonacci.cpp -o fibonacci -Wall -O2 -march=native


Now it take a lot less time to perform the fibonacci computation of 42:

f(42) = 267914296
elapsed time: 0.00106504s

in power save mode
and in performance mode it gives

f(42) = 267914296
elapsed time: 0.000187806s

Here a lscpu of my computer (if somebody want to compare results):

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
CPU family: 6
Model: 158
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 9
CPU(s) scaling MHz: 95%
CPU max MHz: 4500,0000
CPU min MHz: 800,0000
BogoMIPS: 8403,00


Now, I do not know how much time it could take with openMP because I was not able to obtain a satisfactory result... I hope somebody will find a way to add omp pragmas to this kind of recursive cases.

As required, here is the omp version 1 of this case:

#include <iostream>
#include <chrono>
#include <omp.h>

attribute ((const))
long fibonacci(unsigned n)
{

int long a, b;

if (n &lt; 2) return n;
#pragma omp parallel
#pragma omp single nowait
{         
   #pragma omp task shared(a)
   a=fibonacci(n-1);
   #pragma omp task shared(b)
   b=fibonacci(n-2);
   #pragma omp taskwait
}
return (a+b);

}


And the version that drive me to a &quot;silly code&quot;:

#include <iostream>
#include <chrono>
#include <omp.h>

attribute ((const))
long fibonacci(unsigned n)
{

int long r[2];
int i;
if (n < 2) return n;
#pragma omp parallel
#pragma omp single
for(i=0;i<2;++i)
{
#pragma omp task shared(r)
r[i]=fibonacci(n-i-1);
}

return (r[0]+r[1]);

}

The build command for these omp versions is :

g++ omp_fibonacci_for.cpp -o omp_fibonacci_for -Wall -fopenmp

Both of them seems to loop &quot;for ever&quot; (fortunately Ctrl-C worked fine).
But, while writing the end of this question, the second version with &quot;for&quot;, process successfully:

f(42) = 267914296
elapsed time: 480.953s


To be clear I&#39;m not looking for &quot;best perf&quot;, my objective is to try to obtain a omp version that process in a reasonable time with the goal of finding the way to use openMP on recursive code. Honestly I was expecting a better execution time than 480s ! I was also surprised by the time it took in the &quot;poor debug&quot; version that took less 0.001s to process...

The thing is now I&#39;d like to know if the way I use openMP is acceptable if I apply this model with &quot;havier&quot; tasks or if I do something wrong.

  [1]: https://stackoverflow.com/questions/40961392/best-way-to-parallelize-this-recursion-using-openmp

</details>


# 答案1
**得分**: 2

如果您想快速计算斐波那契数列，通常应该首先使用迭代而不是递归。

```cpp
unsigned long long fib(unsigned input) {
    unsigned long long old1 = 0;
    unsigned long long old2 = 1;

    for (int i = 1; i &lt; input; i++) {
        unsigned long long sum = old1 + old2;
        old1 = old2;
        old2 = sum;
    }
    return old2;
}

请注意，我稍微修改了此代码，以在整个过程中使用64位无符号长整型，以便我们可以测试稍大一点的数字。64位数字足够大以计算F(93)。

经过一些测试，似乎high_resolution_clock（至少在我的系统上）可以测量的最短时间间隔约为427纳秒。如果我关闭优化并计算F(93)，通常会显示运行时间为427ns，但有时为855ns。如果我打开优化以便计时有意义，通常会显示0ns，但偶尔会显示427ns。

经过一些计算，我猜测实际上更接近于30-35ns（大约1个时钟周期每次迭代，93次迭代/2.8 GHz）。在您的CPU上，可能更接近于20-25ns。

OpenMP/线程

我看到使用线程执行此任务存在两个问题。

首先，线程在您有相互独立的任务时效果最好（例如，循环中的每次迭代执行相同的操作，但每次迭代都在不同的数据上执行）。斐波那契数列正好相反，每次计算都依赖于前一次计算的结果，因此很难并行计算它们。

其次，使用线程会增加一些开销，用于创建线程、将数据传递给线程、从线程收集数据等等。要使线程发挥作用，需要执行的工作量足够多，以便每个线程执行的工作量远远超过了使其执行这些工作的开销。再次强调，与此情况几乎相反。在典型系统上，创建一个线程甚至要比上面的代码计算F(93)花费更长的时间。

尝试使用线程执行此任务就像尝试让几个人分别写一个字母的方式来写一个单词，然后将纸传给下一个人。一个人可以在传递纸的时间内完成整个单词的写作。

英文:

If you want to compute Fibonacci numbers quickly, you usually want to start by using iteration instead of recursion.

unsigned long long fib(unsigned input) {
    unsigned long long old1 = 0;
    unsigned long long old2 = 1;

    for (int i = 1; i &lt; input; i++) {
        unsigned long long sum = old1 + old2;
        old1 = old2;
        old2 = sum;
    }
    return old2;
}

Note that I've changed this a bit, to use 64-bit unsigned long long's throughout, so we can test with a little larger of numbers. 64-bit numbers are large enough to compute F(93).

Doing a bit of testing, it appears that shortest interval that high_resolution_clock (at least on my system) can measure is around 427 nanoseconds. If I compile this with optimization turned off, and compute F(93), it usually says it ran in 427ns, but sometimes in 855ns. If I turn on optimization so timing means something, it usually says 0ns, but once in a while says 427ns.

Doing a bit of computation, I'd guess it's really more like 30-35ns (roughly 1 clock per iteration, 93 iterations / 2.8 GHz). On your CPU probably closer to 20-25ns.

OpenMP/threading

I see two problems with using threading for this task.

First of all, threading works best when you have things that are independent of each other (e.g., a loop in which each iteration of a loop does the same operation, but each of them on different data). Fibonacci numbers are kind of the opposite of that--each computation depends on the result of the previous computation, so it's difficult to do them in parallel.

Second, using threads adds some overhead to create threads, pass data to them, collect data from them, and so on. For threading to work out, you need enough work to be done that each thread does quite a bit more work than the overhead to get it to do that work. Again, that's pretty nearly the opposite of this case. On a typical system, it takes a lot longer to create even one thread than it takes the code above to compute F(93).

Trying to use threading for this job is like trying to write a word on a piece of paper by several people each write part of a letter, then pass the paper to the next person. One person could write the whole word in less time that it takes to pass the piece of paper around.

答案2

得分: 1

你可以通过使用一个良好的递归算法（不包括优化部分）显著提高你的原始结果：

#include <iostream>
#include <chrono>

long fibonacci_recursive(unsigned n, unsigned result, unsigned next)
{
    if (n == 0)
    {
        return result;
    }

    return fibonacci_recursive(n - 1, next, result + next);
}

long fibonacci(unsigned n)
{
    return fibonacci_recursive(n, 0, 1);
}

int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

输出

% ./a.out
f(42) = 267914296
elapsed time: 4.8508e-05s
%

英文:

You can improve your original result by orders of magnitude by starting with a good recursive algorithm (sans optimizations):

#include &lt;iostream&gt;
#include &lt;chrono&gt;

long fibonacci_recursive(unsigned n, unsigned result, unsigned next)
{
	if (n == 0)
	{
		return result;
	}

	return fibonacci_recursive(n - 1, next, result + next);
}

long fibonacci(unsigned n)
{
	return fibonacci_recursive(n, 0, 1);
}

int main()
{
	auto start = std::chrono::steady_clock::now();
	std::cout &lt;&lt; &quot;f(42) = &quot; &lt;&lt; fibonacci(42) &lt;&lt; &#39;\n&#39;;
	auto end = std::chrono::steady_clock::now();
	std::chrono::duration&lt;double&gt; elapsed_seconds = end-start;
	std::cout &lt;&lt; &quot;elapsed time: &quot; &lt;&lt; elapsed_seconds.count() &lt;&lt; &quot;s\n&quot;;
}

OUTPUT

% ./a.out
f(42) = 267914296
elapsed time: 4.8508e-05s
%

答案3

得分: 0

Yours is a silly implementation.

The call

return fibonacci(n-1) + fibonacci(n-2);

Should be a lookup to an (un)ordered set.

英文:

Yours is a silly implementation.

The call

    return fibonacci(n-1) + fibonacci(n-2);

Should be a lookup to an (un)ordered set.

答案4

得分: 0

完善的解决方案

因此，在考虑了所有评论和答案后，这段新代码似乎是在递归代码上应用openMP pragma的更好方式：

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{
    int long a, b;

    if (n < 2) return n;
    #pragma omp task shared(a)
    a=fibonacci(n-1);
    #pragma omp task shared(b)
    b=fibonacci(n-2);
    #pragma omp taskwait

    return (a+b);
}

int main()
{
    auto start = std::chrono::steady_clock::now();
    long res;
    #pragma omp parallel
    {
        #pragma omp single
        res = fibonacci(42);
    }
    std::cout << "f(42) = " << res << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

构建命令：

g++ omp_fibonacci.cpp -o omp_fibonacci -Wall -fopenmp

输出：

f(42) = 267914296
elapsed time: 255.225s

@DanielLangr的评论尤其有用。

候选解决方案

这是我使用openMP获得的最佳结果。

并且，根据@JerryCoffin的答案和@ThomasMatthiew的评论以及@VictorEijkhout的建议...我以不同的方式考虑并推动了我到这段代码：

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{ 
    int long a, b;
    if (n < 2) return n;
    a=fibonacci(n-2);
    b=fibonacci(n-1);
    return (a+b);
}

long omp_fibonacci(unsigned n)
{
    int long a, b;
    if (n < 2) return n;
    #pragma omp task shared(a)
    a=fibonacci(n-1);
    #pragma omp task shared(b)
    b=fibonacci(n-2);
    #pragma omp taskwait

    return (a+b);
}

int main()
{
    auto start = std::chrono::steady_clock::now();
    long res;
    #pragma omp parallel 
    {
        #pragma omp single
        res = omp_fibonacci(42);
    }
    std::cout << "f(42) = " << res << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

输出：

f(42) = 267914296
elapsed time: 1.00274s

英文:

Perfectible solution

So, after taken into consideration all comments and answers, this new code seems to be a better way to apply openMP pragmas on a recursive code:

#include &lt;iostream&gt;
#include &lt;chrono&gt;
#include &lt;omp.h&gt;

__attribute__ ((const))
long fibonacci(unsigned n)
{
   
	int long a, b;

	if (n &lt; 2) return n;
    #pragma omp task shared(a)
	a=fibonacci(n-1);
    #pragma omp task shared(b)
    b=fibonacci(n-2);
	#pragma omp taskwait
	
    return (a+b);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
	long res;
	#pragma omp parallel
	{
		#pragma omp single
		res = fibonacci(42);
	}
    std::cout &lt;&lt; &quot;f(42) = &quot; &lt;&lt; res &lt;&lt; &#39;\n&#39;;
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration&lt;double&gt; elapsed_seconds = end-start;
    std::cout &lt;&lt; &quot;elapsed time: &quot; &lt;&lt; elapsed_seconds.count() &lt;&lt; &quot;s\n&quot;;
}

Build command:

g++ omp_fibonacci.cpp -o omp_fibonacci -Wall -fopenmp

Output:

f(42) = 267914296
elapsed time: 255.225s

@DanielLangr comments where particularly useful.

candidate solution

This is the best I obtained with the use of openMP.

And, with @JerryCoffin answer and @ThomasMatthiew comments and the advice of @VictorEijkhout... I though about parallelization differently and it drive me to this code:

#include &lt;iostream&gt;
#include &lt;chrono&gt;
#include &lt;omp.h&gt;

__attribute__ ((const))
long fibonacci(unsigned n)
{ 
	int long a, b;
	if (n &lt; 2) return n;
	    a=fibonacci(n-2);
        b=fibonacci(n-1);
    return (a+b);
}

long omp_fibonacci(unsigned n)
{
	int long a, b;
	if (n &lt; 2) return n;
    #pragma omp task shared(a)
    a=fibonacci(n-1);
    #pragma omp task shared(b)
    b=fibonacci(n-2);
    #pragma omp taskwait
	
    return (a+b);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
	long res;
	#pragma omp parallel 
	{
		#pragma omp single
		res = omp_fibonacci(42);
	}
    std::cout &lt;&lt; &quot;f(42) = &quot; &lt;&lt; res &lt;&lt; &#39;\n&#39;;
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration&lt;double&gt; elapsed_seconds = end-start;
    std::cout &lt;&lt; &quot;elapsed time: &quot; &lt;&lt; elapsed_seconds.count() &lt;&lt; &quot;s\n&quot;;
}

Outputs:

f(42) = 267914296
elapsed time: 1.00274s

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

寻找一种好的方法来在递归代码中使用openMP？

问题

OpenMP/线程

OpenMP/threading

答案2

答案3

答案4

完善的解决方案

候选解决方案

Perfectible solution

candidate solution

当右键单击文件夹背景并调用上下文菜单时，如何获取文件夹路径？

获取结构成员的地址/偏移量，用于序列化目的。

不同的结果计算MD5哈希在Javascript和C++中。

std::filesystem::copy_file() 为什么忽略了 create_hard_links 选项标志？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论