OpenMP threads do not fully utilize CPU cores on new machine

Question

I have a multi-threaded application that solves a huge matrix in parallel. I recently changed laptops and started seeing some weird behavior on the new one. The old laptop has an 11th Gen Intel(R) Core(TM) i9-11950H and the new laptop has a 12th Gen Intel(R) Core(TM) i9-12900H.

When I run my multi-threaded application (using 4 threads) on the old laptop, the threads take over 4 cores and fully utilize them, and the overall CPU usage is around 50%, since the laptop has 8 physical cores. Please see the picture below:

[Screenshot: CPU usage on the old laptop]

When I run the same application on the new laptop, using exactly the same executable binary, only one core is fully utilized, the rest sit at around 10%-20%, and the overall CPU usage is under 15%. Please see the following picture:

[Screenshot: CPU usage on the new laptop]

Is there any explanation for why the same binary keeps all of its threads busy on one machine but not on the other?

Notes:

  • I am using OpenMP to start threads
  • I tried to set the priority of the threads to high but it didn't help

Note: On the laptop with the 12th Gen Intel(R) Core(TM) i9-12900H, I disabled the E-cores in the BIOS to make sure threads were only assigned to P-cores, and that didn't fix the problem. Please see below:

[Screenshot: CPU usage on the new laptop with the E-cores disabled]

From the picture above we can see that only thread 1 fully utilizes its CPU.

The following is the way I am launching my threads:

    CALL OMP_SET_NUM_THREADS(4)
    !$OMP PARALLEL DO PRIVATE(i)  
    DO i = 1, 4, 1
        CALL solve_axb_r_submat(n, A, line_A, X, B, flag, i, submatrix_number, Fkluunit(i))
    END DO
    !$OMP END PARALLEL DO

The code above is called more than 20,000 times: every iteration calls a function that contains this piece of code. I am working on Windows with Visual Studio 2022 and oneAPI 2023. The following are some of my project properties:

[Screenshots: project properties]

Please note the command line we added in the last image to determine the affinity of the threads. With it, we see that the threads are pinned on the same core and do not move between cores.

Answer 1

Score: 5

The i9-11950H is a Tiger Lake processor, while the i9-12900H is an Alder Lake processor. The main difference is that Alder Lake has a hybrid (big.LITTLE-style) architecture, while Tiger Lake is uniform (and far more mainstream). In practice, this means there are two sets of cores: performance cores, which are fast but less power-efficient, and efficient cores, which are power-efficient but slower. This architecture is attractive on notebooks because it provides a good trade-off between performance and power consumption. Efficient cores can also help improve the overall performance of the CPU in some specific cases.

The bad news is that such an architecture is poorly supported by many runtimes and applications so far. One main issue is the load imbalance caused by the different kinds of cores: a thread running on a performance core generally finishes faster than one running on an efficient core for the same workload. The faster threads must then wait for the others, so the overall computation is bound by the slow cores.

I guess this is what happens here: 1 core is used intensively, 3 others are barely used, and the rest are idle. My hypothesis is that the intensively used core is an efficient core, while the others are performance cores (waiting for the slow one).
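One way to check this hypothesis is to give every thread an identical amount of work and compare how long each one takes. The following is only a minimal sketch, not the asker's code: the program name, the `nwork` constant, and the dummy square-root loop are all assumptions made for illustration.

```fortran
program time_threads
    use omp_lib
    implicit none
    integer, parameter :: nthreads = 4
    integer, parameter :: nwork = 50000000   ! hypothetical amount of work per iteration
    double precision :: elapsed(nthreads), results(nthreads), t0, acc
    integer :: tid(nthreads), i, k

    call omp_set_num_threads(nthreads)

    ! schedule(static, 1) hands iteration i to one thread, as in the question's loop
    !$omp parallel do private(k, t0, acc) schedule(static, 1)
    do i = 1, nthreads
        t0 = omp_get_wtime()
        acc = 0.0d0
        do k = 1, nwork
            acc = acc + sqrt(dble(k))
        end do
        elapsed(i) = omp_get_wtime() - t0
        results(i) = acc                 ! stored so the loop is not optimized away
        tid(i) = omp_get_thread_num()
    end do
    !$omp end parallel do

    do i = 1, nthreads
        print '(a, i0, a, i0, a, f8.3, a)', 'iteration ', i, ' ran on thread ', &
            tid(i), ' and took ', elapsed(i), ' s'
    end do
    print '(a, es12.4)', 'checksum: ', sum(results)
    print '(a, f6.2)', 'slowest / fastest: ', maxval(elapsed) / minval(elapsed)
end program time_threads
```

It needs OpenMP enabled at compile time (`/Qopenmp` with the Intel compiler on Windows, `-fopenmp` with gfortran). A ratio well above 1 would suggest that some threads run on slower cores or are being descheduled; a ratio near 1 would point away from the load-imbalance explanation.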

You can ask OpenMP to use dynamic scheduling so that the work is load-balanced automatically between the different cores. This adds some overhead but is likely better in this case. One way is to add the schedule(dynamic) clause to the parallel loop (PARALLEL DO in Fortran, parallel for in C/C++). Another way is to tweak the environment variable OMP_SCHEDULE, which takes effect for loops declared with schedule(runtime).
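As a rough illustration of the dynamic-scheduling suggestion, the sketch below splits the work into more items than there are threads, so a thread that finishes early can pick up another item. The task count, the `do_work` routine, and its dummy loop are hypothetical and stand in for the real solver call.

```fortran
program dynamic_schedule
    use omp_lib
    implicit none
    integer, parameter :: ntasks = 16   ! hypothetical: more work items than threads
    integer :: i, owner(ntasks)

    call omp_set_num_threads(4)

    ! Each thread grabs one task at a time as soon as it becomes free
    !$omp parallel do schedule(dynamic, 1)
    do i = 1, ntasks
        call do_work(i)
        owner(i) = omp_get_thread_num()
    end do
    !$omp end parallel do

    do i = 1, ntasks
        print '(a, i0, a, i0)', 'task ', i, ' ran on thread ', owner(i)
    end do

contains

    ! Placeholder workload; locals are private to each calling thread
    subroutine do_work(task)
        integer, intent(in) :: task
        double precision :: acc
        integer :: k
        acc = 0.0d0
        do k = 1, 20000000
            acc = acc + sqrt(dble(k))
        end do
        if (acc < 0.0d0) print *, task   ! never true; keeps the loop from being removed
    end subroutine do_work

end program dynamic_schedule
```

Note that dynamic scheduling can only rebalance when there are more iterations than threads. With exactly 4 iterations on 4 threads, as in the question, each thread still ends up with at most one iteration, so the work would first have to be split into smaller pieces for this to help.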

Alternatively, you can bind the OpenMP threads yourself so that they all run on the same kind of core. You can do that by setting the environment variables OMP_PROC_BIND and OMP_PLACES. This is what the OS is supposed to do automatically, but it looks like it fails (or this is not actually the issue)...
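To see what a given combination of OMP_PLACES and OMP_PROC_BIND actually produces, a small diagnostic program can report the place each thread is bound to. This is a sketch under the assumption that the OpenMP runtime implements the OpenMP 4.5 place-query routines; the program name is made up.

```fortran
program show_places
    use omp_lib
    implicit none
    integer :: tid

    ! Number of places the runtime built from OMP_PLACES
    print '(a, i0)', 'places available: ', omp_get_num_places()

    !$omp parallel num_threads(4) private(tid)
    tid = omp_get_thread_num()
    ! The critical section keeps the lines from different threads from interleaving
    !$omp critical
    print '(a, i0, a, i0)', 'thread ', tid, ' is bound to place ', omp_get_place_num()
    !$omp end critical
    !$omp end parallel
end program show_places
```

For example, in a Windows command prompt one might run `set OMP_PLACES=cores` and `set OMP_PROC_BIND=close` before starting the program. A place number of -1 means the thread is not bound to any place. Which place numbers correspond to performance cores depends on the machine and the runtime, so that mapping has to be checked separately.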

huangapple
  • Posted on April 4, 2023 at 05:06:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75923768.html