April 4, 2023, 05:06:00

OpenMP threads do not fully utilize CPU cores on new machine

Question

I have a multi-threaded application that solves a huge matrix problem in parallel.

I recently changed my laptop and started seeing some weird behavior on the new one.

The processor in the old laptop was an 11th Gen Intel(R) Core(TM) i9-11950H, and the new laptop has a 12th Gen Intel(R) Core(TM) i9-12900H.

When running my multi-threaded application (using 4 threads) on the old laptop, I see these threads take over 4 cores and fully utilize them. The overall CPU usage is around 50%, since the laptop has 8 physical cores. Please see the picture below:

When running the same application on the new laptop, using exactly the same executable binary, I see only one core fully utilized; the rest are at around 10%-20%, and the overall CPU usage is under 15%. Please see the following picture:

Is there any explanation for why the same binary fully utilizes its cores on one machine but not on the other?

Notes:

I am using OpenMP to start the threads
I tried setting the priority of the threads to high, but it didn't help

Note: On the laptop that has the 12th Gen Intel(R) Core(TM) i9-12900H, I disabled the E cores in the BIOS to make sure threads are only assigned to P cores, and that didn't fix the problem. Please see below:

From the picture above we can see that only thread 1 fully utilizes its CPU.

The following is the way I am launching my threads:

    ! Request 4 threads; each loop iteration makes one solver call.
    CALL OMP_SET_NUM_THREADS(4)
    !$OMP PARALLEL DO PRIVATE(i)
    DO i = 1, 4, 1
        CALL solve_axb_r_submat(n, A, line_A, X, B, flag, i, submatrix_number, Fkluunit(i))
    END DO
    !$OMP END PARALLEL DO

The code above is called more than 20,000 times: every outer iteration calls a function that contains this parallel loop.

I am working on Windows with Visual Studio 2022 and oneAPI 2023. The following are some of my project properties:

Please note the command line we added, shown in the last image, to control the affinity of the threads. With it, we see that the threads stay pinned to the same core and do not move between cores.


Answer 1

Score: 5

The i9-11950H is a Tiger Lake processor, while the i9-12900H is an Alder Lake processor. The main difference is that Alder Lake has a big-little (hybrid) architecture, while Tiger Lake is uniform (and far more mainstream). In practice, this means there are two sets of cores: big cores (performance cores), which are fast but less energy-efficient, and little cores (efficient cores), which are energy-efficient but slower. This architecture is attractive on notebooks because it provides a good trade-off between performance and power consumption. Efficient cores can also improve the overall performance of the CPU in some specific cases.

The bad news is that such an architecture is poorly supported by many runtimes and applications so far. One main issue is the load imbalance caused by the different kinds of cores: for the same workload, a thread running on a performance core generally finishes sooner than one running on an efficient core. The faster threads must then wait for the slower ones, so the overall computation is bound by the slow cores. I guess this is what happens here: 1 core is used intensively, 3 others are barely used, and the rest are idle. My hypothesis is that the intensively used core is an efficient core while the others are performance cores (waiting for the slow one).
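One way to check this hypothesis is to time each iteration separately. The following is only a minimal sketch, not the original code: fake_solve is a hypothetical stand-in for solve_axb_r_submat that does the same amount of work in every iteration, and time_chunks is a made-up helper name. It needs to be compiled with OpenMP enabled. If the iterations do equal work but one of them takes far longer than the others, the thread running it is probably on a slower core.

```fortran
! Sketch only: fake_solve and time_chunks are hypothetical names.
module timing_demo
    use omp_lib
    implicit none
contains
    ! Hypothetical stand-in for solve_axb_r_submat: same amount of work for every i.
    subroutine fake_solve(i, res)
        integer, intent(in) :: i
        double precision, intent(out) :: res
        integer :: k
        res = 0.0d0
        do k = 1, 20000000
            res = res + 1.0d0 / dble(k + i)
        end do
    end subroutine fake_solve

    ! Times each of the 4 iterations and records which thread ran it.
    subroutine time_chunks(elapsed, who, res)
        double precision, intent(out) :: elapsed(4), res(4)
        integer, intent(out) :: who(4)
        double precision :: t0
        integer :: i
        !$OMP PARALLEL DO PRIVATE(t0) NUM_THREADS(4)
        do i = 1, 4
            t0 = omp_get_wtime()
            call fake_solve(i, res(i))
            elapsed(i) = omp_get_wtime() - t0
            who(i) = omp_get_thread_num()
        end do
        !$OMP END PARALLEL DO
    end subroutine time_chunks
end module timing_demo

program main
    use timing_demo
    implicit none
    double precision :: elapsed(4), res(4)
    integer :: who(4), i
    call time_chunks(elapsed, who, res)
    do i = 1, 4
        print '(a,i0,a,i0,a,f8.3,a)', 'iteration ', i, ' on thread ', who(i), &
            ' took ', elapsed(i), ' s'
    end do
end program main
```

In the real application, the same two omp_get_wtime calls could be placed around the existing solver call instead of fake_solve.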

You can ask OpenMP to use dynamic scheduling so that the work is load-balanced automatically between the different cores. This adds some overhead but is likely better in this case. One way is to add the schedule(dynamic) clause to the parallel loop (PARALLEL DO in Fortran). Another way is to set the environment variable OMP_SCHEDULE, which applies to loops declared with schedule(runtime).
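Here is a minimal sketch of the schedule(dynamic) approach. It assumes the work can be split into more pieces than there are threads: with exactly 4 iterations on 4 threads, each thread still gets one iteration, so there is nothing to rebalance. The names busy_work and run_chunks are hypothetical and not part of the original code.

```fortran
! Sketch only: busy_work is a hypothetical stand-in for the real solver call.
module dynamic_demo
    implicit none
contains
    ! Hypothetical workload: cost grows with the chunk index to mimic imbalance.
    function busy_work(chunk) result(acc)
        integer, intent(in) :: chunk
        double precision :: acc
        integer :: k
        acc = 0.0d0
        do k = 1, chunk * 1000000
            acc = acc + 1.0d0 / dble(k)
        end do
    end function busy_work

    ! Runs n_chunks independent work items. With SCHEDULE(DYNAMIC, 1), a thread
    ! that finishes an item picks up the next unassigned one.
    subroutine run_chunks(n_chunks, results)
        integer, intent(in) :: n_chunks
        double precision, intent(out) :: results(n_chunks)
        integer :: i
        !$OMP PARALLEL DO SCHEDULE(DYNAMIC, 1)
        do i = 1, n_chunks
            results(i) = busy_work(i)
        end do
        !$OMP END PARALLEL DO
    end subroutine run_chunks
end module dynamic_demo

program main
    use dynamic_demo
    implicit none
    double precision :: results(16)
    call run_chunks(16, results)
    print *, 'sum of results:', sum(results)
end program main
```

Replacing SCHEDULE(DYNAMIC, 1) with SCHEDULE(RUNTIME) would let OMP_SCHEDULE select the policy without recompiling.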

Alternatively, you can bind the OpenMP threads yourself so that they all run on the same kind of core. You can do that by setting the environment variables OMP_PROC_BIND and OMP_PLACES. This is what the OS is supposed to do automatically, but it looks like it fails (or this is not actually the issue)...
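To see what the runtime actually does with OMP_PROC_BIND and OMP_PLACES, a small report program can help. This is only a sketch. It assumes an OpenMP 4.5 runtime (needed for omp_get_num_places and omp_get_place_num) and compilation with OpenMP enabled; bind_name is a hypothetical helper. For example, run it after setting OMP_PLACES=cores and OMP_PROC_BIND=close, which are both standard values.

```fortran
! Sketch only: reports the binding policy and the place of each thread.
module binding_report
    use omp_lib
    implicit none
contains
    ! Hypothetical helper: turns the binding policy constant into a printable name.
    function bind_name(policy) result(name)
        integer(kind=omp_proc_bind_kind), intent(in) :: policy
        character(len=8) :: name
        if (policy == omp_proc_bind_false) then
            name = 'false'
        else if (policy == omp_proc_bind_true) then
            name = 'true'
        else if (policy == omp_proc_bind_close) then
            name = 'close'
        else if (policy == omp_proc_bind_spread) then
            name = 'spread'
        else
            name = 'other'
        end if
    end function bind_name
end module binding_report

program main
    use omp_lib
    use binding_report
    implicit none
    print *, 'proc_bind policy: ', trim(bind_name(omp_get_proc_bind()))
    print *, 'number of places: ', omp_get_num_places()
    !$OMP PARALLEL NUM_THREADS(4)
    ! CRITICAL keeps the lines printed by different threads from interleaving.
    !$OMP CRITICAL
    print *, 'thread', omp_get_thread_num(), 'runs in place', omp_get_place_num()
    !$OMP END CRITICAL
    !$OMP END PARALLEL
end program main
```

omp_get_place_num returns -1 when the calling thread is not bound to a place, which would indicate that the binding settings are not taking effect.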

