Code Hitting Some Kind of Hardware Performance Bottleneck With Different Struct Layouts of Same Size

Question
I am analyzing the following code for performance:
template <int padding1, int padding2>
struct complex_t {
    float re;
    int p1[padding1];
    double im;
    int p2[padding2];
};
For the experiment, I am using values for padding1 and padding2 such that sizeof(complex_t) is always 64. I change the offset of member im using padding1. I use two randomly generated arrays of complex_t, each with 10K elements. Next, I perform a pairwise multiplication between the two arrays and measure the runtime and the number of executed instructions. Here is the multiplication code:
template <typename Complex>
void multiply(Complex* result, Complex* a, Complex* b, int n) {
    for (int i = 0; i < n; ++i) {
        result[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        result[i].im = a[i].re * b[i].im + a[i].im + b[i].re;
    }
}
And here are the measured results (5 runs, Intel(R) Core(TM) i5-10210U CPU, compiler Clang 15.0.7, flags -O3):

offsetof(im) | Runtime (MIN, AVG, MAX) in sec | Instructions AVG
---|---|---
8 bytes | 0.107, 0.112, 0.116 | 175027800 |
16 bytes | 0.088, 0.088, 0.088 | 175027200 |
24 bytes | 0.088, 0.088, 0.088 | 175027100 |
32 bytes | 0.088, 0.088, 0.088 | 175027100 |
40 bytes | 0.088, 0.088, 0.088 | 175027100 |
48 bytes | 0.085, 0.085, 0.086 | 175027100 |
As you can see, the instruction count is roughly the same. Yet the first sample, with the smallest offset, is the slowest. Something weird is going on, hitting some kind of hardware bottleneck, but I don't understand what the problem is because I am missing a mental model of how data caches work at this low level. Can someone give me some ideas on what to look for or what to measure?
UPDATE: The counter MEM_LOAD_RETIRED_L3_HIT is unusually high for the smallest offset: 5097404 vs 2775653 (16), 3015093 (24), 3277559 (32), 3261758 (40) and 3445190 (48).
Answer 1
Score: 1
You could try using uiCA, an online code analyzer for Intel microarchitectures: https://uica.uops.info/. It collects different statistics and shows you what the bottleneck in your program is. Just paste in your assembly code and hit the Run button.
Here is an example from the website (not your code):
Throughput (in cycles per iteration): 4.00
Bottleneck: Dependencies
The following throughputs could be achieved if the given property were the only bottleneck:
- DSB: 1.00
- Issue: 1.50
- Ports: 1.50
- Dependencies: 4.00
Comments