Code Hitting Some Kind of Hardware Performance Bottleneck With Different Struct Layouts of Same Size

Question
I am analyzing the following code for performance:
template <int padding1, int padding2>
struct complex_t {
    float re;
    int p1[padding1];
    double im;
    int p2[padding2];
};
For the experiment, I am using values for padding1 and padding2 such that sizeof(complex_t) is always 64. I change the offset of member im using padding1. I use two randomly generated arrays of complex_t, each with 10K elements. Next, I perform a pairwise multiplication between the two arrays and measure the runtime and the number of executed instructions. Here is the multiplication code:
template <typename Complex>
void multiply(Complex* result, Complex* a, Complex* b, int n) {
    for (int i = 0; i < n; ++i) {
        result[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        result[i].im = a[i].re * b[i].im + a[i].im + b[i].re;
    }
}
And here are the measured results (5 runs, Intel(R) Core(TM) i5-10210U CPU, compiler Clang 15.0.7, flags -O3):

offsetof(im) | Runtime (MIN, AVG, MAX) in sec | Instructions AVG
---|---|---
8 bytes | 0.107, 0.112, 0.116 | 175027800 |
16 bytes | 0.088, 0.088, 0.088 | 175027200 |
24 bytes | 0.088, 0.088, 0.088 | 175027100 |
32 bytes | 0.088, 0.088, 0.088 | 175027100 |
40 bytes | 0.088, 0.088, 0.088 | 175027100 |
48 bytes | 0.085, 0.085, 0.086 | 175027100 |
As you can see, the instruction count is roughly the same. Yet the first sample, with the smallest offset, is the slowest. Something weird is going on, hitting some kind of hardware bottleneck, but I don't understand what the problem is because I am missing a mental model of how data caches work at this low level. Can someone give me some ideas on what to look for or what to measure?
UPDATE: The counter MEM_LOAD_RETIRED_L3_HIT is unusually high for the smallest offset: 5097404 vs 2775653 (16), 3015093 (24), 3277559 (32), 3261758 (40) and 3445190 (48).
Answer 1
Score: 1
You could try using uiCA, an online code analyzer for Intel microarchitectures: https://uica.uops.info/. It collects different statistics and shows you what the bottleneck in your program is. Just paste in your assembly code and hit the Run button.
Here is an example from the website (not your code):
Throughput (in cycles per iteration): 4.00
Bottleneck: Dependencies
The following throughputs could be achieved if the given property were the only bottleneck:
- DSB: 1.00
- Issue: 1.50
- Ports: 1.50
- Dependencies: 4.00
Comments