Code Hitting Some Kind of Hardware Performance Bottleneck With Different Struct Layouts of Same Size

Question

I am analyzing the following code for performance:

template <int padding1, int padding2>
struct complex_t {
        float re;
        int p1[padding1];
        double im;
        int p2[padding2];
};

For the experiment, I choose padding1 and padding2 so that sizeof(complex_t) is always 64, and I vary the offset of the member im through padding1. I use two randomly generated arrays of complex_t, each with 10K elements. I then multiply the two arrays pairwise and measure the runtime and the number of executed instructions. Here is the multiplication code:

template <typename Complex>
void multiply(Complex* result, Complex* a, Complex* b, int n) {
    for (int i = 0; i < n; ++i) {
        result[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        result[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
    }
}
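The six layouts described above can be checked at compile time. The following is a minimal sketch, not the code used for the measurements: `layout_at` and `layout_ok` are hypothetical helper names, and the padding formulas assume a 4-byte `float` and `int` and an 8-byte, 8-byte-aligned `double`.

```cpp
#include <cassert>
#include <cstddef>

template <int padding1, int padding2>
struct complex_t {
    float re;
    int p1[padding1];
    double im;
    int p2[padding2];
};

// Hypothetical helper (not from the original post): picks padding1 and
// padding2 so that `im` lands at byte offset `im_offset` while the struct
// stays 64 bytes. Valid for offsets 8, 16, ..., 48 under the stated
// assumptions (4-byte float/int, 8-byte aligned double).
template <int im_offset>
using layout_at = complex_t<im_offset / 4 - 1, (64 - im_offset - 8) / 4>;

// True if the chosen paddings really produce the intended layout.
template <int im_offset>
constexpr bool layout_ok() {
    using C = layout_at<im_offset>;
    return sizeof(C) == 64 &&
           offsetof(C, im) == static_cast<std::size_t>(im_offset);
}

static_assert(layout_ok<8>() && layout_ok<16>() && layout_ok<24>(),
              "unexpected layout for offsets 8..24");
static_assert(layout_ok<32>() && layout_ok<40>() && layout_ok<48>(),
              "unexpected layout for offsets 32..48");
```

For example, `layout_at<8>` is `complex_t<1, 12>` and `layout_at<48>` is `complex_t<11, 2>`. If a platform uses different type sizes or alignments, the static assertions fail instead of silently producing a different layout.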

Here are the measurement results (5 runs, Intel(R) Core(TM) i5-10210U CPU, Clang 15.0.7, flags -O3):

offsetof(im)   Runtime min (s)   Runtime avg (s)   Runtime max (s)   Instructions (avg)
8 bytes        0.107             0.112             0.116             175027800
16 bytes       0.088             0.088             0.088             175027200
24 bytes       0.088             0.088             0.088             175027100
32 bytes       0.088             0.088             0.088             175027100
40 bytes       0.088             0.088             0.088             175027100
48 bytes       0.085             0.085             0.086             175027100

As you can see, the instruction count is roughly the same in every case. Yet the first sample, with the smallest offset, is the slowest. Something is hitting a hardware bottleneck, but I don't understand what it is, because I lack a mental model of how data caches work at this low level. Can someone give me some ideas on what to look for or what to measure?

UPDATE: The counter MEM_LOAD_RETIRED_L3_HIT is unusually high for the smallest offset: 5097404, versus 2775653 (offset 16), 3015093 (offset 24), 3277559 (offset 32), 3261758 (offset 40) and 3445190 (offset 48).
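One thing that might be worth checking, offered as a possible lead rather than a diagnosis, is how many cache lines each element access touches. Elements are 64 bytes, but an array is not necessarily 64-byte aligned, so `re` and `im` of one element can fall on different lines depending on the base address and on `offsetof(im)`. The sketch below assumes 64-byte cache lines; `line_of` and `lines_touched` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on current x86-64 CPUs.
constexpr std::size_t kCacheLine = 64;

// Index of the cache line that contains the byte at address `addr`.
constexpr std::uintptr_t line_of(std::uintptr_t addr) {
    return addr / kCacheLine;
}

// Number of distinct cache lines touched when reading `re` (a float at
// offset 0) and `im` (a double at `im_offset`) of an element that starts
// at address `elem_addr`.
constexpr int lines_touched(std::uintptr_t elem_addr, std::size_t im_offset) {
    const std::uintptr_t re_line  = line_of(elem_addr);
    const std::uintptr_t im_first = line_of(elem_addr + im_offset);
    const std::uintptr_t im_last  =
        line_of(elem_addr + im_offset + sizeof(double) - 1);
    int n = 1;                        // the line holding `re`
    if (im_first != re_line) ++n;     // `im` starts on a later line
    if (im_last != im_first) ++n;     // `im` itself straddles two lines
    return n;
}
```

Calling `lines_touched(reinterpret_cast<std::uintptr_t>(&a[i]), offsetof(Complex, im))` for the actual arrays, or simply printing the base addresses modulo 64, would show whether the layouts differ in this respect on a given run.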

Answer 1

Score: 1

You could try uiCA, an online throughput simulator for Intel CPU microarchitectures: https://uica.uops.info/ .

It collects several statistics and shows you what the bottleneck in your code is. Just enter your assembly code and hit the Run button.

Here is an example from the website (not your code):

Throughput (in cycles per iteration): 4.00
Bottleneck: Dependencies

The following throughputs could be achieved if the given property were the only bottleneck:

  - DSB: 1.00
  - Issue: 1.50
  - Ports: 1.50
  - Dependencies: 4.00
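To get assembly for the loop in question, one option is a small translation unit that explicitly instantiates `multiply` for a single layout, so the compiler has to emit the function. This is a sketch under assumptions: the alias `complex_off8` and the file name used below are made up for illustration, and the imaginary part is written as the usual complex product.

```cpp
#include <cassert>

template <int padding1, int padding2>
struct complex_t {
    float re;
    int p1[padding1];
    double im;
    int p2[padding2];
};

template <typename Complex>
void multiply(Complex* result, Complex* a, Complex* b, int n) {
    for (int i = 0; i < n; ++i) {
        // Real and imaginary parts of the complex product a[i] * b[i].
        result[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        result[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
    }
}

// Hypothetical alias for the slowest layout: offsetof(im) == 8, size 64
// (assuming 4-byte float/int and an 8-byte aligned double).
using complex_off8 = complex_t<1, 12>;

// Explicit instantiation: forces code generation for this layout even
// though nothing in this file calls the function.
template void multiply<complex_off8>(complex_off8*, complex_off8*,
                                     complex_off8*, int);
```

Compiling with something like `clang++ -O3 -S -masm=intel multiply.cpp -o multiply.s` should produce the assembly; check which syntax the tool expects before pasting. At -O3 the compiler may emit more than one version of the loop, so only the hot loop body should be pasted. Simulators of this kind generally model the core pipeline rather than the cache hierarchy, so a difference that comes from cache behavior may not show up in the result.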

huangapple
  • Published on May 28, 2023, 04:26:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76348897.html