Multiplying a struct (storing a float) by a float is coming out to be faster than simply multiplying a float by a float?

Question

I needed a way of initializing a scalar value given either a single float, or three floating point values (corresponding to RGB). So I just threw together a very simple struct:

struct Mono {
    float value;

    Mono(){
        this->value = 0;
    }

    Mono(float value) {
        this->value = value;
    };

    Mono(float red, float green, float blue){
        this->value = (red+green+blue)/3;
    };
};

// Multiplication operator overloads:
Mono operator*( Mono const& lhs, Mono const& rhs){
    return Mono(lhs.value*rhs.value);
};
Mono operator*( float const& lhs, Mono const& rhs){
    return Mono(lhs*rhs.value);
};
Mono operator*( Mono const& lhs, float const& rhs){
    return Mono(lhs.value*rhs);
};
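
For illustration, here's a minimal sketch of how the struct might be used (the variable names are hypothetical, not from the original code):

Mono gray(0.5f);                 // scalar constructor
Mono avg(0.2f, 0.7f, 0.1f);      // RGB constructor: value = (0.2+0.7+0.1)/3
Mono doubled = gray * 2.0f;      // Mono * float
Mono combined = gray * avg;      // Mono * Mono
float out = (2.0f * avg).value;  // float * Mono, then read the wrapped float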

This worked as expected, but then I wanted to benchmark it to see whether the wrapper impacts performance at all, so I wrote the following test, which multiplies a float by the struct 100,000,000 times and a float by a float 100,000,000 times:

#include <vector>
#include <chrono>
#include <iostream>

using namespace std::chrono;

int main() {
    size_t N = 100000000;

    std::vector<float> inputs(N);

    std::vector<Mono> outputs_c(N);
    std::vector<float> outputs_f(N);

    Mono color(3.24);
    float color_f = 3.24;

    for (size_t i = 0; i < N; i++){
        inputs[i] = i;
    };

    auto start_c = high_resolution_clock::now();
    for (size_t i = 0; i < N; i++){
        outputs_c[i] = color*inputs[i];
    }
    auto stop_c = high_resolution_clock::now();
    auto duration_c = duration_cast<microseconds>(stop_c - start_c);
    std::cout << "Mono*float duration: " << duration_c.count() << "\n";

    auto start_f = high_resolution_clock::now();
    for (size_t i = 0; i < N; i++){
        outputs_f[i] = color_f*inputs[i];
    }
    auto stop_f = high_resolution_clock::now();
    auto duration_f = duration_cast<microseconds>(stop_f - start_f);
    std::cout << "float*float duration: " << duration_f.count() << "\n";

    return 0;
}

When I compile it without any optimizations (g++ test.cpp), it prints the following times (in microseconds) very reliably:

Mono*float duration:  841122
float*float duration: 656197

So the Mono*float is clearly slower in that case. But then if I turn on optimizations (g++ test.cpp -O3), it prints the following times (in microseconds) very reliably:

Mono*float duration:  75494
float*float duration: 86176

I'm assuming that something is getting optimized weirdly here and it is NOT actually faster to wrap a float in a struct like this... but I'm struggling to see what is going wrong with my test.

Answer 1

Score: 1


On my system (i7-6700k with GCC 12.2.1), whichever loop I do second runs slower, and the asm for the two loops is identical.

Perhaps because cache is still partly primed from the inputs[i] = i; init loop when the first loop runs. (See https://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ re: Intel's adaptive replacement policy for L3 which might explain some but not all of the entries surviving that big init loop. 100000000 floats is 400 MB per array, and my CPU has 8 MiB of L3 cache.)

So as expected from the low computational intensity (one vector math instruction per 16 bytes loaded + stored), it's just a cache / memory bandwidth benchmark, since you used one huge array instead of repeated passes over a smaller array. Nothing to do with whether you have a bare float or a struct { float; }.
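
As a sketch of that distinction (sizes here are illustrative, not from the original answer), repeated passes over a cache-resident array shift the bottleneck from DRAM bandwidth to the multiply/load/store throughput you presumably wanted to measure:

size_t small_N = 4096;                    // 16 KiB of floats: fits in L1d
size_t passes  = N / small_N;             // comparable total work
std::vector<float> in(small_N), out(small_N);
for (size_t i = 0; i < small_N; i++) in[i] = (float)i;

auto t0 = high_resolution_clock::now();
for (size_t p = 0; p < passes; p++)
    for (size_t i = 0; i < small_N; i++)
        out[i] = color_f * in[i];
auto t1 = high_resolution_clock::now();
// Caveat: every pass writes the same values, so an optimizer may collapse
// the outer loop; a real harness needs a sink like DoNotOptimize (below).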


As expected, both loops compile to the same asm - https://godbolt.org/z/7eTh4ojYf - doing a movups load, mulps to multiply 4 floats, and a movups unaligned store. For some reason, GCC reloads the vector constant of 3.24 instead of hoisting it out of the loop, so it's doing 2 loads and 1 store per multiply. Cache misses on the big arrays should give plenty of time for out-of-order exec to do those extra loads from the same .rodata address that hit in L1d cache every time.

I tried "How can I mitigate the impact of the Intel jcc erratum on gcc?" but it didn't make a difference; the performance delta is still about the same with -Wa,-mbranches-within-32B-boundaries, so as expected it's not a front-end bottleneck; IPC is plenty low. Maybe some quirk of cache.


On my system (Linux 6.1.8 on i7-6700k at 3.9GHz, compiled with GCC 12.2.1 -O3 without -march=native or -ffast-math), your whole program spends nearly half its time in the kernel's page fault handler (perf stat vs. perf stat --all-user cycle counts). That's not great if you aren't trying to benchmark memory allocation and TLB misses.
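
To confirm where the faults land, one option (a sketch using the POSIX getrusage API; not part of the original answer) is to read the process's fault counters around each timed region:

#include <sys/resource.h>  // getrusage (POSIX / Linux)

struct rusage before, after;
getrusage(RUSAGE_SELF, &before);
// ... timed loop here ...
getrusage(RUSAGE_SELF, &after);
std::cout << "minor faults: " << (after.ru_minflt - before.ru_minflt)
          << ", major faults: " << (after.ru_majflt - before.ru_majflt) << "\n";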

But that's total time; you do touch the input and output arrays before the loop (std::vector<float> outputs_f(N) allocates and zeros space for N elements, and std::vector<Mono> outputs_c(N) does the same via your struct's default constructor). There shouldn't be page faults inside your timed regions, only potentially TLB misses. And of course lots of cache misses.


BTW, clang correctly optimizes away all the loops, because none of the results are ever used. benchmark::DoNotOptimize(outputs_c[argc]) might help with that. Or some manual use of asm with dummy memory inputs / outputs, to force the compiler to materialize the arrays in memory and forget their contents.
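
As a sketch of that last idea (GNU-style inline asm; the helper name is made up), an empty asm statement that takes the array's data pointer and clobbers "memory" forces the compiler to assume the stores are observed:

// Hypothetical sink: the empty asm can't be analyzed, and the "memory"
// clobber means the compiler must assume it reads anything reachable
// through p, so stores into the array can't be deleted.
inline void sink(void const* p) {
    asm volatile("" : : "g"(p) : "memory");
}
// After each timed loop:
sink(outputs_c.data());
sink(outputs_f.data());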

See also https://stackoverflow.com/questions/60291987/idiomatic-way-of-performance-evaluation

huangapple
  • Posted on 2023-02-16 10:41:25
  • When reposting, please keep the original link: https://go.coder-hub.com/75467319.html