June 2, 2023, 14:56:52

fmin and fmax are much slower than a simple conditional operator

Question

I was working on some C++ code to process video frames and found that std::fmin and std::fmax are much slower than a simple conditional operator. I've simplified my code as follows (and made it more idiomatic C++, as suggested in the comments):

#include <cmath>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <memory>

void func()
{
    // One 1280x720 frame with 3 bytes per pixel.
    std::unique_ptr<uint8_t[]> ptr(new uint8_t[1280 * 720 * 3]);
    auto mem = ptr.get();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i != 720; ++i) {
        for (int j = 0; j != 1280; ++j) {
            for (int k = 0; k != 3; ++k) {
                float tmp = i + j + k;
                // Hand-written variant: clamp with conditionals, then add 0.5f before the truncating store.
                // *mem++ = (tmp > 0.f ? (tmp < 255.f ? tmp : 255.f) : 0.f) + 0.5f;
                // Library variant: clamp with fmax/fmin, then round.
                *mem++ = std::round(fmin(255.f, fmax(0.f, tmp)));
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    std::cout << "cost " << std::chrono::duration<double, std::milli>(end - start).count() << "ms\n";
}

int main() {
    // Run the measurement five times.
    int i = 5;
    while (i--) func();
}

This code takes about 20 ms per call on my machine (built with g++ -O3 test.cpp):

cost 21.1324ms
cost 20.8892ms
cost 19.9664ms
cost 19.9693ms
cost 19.9603ms

If I replace all the standard library math functions with my own code (by uncommenting the line above), it takes only about 3 to 4 ms:

cost 3.90695ms
cost 3.48335ms
cost 3.02623ms
cost 2.65635ms
cost 2.76906ms

I've also tried std::fmin, std::fmax and std::round separately, and each one is much slower than its hand-written equivalent. For example:
*mem++ = fmax(0.f, tmp);

cost 9.31014ms
cost 8.86421ms
cost 7.8366ms
cost 7.86914ms
cost 7.82036ms

Compare that with the conditional operator:
*mem++ = tmp > 0.f ? tmp : 0.f;

cost 3.50026ms
cost 3.05906ms
cost 2.33485ms
cost 2.36281ms
cost 2.38488ms
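
The two expressions compared above are not exact equivalents: they treat NaN differently, which may limit how far a compiler can simplify fmax without options that relax floating-point rules. The sketch below only illustrates the difference in results. The helper names cond_max, forms_disagree and nan_demo are made up for this example, and whether this accounts for the timings above is an assumption, not something measured here.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: the plain conditional form from the question.
inline float cond_max(float a, float b) { return a > b ? a : b; }

// Returns true if std::fmax and the conditional disagree for the pair (a, b).
inline bool forms_disagree(float a, float b) {
    const float lib = std::fmax(a, b);
    const float cond = cond_max(a, b);
    // NaN never compares equal to anything, so compare "NaN-ness" first.
    if (std::isnan(lib) || std::isnan(cond)) {
        return std::isnan(lib) != std::isnan(cond);
    }
    return lib != cond;
}

// Checks the NaN cases; call it from main() to run the example.
inline void nan_demo() {
    const float nan = std::numeric_limits<float>::quiet_NaN();
    // std::fmax treats a single NaN as missing data and returns the other argument.
    assert(std::fmax(1.0f, nan) == 1.0f);
    assert(std::fmax(nan, 1.0f) == 1.0f);
    // Any comparison with NaN is false, so the conditional returns its second operand.
    assert(std::isnan(cond_max(1.0f, nan)));
    assert(cond_max(nan, 1.0f) == 1.0f);
    // For ordinary values the two forms agree.
    assert(!forms_disagree(3.0f, -2.0f));
}
```

For the benchmark inputs, which are never NaN, both forms store the same bytes; the difference only matters for what the compiler is allowed to assume.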

As for std::round, if I simply delete it and run just *mem++ = fmin(255.f, fmax(0.f, tmp));, the time drops by about 7 ms:

cost 13.4067ms
cost 13.2468ms
cost 12.1877ms
cost 12.2698ms
cost 12.1878ms
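
To make concrete what std::round does beyond adding 0.5f and truncating, here is a minimal sketch comparing the two on a few values. The helper names round_by_offset, round_by_std and round_demo are made up for this example, and the sketch does not claim the two agree for every float in the clamped range, only for the values shown.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: "add 0.5 and truncate", as in the commented-out line
// of the question. Truncation happens in the float -> int conversion.
inline int round_by_offset(float x) { return static_cast<int>(x + 0.5f); }

// Hypothetical helper: round via the standard library.
inline int round_by_std(float x) { return static_cast<int>(std::round(x)); }

// Call it from main() to run the checks.
inline void round_demo() {
    // Typical values in the clamped range [0, 255] agree.
    assert(round_by_offset(0.0f) == round_by_std(0.0f));
    assert(round_by_offset(127.4f) == 127 && round_by_std(127.4f) == 127);
    assert(round_by_offset(127.5f) == 128 && round_by_std(127.5f) == 128);
    assert(round_by_offset(255.0f) == 255 && round_by_std(255.0f) == 255);

    // For negative inputs they differ: std::round rounds halfway cases away
    // from zero, while truncation after adding 0.5 moves toward zero.
    assert(round_by_std(-2.5f) == -3);
    assert(round_by_offset(-2.5f) == -2);
}
```

Because the question clamps to [0, 255] before rounding, the negative case never occurs there, which is why the cheaper form is usable in that loop.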

std::fmin, std::fmax and std::round are all constexpr, so I thought there should be no function-call overhead.
I know std::round does more than simply adding 0.5f and converting to an integer, but it is still much slower than I expected.
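
One detail worth ruling out: the code calls fmin and fmax unqualified. <cmath> guarantees the float overloads under std::, while whether the same names are also visible unqualified, and with which signatures, is left to the implementation. A minimal sketch, assuming it is worth pinning the float overloads explicitly, follows. clamp_round_u8 is a name made up for this example, and the sketch does not show that this changes the timings.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <type_traits>

// With float arguments, the std:: overloads declared in <cmath> return float.
static_assert(std::is_same<decltype(std::fmin(1.0f, 2.0f)), float>::value,
              "std::fmin(float, float) returns float");
static_assert(std::is_same<decltype(std::fmax(1.0f, 2.0f)), float>::value,
              "std::fmax(float, float) returns float");
static_assert(std::is_same<decltype(std::round(1.0f)), float>::value,
              "std::round(float) returns float");

// Hypothetical helper: the expression from the question with every call
// qualified, so the float overloads are selected explicitly.
inline std::uint8_t clamp_round_u8(float x) {
    return static_cast<std::uint8_t>(
        std::round(std::fmin(255.0f, std::fmax(0.0f, x))));
}
```

Swapping this helper into the inner loop and re-running the timing would show whether overload selection plays any part on a given toolchain.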

g++ -v on my system (Ubuntu 20.04, x86-64):

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

Answer 1

Score: 1

As always, with performance questions, the hardware and software stack are both important and empirical measurement is the arbiter of truth. In this particular instance, the platform makes a big difference that is not intuitive or easily predicted.

I used nanobench to test three different options for the computation in question. Here are the results for two different platforms (one arm64, one x86).

M1, Mac OSX 13.4, Clang-16

|         ns/op |   op/s |  err% | total | benchmark
|--------------:|-------:|------:|------:|:----------
|  3,166,542.00 | 315.80 | 13.1% |  0.04 | `compare`
|  1,988,667.00 | 502.85 |  7.3% |  0.02 | `round-fminmax`
|  1,911,292.00 | 523.21 |  3.6% |  0.02 | `clamp`

Xeon, Ubuntu 20.04, Clang-17

|         ns/op |   op/s |  err% | total | benchmark
|--------------:|-------:|------:|------:|:----------
|  6,763,898.00 | 147.84 |  0.5% |  0.08 | `compare`
| 10,629,358.00 |  94.08 |  0.2% |  0.13 | `round-fminmax`
|  5,131,994.00 | 194.86 |  0.0% |  0.06 | `clamp`
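
To put the Xeon numbers in relative terms, the arithmetic can be sketched as below. The constants are copied from the table above; the names kXeonCompare, kXeonRoundFminmax, kXeonClamp, kElementsPerOp, slowdown and ns_per_element are made up for this example, and the per-element figure assumes each benchmark op processes one full 1280x720x3 frame, as in the sample code.

```cpp
#include <cassert>
#include <cmath>

// ns/op values copied from the Xeon table above.
constexpr double kXeonCompare = 6763898.0;
constexpr double kXeonRoundFminmax = 10629358.0;
constexpr double kXeonClamp = 5131994.0;

// Assumption: each benchmark op processes one 1280 x 720 frame with 3 channels.
constexpr double kElementsPerOp = 1280.0 * 720.0 * 3.0;

// How many times slower `slow` is than `fast`.
constexpr double slowdown(double slow, double fast) { return slow / fast; }

// Average time per output byte in nanoseconds.
constexpr double ns_per_element(double ns_per_op) {
    return ns_per_op / kElementsPerOp;
}
```

With these inputs, round-fminmax takes roughly 2.07 times as long as clamp on the Xeon, compare takes roughly 1.32 times as long, and clamp averages about 1.86 ns per output byte.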

Sample Code

#include <algorithm>
#include <iostream>
#include <cmath>
#include <chrono>
#include <cstdint>
#include <memory>
#include "nanobench.h"

using std::cin, std::cout, std::endl;

// Local clamp: returns lo if v < lo, hi if hi < v, otherwise v.
template<class T>
auto clamp(const T& v, const T& lo, const T& hi) {
    return v < lo ? lo : hi < v ? hi : v;
}

// Fills one 1280x720x3 frame, applying op to each synthetic sample.
template<class Op>
void func(uint8_t *mem, Op&& op)
{
    for (int i = 0; i != 720; ++i) {
        for (int j = 0; j != 1280; ++j) {
            for (int k = 0; k != 3; ++k) {
                float tmp = i + j + k;
                *mem++ = op(tmp);
            }
        }
    }
}

int main() {
    std::unique_ptr<uint8_t[]> ptr(new uint8_t[1280 * 720 * 3]);
    auto *mem = ptr.get();

    // Conditional operators; 0.5 is a double literal, so this lambda returns double.
    ankerl::nanobench::Bench().run("compare", [&]() {
        func(mem, [](float x) {
            return (x > 0 ? (x < 255 ? x : 255) : 0) + 0.5;
        });
    });

    // Library calls, as in the question.
    ankerl::nanobench::Bench().run("round-fminmax", [&]() {
        func(mem, [](float x) {
            return std::round(fmin(255.f, fmax(0.f, x)));
        });
    });

    // Add 0.5f first, then clamp with the local template above.
    ankerl::nanobench::Bench().run("clamp", [&]() {
        func(mem, [](float x) {
            return clamp(x + 0.5f, 0.0f, 255.0f);
        });
    });
}
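
Before reading too much into the timings, it can help to confirm that the three variants store the same bytes. The sketch below rewrites the three lambdas as plain functions and compares them over integer inputs, which is what the benchmark feeds them (i + j + k ranges from 0 to 2000). The names op_compare, op_round_fminmax, op_clamp and variants_agree are made up for this example, the library calls are std::-qualified here unlike in the benchmark, and non-integer inputs are not covered.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// The three per-sample operations from the benchmark as plain functions.
inline std::uint8_t op_compare(float x) {
    // Same expression as the "compare" lambda; the sum is evaluated in double.
    return static_cast<std::uint8_t>((x > 0 ? (x < 255 ? x : 255) : 0) + 0.5);
}

inline std::uint8_t op_round_fminmax(float x) {
    return static_cast<std::uint8_t>(
        std::round(std::fmin(255.0f, std::fmax(0.0f, x))));
}

inline std::uint8_t op_clamp(float x) {
    // Add 0.5f first, then clamp to [0, 255]; the conversion truncates.
    const float v = x + 0.5f;
    return static_cast<std::uint8_t>(v < 0.0f ? 0.0f : (255.0f < v ? 255.0f : v));
}

// Returns true if all three agree for every integer input in [lo, hi].
inline bool variants_agree(int lo, int hi) {
    for (int t = lo; t <= hi; ++t) {
        const float x = static_cast<float>(t);
        const std::uint8_t a = op_compare(x);
        if (a != op_round_fminmax(x) || a != op_clamp(x)) return false;
    }
    return true;
}
```

Calling variants_agree(0, 2000) covers every value the benchmark loop produces.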
