fmin and fmax are much slower than a simple conditional operator
Question
I was working on some C++ code that processes video frames and found that std::fmin and std::fmax are much slower than a simple conditional operator. I've simplified my code as follows (and made it more idiomatic C++, as suggested in the comments):
#include <cmath>
#include <chrono>
#include <cstdint>  // uint8_t
#include <iostream>
#include <memory>

void func()
{
    // One 1280x720 frame with 3 bytes per pixel.
    std::unique_ptr<uint8_t[]> ptr(new uint8_t[1280 * 720 * 3]);
    auto mem = ptr.get();
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i != 720; ++i) {
        for (int j = 0; j != 1280; ++j) {
            for (int k = 0; k != 3; ++k) {
                float tmp = i + j + k;
                // *mem++ = (tmp > 0.f ? (tmp < 255.f ? tmp : 255.f) : 0.f) + 0.5f;
                *mem++ = std::round(fmin(255.f, fmax(0.f, tmp)));
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    std::cout << "cost " << std::chrono::duration<double, std::milli>(end - start).count() << "ms\n";
}

int main() {
    int i = 5;
    while (i--) func();
}
On my machine this code takes about 20ms per call (compiled with g++ -O3 test.cpp):
cost 21.1324ms
cost 20.8892ms
cost 19.9664ms
cost 19.9693ms
cost 19.9603ms
If I replace all the standard library math functions with my own code (by uncommenting the line above), the cost drops to about 3 to 4ms:
cost 3.90695ms
cost 3.48335ms
cost 3.02623ms
cost 2.65635ms
cost 2.76906ms
I've also tried std::fmin and std::fmax (and std::round) separately, and each of them is much slower. For example:
*mem++ = fmax(0.f, tmp);
cost 9.31014ms
cost 8.86421ms
cost 7.8366ms
cost 7.86914ms
cost 7.82036ms
versus the conditional operator:
*mem++ = tmp > 0.f ? tmp : 0.f;
cost 3.50026ms
cost 3.05906ms
cost 2.33485ms
cost 2.36281ms
cost 2.38488ms
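The two forms timed above are not strictly interchangeable, which may be relevant when comparing them. The sketch below is illustrative only: max_ternary is a hypothetical helper name, and whether any of these differences explains the timing gap on a given compiler is an assumption to check against the generated assembly. It shows that std::fmax must return the non-NaN operand while a plain comparison does not, and that std::fmax called with float arguments returns float, whereas the type of the unqualified fmax call used in the question depends on which declaration the standard library headers expose in the global namespace.

```cpp
#include <cmath>
#include <limits>
#include <type_traits>

// Plain conditional form (hypothetical helper): any comparison with NaN is
// false, so a NaN in the second operand is returned unchanged.
inline float max_ternary(float a, float b) { return a > b ? a : b; }

// std::fmax has a float overload, so the computation stays in float.
constexpr bool kStdFmaxReturnsFloat =
    std::is_same<decltype(std::fmax(0.f, 1.f)), float>::value;

// Unqualified call, as written in the question. Assumption: the name is
// visible in the global namespace after <cmath>; on some standard libraries
// only the double version is, so the float arguments are promoted to double.
constexpr bool kUnqualifiedFmaxReturnsDouble =
    std::is_same<decltype(fmax(0.f, 1.f)), double>::value;

// Library form: if exactly one argument is NaN, the other one is returned.
inline float max_library(float a, float b) { return std::fmax(a, b); }
```

Printing kUnqualifiedFmaxReturnsDouble on the machine in question would show whether the unqualified calls are being evaluated in double precision there.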
As for std::round: if I simply delete it and run just *mem++ = fmin(255.f, fmax(0.f, tmp)); the cost drops by about 7ms:
cost 13.4067ms
cost 13.2468ms
cost 12.1877ms
cost 12.2698ms
cost 12.1878ms
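To make the comparison concrete, here is a minimal sketch of the two conversions as standalone functions. The names to_u8_compare and to_u8_round are hypothetical, and the second one uses the std:: qualified float overloads rather than the unqualified calls from the question. For non-negative values, adding 0.5f and truncating gives the same result as std::round on inputs such as the integers used in the loop, although the two can disagree for a few non-integer inputs just below a half, where x + 0.5f itself rounds up.

```cpp
#include <cmath>
#include <cstdint>

// Commented-out variant from the question: clamp with comparisons,
// add 0.5f, then let the float-to-integer conversion truncate.
inline std::uint8_t to_u8_compare(float x) {
    float clamped = x > 0.f ? (x < 255.f ? x : 255.f) : 0.f;
    return static_cast<std::uint8_t>(clamped + 0.5f);
}

// Library variant: clamp with fmin/fmax, then round half away from zero.
inline std::uint8_t to_u8_round(float x) {
    float clamped = std::fmin(255.f, std::fmax(0.f, x));
    return static_cast<std::uint8_t>(std::round(clamped));
}
```

Both functions map the loop's inputs (the integers i + j + k) to the same byte, so the timings compare equivalent work for this test data.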
- I thought std::fmin, std::fmax and std::round were all constexpr, so there should be no function call overhead.
- I know std::round does more than simply adding 0.5f and converting to an integer, but it's still much slower than I expected.
Output of g++ -v on my system (Ubuntu 20.04, x86-64):
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
Answer 1
Score: 1
As always, with performance questions, the hardware and software stack are both important and empirical measurement is the arbiter of truth. In this particular instance, the platform makes a big difference that is not intuitive or easily predicted.
I used nanobench to test three different options for the computation in question. Here are the results for two different platforms (one arm64, one x86).
M1, Mac OSX 13.4, Clang-16
| ns/op | op/s | err% | total | benchmark |
|---:|---:|---:|---:|:---|
| 3,166,542.00 | 315.80 | 13.1% | 0.04 | compare |
| 1,988,667.00 | 502.85 | 7.3% | 0.02 | round-fminmax |
| 1,911,292.00 | 523.21 | 3.6% | 0.02 | clamp |
Xeon, Ubuntu 20.04, Clang-17
| ns/op | op/s | err% | total | benchmark |
|---:|---:|---:|---:|:---|
| 6,763,898.00 | 147.84 | 0.5% | 0.08 | compare |
| 10,629,358.00 | 94.08 | 0.2% | 0.13 | round-fminmax |
| 5,131,994.00 | 194.86 | 0.0% | 0.06 | clamp |
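The columns in these tables are related by simple arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes that one "op" is one call of the benchmarked lambda, which in the sample code processes one full 1280x720x3 frame.

```cpp
#include <cstddef>

// Bytes written per benchmarked call (one full frame).
constexpr std::size_t kBytesPerFrame = 1280u * 720u * 3u;

// op/s is the reciprocal of ns/op, scaled from nanoseconds to seconds.
constexpr double ops_per_second(double ns_per_op) { return 1e9 / ns_per_op; }

// Average time spent per output byte.
constexpr double ns_per_byte(double ns_per_op) {
    return ns_per_op / static_cast<double>(kBytesPerFrame);
}

// How many times slower variant a is than variant b.
constexpr double slowdown(double ns_a, double ns_b) { return ns_a / ns_b; }
```

For example, slowdown(10629358.0, 5131994.0) gives roughly 2.07, so on the Xeon the round-fminmax variant takes about twice as long as clamp, while on the M1 the two are within a few percent of each other.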
Sample Code
#include <algorithm>
#include <cmath>
#include <chrono>
#include <cstdint>  // uint8_t
#include <iostream>
#include <memory>

// nanobench is a single-header library: exactly one translation unit must
// define this macro before including the header.
#define ANKERL_NANOBENCH_IMPLEMENT
#include "nanobench.h"

// Local clamp, written with the same comparisons as std::clamp.
template<class T>
auto clamp(const T& v, const T& lo, const T& hi) {
    return v < lo ? lo : hi < v ? hi : v;
}

// Fills one 1280x720x3 frame, converting each value with op.
template<class Op>
void func(uint8_t *mem, Op&& op)
{
    for (int i = 0; i != 720; ++i) {
        for (int j = 0; j != 1280; ++j) {
            for (int k = 0; k != 3; ++k) {
                float tmp = i + j + k;
                *mem++ = op(tmp);
            }
        }
    }
}

int main() {
    std::unique_ptr<uint8_t[]> ptr(new uint8_t[1280 * 720 * 3]);
    auto *mem = ptr.get();

    ankerl::nanobench::Bench().run("compare", [&]() {
        func(mem, [](float x) {
            return (x > 0 ? (x < 255 ? x : 255) : 0) + 0.5;
        });
    });

    ankerl::nanobench::Bench().run("round-fminmax", [&]() {
        func(mem, [](float x) {
            return std::round(fmin(255.f, fmax(0.f, x)));
        });
    });

    ankerl::nanobench::Bench().run("clamp", [&]() {
        func(mem, [](float x) {
            return clamp(x + 0.5f, 0.0f, 255.0f);
        });
    });
}
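Before reading too much into the timings, it can be worth confirming that the three variants write the same bytes for the values this loop actually produces. The following is a minimal sketch of such a check, not part of the original benchmark. The function names are hypothetical, and the second variant uses the std:: qualified calls.

```cpp
#include <cmath>
#include <cstdint>

// "compare" lambda body; the + 0.5 makes the intermediate result a double.
inline std::uint8_t via_compare(float x) {
    return static_cast<std::uint8_t>((x > 0 ? (x < 255 ? x : 255) : 0) + 0.5);
}

// "round-fminmax" lambda body, with std:: qualified calls.
inline std::uint8_t via_round_fminmax(float x) {
    return static_cast<std::uint8_t>(
        std::round(std::fmin(255.f, std::fmax(0.f, x))));
}

// "clamp" lambda body, with the clamp comparisons written out inline.
inline std::uint8_t via_clamp(float x) {
    float v = x + 0.5f;
    return static_cast<std::uint8_t>(v < 0.f ? 0.f : (255.f < v ? 255.f : v));
}

// Returns true if all three variants agree on every value of i + j + k that
// the benchmark loop generates (0 through 719 + 1279 + 2).
inline bool variants_agree() {
    for (int s = 0; s <= 719 + 1279 + 2; ++s) {
        float x = static_cast<float>(s);
        std::uint8_t expected = via_compare(x);
        if (via_round_fminmax(x) != expected || via_clamp(x) != expected) {
            return false;
        }
    }
    return true;
}
```

Calling variants_agree() once at the start of main would catch a variant that silently computes something different, which would otherwise make the timing comparison misleading.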