Why is GCC so much worse than Clang at vectorizing a conditional multiply over std::vector<float>?
Question
Consider the following float loop, compiled with -O3 -mavx2 -mfma:
for (auto i = 0; i < a.size(); ++i) {
    a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
Clang does a perfect job of vectorizing it. It uses 256-bit ymm registers and understands the difference between vblendps and vandps, which gives the best possible performance.
.LBB0_7:
    vcmpltps ymm2, ymm1, ymm0
    vmulps ymm0, ymm0, ymm1
    vandps ymm0, ymm2, ymm0
GCC, however, is much worse. For some reason it doesn't get better than SSE 128-bit vectors (-mprefer-vector-width=256 doesn't change anything).
.L6:
    vcomiss xmm0, xmm1
    vmulss xmm0, xmm0, xmm1
    vmovss DWORD PTR [rcx+rax*4], xmm0
If I replace it with a plain array (as in the guideline), GCC does vectorize it to AVX ymm.
int a[256], b[256], c[256];
auto foo (int *a, int *b, int *c) {
    int i;
    for (i=0; i<256; i++){
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
    }
}
However, I couldn't find a way to do this with a variable-length std::vector. What sort of hint does GCC need to vectorize std::vector to AVX?
Source on Godbolt, using gcc 13.1 and clang 14.0.0.
# Answer 1
**Score**: 35
<details>
<summary>English:</summary>
The problem isn't `std::vector`; it's `float` combined with GCC's usually-bad default of `-ftrapping-math`. That option is supposed to treat FP exceptions as a visible side effect, but it doesn't always do so correctly, and it misses some optimizations that would be safe.
In this case, there *is* a conditional FP multiply in the source, so strict exception behaviour avoids possibly raising an overflow, underflow, inexact, or other exception in case the compare was false.
**GCC does that correctly in this case using scalar code**: the `...ss` suffix means Scalar Single, operating on the bottom element of 128-bit XMM registers, so the loop isn't vectorized at all. Your asm isn't GCC's actual output: GCC loads both elements with `vmovss`, then branches on a `vcomiss` result *before* `vmulss`, so the multiply doesn't happen if `b[i] > c[i]` isn't true. So unlike your "GCC" asm, GCC's actual asm does, I think, correctly implement `-ftrapping-math`.
Notice that your example which does auto-vectorize uses `int *` args, not `float*`. If you change it to `float*` and use the same compiler options, it doesn't auto-vectorize either, even with `float *__restrict a` (https://godbolt.org/z/nPzsf377b).
@273K's answer shows that **AVX-512 lets `float` auto-vectorize even with `-ftrapping-math`**, since AVX-512 masking (`ymm2{k1}{z}`) suppresses FP exceptions for masked elements, not raising FP exceptions from any FP multiplies that don't happen in the C++ abstract machine.
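To make the exception argument concrete, here is a small sketch (not part of the original answer) that checks whether the overflow flag gets set for a conditional versus an unconditional multiply. The function names are hypothetical, and the `volatile` locals are only a crude way to keep the compiler from folding or moving the arithmetic; strictly speaking the standard only guarantees flag tests under `#pragma STDC FENV_ACCESS ON`, which compilers support unevenly.

```cpp
#include <cfenv>

// Hypothetical helper: evaluates (b > c) ? b * c : 0 and reports whether
// FE_OVERFLOW was raised. The multiply only runs when the compare is true.
bool conditional_multiply_overflows(float b, float c) {
    volatile float vb = b, vc = c;          // volatile: force run-time evaluation
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float r = (vb > vc) ? vb * vc : 0.0f;
    (void)r;
    return std::fetestexcept(FE_OVERFLOW) != 0;
}

// Hypothetical helper: the multiply always runs and only the result is
// selected, which is what mask-after-multiply vector code does per lane.
bool unconditional_multiply_overflows(float b, float c) {
    volatile float vb = b, vc = c;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float prod = vb * vc;          // may overflow even if discarded
    volatile float r = (vb > vc) ? prod : 0.0f;
    (void)r;
    return std::fetestexcept(FE_OVERFLOW) != 0;
}
```

With inputs such as `1e30f` and `2e30f` the compare is false but the product overflows `float`, so the two helpers are expected to disagree; that difference is the side effect `-ftrapping-math` is meant to preserve.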
---
#### `gcc -O3 -mavx2 -mfma -fno-trapping-math` auto-vectorizes all 3 functions ([Godbolt][1])
    void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
        for (int i=0; i<256; i++){
            a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
        }
    }

    foo(float*, float*, float*):
            xor     eax, eax
    .L143:
            vmovups ymm2, YMMWORD PTR [rsi+rax]
            vmovups ymm3, YMMWORD PTR [rdx+rax]
            vmulps  ymm1, ymm2, YMMWORD PTR [rdx+rax]
            vcmpltps        ymm0, ymm3, ymm2
            vandps  ymm0, ymm0, ymm1
            vmovups YMMWORD PTR [rdi+rax], ymm0
            add     rax, 32
            cmp     rax, 1024
            jne     .L143
            vzeroupper
            ret
BTW, **I'd recommend `-march=x86-64-v3`** for an AVX2+FMA feature level. That level also includes BMI1, BMI2, and a few other extensions (F16C, LZCNT, MOVBE). I think it still just uses `-mtune=generic`, but future GCC versions could hopefully ignore tuning considerations that only matter for CPUs that don't have AVX2+FMA+BMI2.
The `std::vector` functions are bulkier since we didn't use `float *__restrict a = avec.data();` or similar to promise non-overlap of the data pointed-to by the `std::vector` control blocks (and the size isn't known to be a multiple of the vector width), but the non-cleanup loops for the no-overlap case are vectorized with the same `vmulps` / `vcmpltps` / `vandps`.
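As a rough illustration of the `float *__restrict a = avec.data();` idea, the sketch below pulls raw pointers out of the vectors once and promises the compiler they don't overlap. The function name `cond_mul` is made up for this example, `__restrict` is a compiler extension (GCC, Clang, MSVC) rather than standard C++, and the code assumes `b` and `c` hold at least `a.size()` elements and that the three buffers really are distinct. Whether the loop then vectorizes still depends on the FP options discussed above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a[i] = (b[i] > c[i]) ? b[i] * c[i] : 0 for all i.
// Assumes b.size() >= a.size(), c.size() >= a.size(), and no aliasing.
void cond_mul(std::vector<float>& a,
              const std::vector<float>& b,
              const std::vector<float>& c) {
    float* __restrict pa = a.data();        // promise: pa, pb, pc don't overlap
    const float* __restrict pb = b.data();
    const float* __restrict pc = c.data();
    const std::size_t n = a.size();         // hoist the size out of the loop
    for (std::size_t i = 0; i < n; ++i) {
        pa[i] = (pb[i] > pc[i]) ? pb[i] * pc[i] : 0.0f;
    }
}
```

Using `std::size_t` for the index instead of `auto i = 0` (which deduces `int`) also avoids a signed/unsigned comparison against `size()`.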
---
See also:
* `-ftrapping-math` is broken and "never worked" [according to GCC dev Marc Glisse][2]. But https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54192 from 2012 proposing to make it not the default is still open.
* https://stackoverflow.com/questions/57673825/how-to-force-gcc-to-assume-that-a-floating-point-expression-is-non-negative (various FP options other than the full `-ffast-math`, such as `-fno-math-errno` which allows many functions to inline and is not a problem for normal code which doesn't check `errno` after calling `sqrt` or whatever!)
* [Semantics of Floating Point Math in GCC][3]
* https://stackoverflow.com/questions/2852730/auto-vectorization-on-double-and-ffast-math (of course reductions are only vectorized with `-ffast-math` or `#pragma omp simd reduction (+:my_sum_var)`, but @phuclv's answer has some good links)
----
### Tweaking the source to make the multiply unconditional? No
If the multiply in the C source happens regardless of the condition, then GCC would be *allowed* to vectorize it the efficient way without AVX-512 masking.
    // still scalar asm with GCC -ftrapping-math which is a bug
    void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
        for (int i=0; i<256; i++){
            float prod = b[i] * c[i];
            a[i] = (b[i] > c[i]) ? prod : 0;
        }
    }
But unfortunately GCC `-O3 -march=x86-64-v3` ([Godbolt][4] with and without the default `-ftrapping-math`) still makes scalar asm that only conditionally multiplies!
**This is a bug in `-ftrapping-math`**. Not only is it too conservative, missing the chance to auto-vectorize: It's actually buggy, *not* raising FP exceptions for some multiplies the abstract machine (or a debug build) actually performs. Crap behaviour like this is why `-ftrapping-math` is unreliable and probably shouldn't be on by default.
----
[@Ovinus Real's answer][5] points out GCC `-ftrapping-math` could still have auto-vectorized the original source by masking *both inputs* instead of the output. `0.0 * 0.0` never raises any FP exceptions, so it's basically emulating AVX-512 zero-masking.
This would be more expensive and have more latency for out-of-order exec to hide, but it's still much better than scalar, especially when AVX1 is available and for small to medium arrays that are hot in some level of cache.
(If writing with intrinsics, just mask the output to zero unless you actually want to check the FP environment for exception flags after the loop.)
Doing this in scalar source doesn't lead GCC into making asm like that: GCC compiles this to the same branchy scalar asm unless you use `-fno-trapping-math`. At least that's not a bug this time, just a missed optimization: this doesn't do `b[i]*c[i]` when the compare is false.
    // doesn't help, still scalar asm with GCC -ftrapping-math
    void bar (float *__restrict a, float *__restrict b, float *__restrict c) {
        for (int i=0; i<256; i++){
            float bi = b[i];
            float ci = c[i];
            if (! (bi > ci)) {
                bi = ci = 0;
            }
            a[i] = bi * ci;
        }
    }
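For the intrinsics route mentioned earlier (masking the output to zero), a minimal sketch could look like the following. It is an illustration rather than the answer's own code: the function name is hypothetical, it uses 128-bit SSE so it builds without extra flags on x86-64, and it falls back to a plain scalar loop for the tail or on targets without SSE. Note that it multiplies every lane, so it can raise FP exceptions for elements the source-level loop would have skipped.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#endif

// Hypothetical helper: a[i] = (b[i] > c[i]) ? b[i] * c[i] : 0, masking the
// product after the multiply. Assumes a does not overlap b or c.
void cond_mul_mask_output(float* a, const float* b, const float* c,
                          std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__) || defined(_M_X64)
    for (; i + 4 <= n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        __m128 mask = _mm_cmpgt_ps(vb, vc);   // all-ones bits where b > c
        __m128 prod = _mm_mul_ps(vb, vc);     // multiplies all four lanes
        _mm_storeu_ps(a + i, _mm_and_ps(mask, prod));  // zero the false lanes
    }
#endif
    for (; i < n; ++i) {                      // scalar tail / non-SSE fallback
        a[i] = (b[i] > c[i]) ? b[i] * c[i] : 0.0f;
    }
}
```

The `vandps` in the compiler output above plays the same role as `_mm_and_ps` here: an all-zero mask lane turns the product into `+0.0`.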
[1]: https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxCAAzBqkAA6oCoRODB7evnrJqY4CQSHhLFExZgCstpj2%2BQxCBEzEBJk%2BflxVNen1jQSFYZHRcQkKDU0t2e0jPX3FpSAVAJS2qF7EyOwc5rHByN5YANQmsW5OI8SYrEfYJhoAgls7e5iHxwBumA4kVze3APS/LCYwX2mFUrES9FI%2BwEtAAnvsFEofq9UHh0Pt%2BKhbgx0BARugQCB3p9iEc3FRaKgmARvmYAGz7JhQtAMEYIggEokfIik44Uqk02LXen7CLMgRs/GE4k8sn86m0hnIBaHADsVju%2By1GJI%2BwgTC8RH2eBeABF9hojhZjS83IyAHSpABemAgCyth0s1jwKpM6p%2B2sDjJM5QseBD5qO5ogERDYYjL2w%2B1MofD5VNvtiADE9bHUwmAFTJuNpjP7EAWq0B7V%2B00/Ws/f4EYh4TDoo3APDvfYAcTcdqY%2BwQwQIUPQqEwCgYYA4BCH1USyNR6Mx2PQAAkR3iOdLuV8%2BZSFULzAymcmJXOpVySXLD4LhQyxefWZed9fZQeBYrk77/ZrtXgVB6kwjp4C6brfLEpoVHSv4arcQbavwxDAYaqA2lGlaxNaJpkg6zquu62GelYlg%2Bmq8GIVqTAlgmmExrR6aJsW%2BbppmOYMax5pFim8ZseWWGUVRDb/lqIkIdq1RKBR1aIchqFGrhUGCRhxz4WBhEetY3pwbJQY0VxZq5oxkZCixfFlkcHF5hZ%2Bw8SZKoVpa2F6YG4luaqdZ3OJja/KgiSOCwGkKPsRpMK8qi%2BSOwahjBEZQjZcXpsycZJXWLl3CiaI6uh1B3nZAD6BXnGceAOIyULynOBZFSVzZlXOT5VYVxWTvV5XKjJ/7yRA0XhlBzk4WScXEeGXqWO6f4SdRJlGcZhlXOZpbsfNtn2VxjmCdWPmeT8HBLLQnDlLwfgcFopCoJwbjaZYCIrGszxbDwpAEJo%2B1LAA1nE5T2gAnBoZhSBodKqmYqoABxmODv36JwkgnW9F2cLwCggAkr1nftpBwLASBoCwiR0NE5CUPjhP0DEwBcLE7RYK8ZWYAAaq2ADuADyiSMJwz00LQBDRKjMaIxEwSNLC3O8CLzDELCbMRNo3IS6Q%2BNsIIbMMHCiNYBEXjAG4Yi0Kj3C8FggJGOImOkPg5wOF2k6I6CHyGhs50jtUiO0HgETEGLHhYIj9UsBLSwUkwwAKMzmDs5zp087IIhiOwUgyIIigqOolu6O0BhGCgN02J7ESo5ASz%2BbURsALRs7E%2BwV4CEVmLXLBUICtdUAwqAV82TCJITRh19SCC8Kg7zEC2WDF26HTcukLg4uMbRmKQgTBP0JSDJIGgJLkaQCAvOQpLvDAzAMMSxODqrT7bAjdGMnitBIS92DPN%2BjL0q%2BzBvW%2B2G/%2B%2BTG/J915nwvksBQ911gSAOkdBGltLocH2KocGdIK50kkPsYAyBkD7GpvaLgepcCEF1E9BYvAMZaAWEsBAFwsAxCnl9eI9pVSoPKBUX6v1z7g3PtnOGMDzpwJRmjF6b0lg40QCgVABMiZkAoBAMmUiUA5yppvBIdMGaR2jlzY2fA6D82IILCIwtRYyyVlLMWcsFYOCVirRgBB1aa0ttrXW%2BtaCGyVqbQwwALbnWtjPd4RtzqO2QM7JWbtDqW0Lj7GWfsXakJbEHY2IcDDh3URzTRcdBAJ3EMnfgqclBqERroJeii87jQLl7SepcArpErtXJu4VVCNzri3QcFd26dy8AwYgnhaAV0pP5BQw9R7j3tiXK%2BtQ57uHvhMZe
OJAFzHaDvWof8kiH1qHMwYkxqgvzqL/KZbQxldAAR/U%2Begph3yyPss578ignK4KA8BScoEcGOqQU6fDOAIKQSgtBuwPHYMkPaDQgL8H4B5J6GmJChGYwoaQehsR7SxERUi5FyLYYcHhq8xG/DbCCLIe9NFZheHD2RlC8hSxR6pGcJIIAA%3D
[2]: https://stackoverflow.com/questions/56670132/simd-for-float-threshold-operation#comment99952463_56681744
[3]: https://gcc.gnu.org/wiki/FloatingPointMath
[4]: https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxCBmGqQADqgKhE4MHt6%2BekkpjgJBIeEsUTFxtpj2eQxCBEzEBBk%2BflzllWk1dQQFYZHRsfEKtfWNWS2Dnd1FJf0AlLaoXsTI7BwA9KvBBADUTCYArBZmewBs%2BwAipFsR%2B4cn55emB0ene2cmAMxWGgCCAG6oeHQW34qC21FoqCY2wAVAB9WHETCDYh4Bw7S5UCFQrZwhFIggotERDFYmHwxHI1HbZAzLYmADsX2%2BW2BJDBmy2eA%2BZw0HwsnI%2BbmefIFlmsMwZTJZLMxkO2CWIGDp7zOVxuXNeOK2jwsGrenxMP2lO3V52VqrB1wOeuV2G1ptetI%2BADEtgqlSAtryDUa6fS3j8GQHvoaQz9/oCrnUwbLsbiKQSqejgaSceT8YTtsSU3K03jKWiaX6paziOzBAKVd7%2BVz3kK7p9RVZLBLGaHjbGs3hzWrrec%2Be3pZ3td3ufa%2B68B76WXgqGCwJAIqP3nbkHgZk629PjVdl6q1z3q4OWUHjyaJ6qx0utWup8y/cHTz8OHNaJw9rw/BwtKRUJw3NY1hbAoCxLJgdJmO8PCkAQmgvnMADWIDvHsAB0RwAJwYe8AAcOHvPSGFxHs%2BicJIn5wb%2BnC8AoIDxLB34vqQcCwEgaAsAkdDROQlDsZx9AxMAXBQaQWC/KimAAGp4JgADuADyCSMJw0E0LQBDRLREARJRS7MMQACeKm8HpdQGfJETaJgDjGaQ7FsII8kMLQRmMaJmARF4wBuGItC0dwvBYCwhjAOIbn4IiDh4L8SKUZgqjWV4GmUZsFSUbQeARMQZkeFglGJiwxlzJiTDAAo0lyYpykBTIggiGI7BSLV8hKGolG6C0BhGCggGWPomW0ZAcyoAkVT%2BQAtPJ7xbONwVLAg3KqDhxzjcckjjb803jVQBJMAknFGLNUIILwqAxcQKJYINEBzHY1lVC4DDuJ4TT%2BE9ky9DELQ5KkAgjM0iTJL9DAfcUfRjBU93tEMDQvaMrRQwIHT1KD0xjDD/16OMKPBD0YNfbdoHLBIr7vhRbl/hwWxLSta1bMAyDIFswmoVwYK4IQbLmFBMy8AxWgbqQCCYEwWAxDdpBIVBqGSBoxz0mY9I4WYOH0pIxxcBhpEcORpBfj%2BlM0XRMFwXMLGICgqAcVxZAUBAfE2yAQkiWJEkVQpSlfqpdAacQWk6W5pmGbZQfmZZ922fZjAEE5LmUVgnneb5/nQUFIVhT%2BEVQzF/k/vFiXJW5qVvm5GVZTlGArD%2BBVFXwBhle7VVe7w/B1aI4hNa3LUqOobm6O8%2BghT1Yp9WX13DaNaQTVNM1zcgC0qjTq3rZtM1UAwqDjbt%2B3BMAR0ECdv7nZdsVDQjUXOBAriYy0gS41M4OA7kaQ30/wOo4/d0X9UGNwwDX9VGRl0e%2Bn0sa/0yP/GGH8CbzEWMTLgpMOAfj1pRSm1NlrL3pozZm7xWbs3wEQMs3MEF81NohZCuD3hUOoTQmh2tdb61OtRWwxt%2BbwW1mYcmBtmFsMFudFIzhJBAA
[5]: https://stackoverflow.com/questions/76683811/why-gcc-is-so-much-worse-at-stdvectorfloat-vectorization-of-a-conditional-mu/76696058#76696058
</details>
# Answer 2
**Score**: 20
<details>
<summary>English:</summary>
GCC by default compiles for older CPU architectures.
Setting `-march=native` on a machine with AVX-512 lets GCC vectorize the loop with 256-bit ymm registers, using the mask register `k1` for the condition.
    .L7:
            vmovups ymm1, YMMWORD PTR [rsi+rax]
            vmovups ymm0, YMMWORD PTR [rdx+rax]
            vcmpps  k1, ymm1, ymm0, 14
            vmulps  ymm2{k1}{z}, ymm1, ymm0
            vmovups YMMWORD PTR [rcx+rax], ymm2
Setting `-march=x86-64-v4` enables using 512-bit zmm registers.
    .L7:
            vmovups zmm2, ZMMWORD PTR [rsi+rax]
            vcmpps  k1, zmm2, ZMMWORD PTR [rdx+rax], 14
            vmulps  zmm0{k1}{z}, zmm2, ZMMWORD PTR [rdx+rax]
            vmovups ZMMWORD PTR [rcx+rax], zmm0
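Since the result depends on which extensions a given `-march` setting turns on, one quick way to check is to look at the predefined macros that GCC and Clang set for each extension. The helper below is only a sketch with a made-up name; the macros themselves (`__AVX512F__`, `__AVX512VL__`, `__AVX2__`, `__AVX__`, `__SSE2__`) are the ones those compilers define when the corresponding feature is enabled at compile time.

```cpp
// Hypothetical helper: reports the widest vector extension this translation
// unit was compiled for, based on compiler-predefined feature macros.
const char* widest_vector_isa() {
#if defined(__AVX512F__) && defined(__AVX512VL__)
    return "AVX-512F + AVX-512VL (masking on xmm/ymm/zmm)";
#elif defined(__AVX512F__)
    return "AVX-512F (512-bit zmm, mask registers)";
#elif defined(__AVX2__)
    return "AVX2 (256-bit ymm, no mask registers)";
#elif defined(__AVX__)
    return "AVX (256-bit ymm for floating point)";
#elif defined(__SSE2__)
    return "SSE2 (128-bit xmm)";
#else
    return "no x86 vector extension detected";
#endif
}
```

Printing the result once with `-mavx2 -mfma` and once with `-march=native` shows whether the second build actually had AVX-512 masking available.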
</details>
# Answer 3
**Score**: 1
Assuming `-ftrapping-math`, another option is to zero the ignored inputs before multiplying them (untested):
```c
// Needs <xmmintrin.h>; a and b are float*, and size is a multiple of 4.
for (size_t i = 0; i < size; i += 4) {
    __m128 x = _mm_loadu_ps(a + i);
    __m128 y = _mm_loadu_ps(b + i);
    __m128 cmp = _mm_cmplt_ps(x, y);   // all-ones in lanes where x < y
    x = _mm_and_ps(x, cmp);            // ignored lanes become +0.0
    y = _mm_and_ps(y, cmp);
    _mm_storeu_ps(a + i, _mm_mul_ps(x, y));  // 0.0 * 0.0 raises no exception
}
```
This of course extends to wider vectors.
Both inputs must be zeroed, because +0.0 * x is -0.0 if x < 0. On some processors this will probably have the same throughput as other solutions of the same vector width. The same method works for addition, subtraction, and square root. For division, the ignored divisors must be replaced with a nonzero value rather than zero.
Even under `-fno-trapping-math`, this solution might be slightly superior to one that masks after the multiplication, because it avoids penalties from ignored inputs that would need a microcoded multiply (such as subnormal values on some CPUs). But I'm not sure whether the throughput can match a version that zeroes after the multiplication.
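A slightly fuller sketch of the same input-masking idea, adapted to the question's `b[i] > c[i]` condition with a separate output array, might look like this. The function name and the scalar tail loop are additions for illustration, not part of the answer; the SSE path is guarded so the code still compiles on targets without `<xmmintrin.h>`.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#endif

// Hypothetical helper: a[i] = (b[i] > c[i]) ? b[i] * c[i] : 0, zeroing both
// inputs in the ignored lanes so those lanes only ever compute +0.0 * +0.0.
// Assumes a does not overlap b or c.
void cond_mul_mask_inputs(float* a, const float* b, const float* c,
                          std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__) || defined(_M_X64)
    for (; i + 4 <= n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        __m128 mask = _mm_cmpgt_ps(vb, vc);  // all-ones bits where b > c
        vb = _mm_and_ps(vb, mask);           // ignored lanes become +0.0
        vc = _mm_and_ps(vc, mask);
        _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));
    }
#endif
    for (; i < n; ++i) {                     // scalar tail / non-SSE fallback
        a[i] = (b[i] > c[i]) ? b[i] * c[i] : 0.0f;
    }
}
```

Compared with masking the product, this costs one extra `and` per vector but keeps large or subnormal values in ignored lanes away from the multiplier.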