Multiply 128-bit vectors of signed 16-bit integers, widening to 32-bit elements
Question
I have 2 `__m128i` vectors. Each contains 8x `int16_t` (signed):

```
__m128i a = {a0, ..., a7}
__m128i b = {b0, ..., b7}
```

I want to multiply the 8 elements element-wise. The result of each multiply is `int32_t`, so each register will hold only 4 results:

```
__m128i c0 = {a0*b0, ..., a3*b3}
__m128i c1 = {a4*b4, ..., a7*b7}
```

I did not find such an intrinsic.
Answer 1

Score: 2
SSE2 has a 16-bit "mul_hi" (`_mm_mulhi_epi16`) that returns the high half of each product. The high 16 bits and the low 16 bits are combined into 32-bit elements using unpack:

```c
__m128i lo = _mm_mullo_epi16(a, b);
__m128i hi = _mm_mulhi_epi16(a, b);
__m128i c0 = _mm_unpacklo_epi16(lo, hi);
__m128i c1 = _mm_unpackhi_epi16(lo, hi);
```
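Wrapped up as a self-contained function, this might look like the following sketch (the name `mul16_widen_sse2` is just illustrative, not from the answer):

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Widening signed 16x16 -> 32-bit multiply: c0 receives products 0..3, c1 products 4..7.
static inline void mul16_widen_sse2(__m128i a, __m128i b, __m128i *c0, __m128i *c1)
{
    __m128i lo = _mm_mullo_epi16(a, b);   // low 16 bits of each product
    __m128i hi = _mm_mulhi_epi16(a, b);   // high 16 bits of each signed product
    *c0 = _mm_unpacklo_epi16(lo, hi);     // { a0*b0, a1*b1, a2*b2, a3*b3 }
    *c1 = _mm_unpackhi_epi16(lo, hi);     // { a4*b4, a5*b5, a6*b6, a7*b7 }
}
```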
Answer 2

Score: 1
@aqrit's answer points out that it's as easy as `pmullw` + `pmulhw`, and unpack lo / hi to combine the 16-bit halves of each 32-bit product. Accept that one.
#### Mostly obsolete answer - only useful if odd/even interleaved is useful
(Or perhaps as an alternate building block for an AVX2 version that wants to widen to `__m256i` results - it's non-obvious how to do that efficiently for `__m256i` inputs. With `__m128i` inputs you should just widen each input to 8x 32-bit with `vpmovzxwd` (`_mm256_cvtepu16_epi32`) and use `_mm256_madd_epi16`.)
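For that `__m128i`-input case, a minimal sketch (assuming AVX2 is available; the function name is my own) could be:

```c
#include <immintrin.h>  // AVX2 intrinsics

// Sketch: zero-extend both 8x int16_t inputs to 8x 32-bit lanes, then vpmaddwd
// computes a_i*b_i + 0*0 in each lane, i.e. the full signed 32-bit product.
static inline __m256i mul16_widen_avx2(__m128i a, __m128i b)
{
    __m256i a32 = _mm256_cvtepu16_epi32(a);  // each lane holds { a_i, 0 } as 16-bit halves
    __m256i b32 = _mm256_cvtepu16_epi32(b);
    return _mm256_madd_epi16(a32, b32);      // { a0*b0, a1*b1, ..., a7*b7 }
}
```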
x86's only widening 16x16 -> 32-bit SIMD multiply is `pmaddwd`, which horizontally adds pairs of products. So it's not directly usable, but if one of the two products is zero, then you have the 32-bit product of two signed 16-bit integers within that dword element.
You could unpack (with zero-extension) both inputs (4 shuffles) to feed 2 `pmaddwd` instructions. Unpacking with zero-extension is cheaper for the high half, since `_mm_unpackhi_epi16` against `_mm_setzero_si128()` is cheaper than manually doing sign extension with an arithmetic right shift, or doing extra work to feed the high half to `_mm_cvtepi16_epi32` (`pmovsxwd`).
You'd only need or want sign extension if you were using `pmulld` (32x32 -> 32-bit multiply), but you want to avoid that since it's slower on Intel CPUs (2 uops per instruction). https://uops.info/
Multiplying 16-bit signed integers zero-extended to 32-bit with `pmaddwd` is doing `a0*b0 + 0*0` and so on in each element. If you sign-extended the inputs, you could be doing `a0*b0 + -1*-1`, which is why you need zero-extension if you're going to do this optimization.
If unpacking to odd/even works for you, that's more efficient since `0 x garbage = 0`, so we only need 2 instructions (plus some register copies) to prepare 2 pairs of inputs, and none of them are shuffles. (Shifts within 32-bit words like `psrld` would require shifting both inputs to keep the multiplicands lined up correctly.)
```c
__m128i a, b; // inputs

__m128i even_mask = _mm_set1_epi32(0x0000FFFF);
__m128i a_even = _mm_and_si128(a, even_mask);
__m128i b_odd  = _mm_andnot_si128(even_mask, b);

// only need to mask one input to each multiply: 0 x garbage = 0
__m128i prod_even = _mm_madd_epi16(a_even, b);  // { a0*b0, a2*b2, a4*b4, a6*b6 }
__m128i prod_odd  = _mm_madd_epi16(a, b_odd);   // { a1*b1, a3*b3, a5*b5, a7*b7 }
```
We could instead mask `a` both ways; either way ends up needing the same number of `movdqa` instructions to copy a register, at least in a function that is allowed to destroy the input registers holding `a` and `b`. It does have the fun advantage that whichever input was ready last (with out-of-order exec), the other one has had time to get masked already and be ready to feed one of the multiplies. This is more interesting on old CPUs with only 1/clock throughput for `pmaddwd` (e.g. Intel before Skylake; https://uops.info/).
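That both-ways masking of `a` would be a sketch along these lines (reusing `even_mask`, `a`, and `b` from the snippet above):

```c
__m128i a_even = _mm_and_si128(a, even_mask);     // keep a0, a2, a4, a6; zero the odd lanes
__m128i a_odd  = _mm_andnot_si128(even_mask, a);  // keep a1, a3, a5, a7; zero the even lanes
__m128i prod_even = _mm_madd_epi16(a_even, b);    // { a0*b0, a2*b2, a4*b4, a6*b6 }
__m128i prod_odd  = _mm_madd_epi16(a_odd,  b);    // { a1*b1, a3*b3, a5*b5, a7*b7 }
```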
That odd/even product might be a better starting point for what you want, since it only requires 2 more shuffles to interleave `prod_even` and `prod_odd` into `prod_low` and `prod_high`:

```c
__m128i prod_low  = _mm_unpacklo_epi32(prod_even, prod_odd);  // { a0*b0, a1*b1, a2*b2, a3*b3 }
__m128i prod_high = _mm_unpackhi_epi32(prod_even, prod_odd);
```
This is 2 bitwise ANDs, 2x `pmaddwd`, and 2 dword unpacks (`punpckldq` / `punpckhdq`).
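Collected into one self-contained function (the function name is my own, not from the answer), the even/odd-then-interleave approach might look like this sketch:

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Sketch: widening multiply via pmaddwd with one masked input per product,
// then interleave the even/odd products back into element order.
static inline void mul16_widen_pmaddwd(__m128i a, __m128i b, __m128i *c0, __m128i *c1)
{
    __m128i even_mask = _mm_set1_epi32(0x0000FFFF);
    __m128i a_even = _mm_and_si128(a, even_mask);
    __m128i b_odd  = _mm_andnot_si128(even_mask, b);
    __m128i prod_even = _mm_madd_epi16(a_even, b);  // { a0*b0, a2*b2, a4*b4, a6*b6 }
    __m128i prod_odd  = _mm_madd_epi16(a, b_odd);   // { a1*b1, a3*b3, a5*b5, a7*b7 }
    *c0 = _mm_unpacklo_epi32(prod_even, prod_odd);  // { a0*b0, a1*b1, a2*b2, a3*b3 }
    *c1 = _mm_unpackhi_epi32(prod_even, prod_odd);  // { a4*b4, a5*b5, a6*b6, a7*b7 }
}
```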
Unpacking both inputs first would have cost 2x `pmovzxwd`, 2x `punpckhwd`, and 2x `pmaddwd`, so all 4 of the non-multiply instructions would be shuffles. If the surrounding code also needs any shuffles, the lower pressure on port 5 is welcome on Intel CPUs with their limited shuffle throughput (especially before Ice Lake).
Using `PAND` we need a vector constant, but the 4x shuffle way only needs a zeroed register. I didn't look at the asm to see whether one way needs more `movdqa` vector register copies; the 4x shuffle way might have the advantage there thanks to `pmovzxwd` being a copy-and-shuffle.

The 4x shuffle way looks like this, and is probably more efficient unless the surrounding code also uses a lot of shuffles that would create a bottleneck on shuffle execution-unit throughput:
```c
__m128i a, b; // inputs

__m128i a_lo = _mm_cvtepu16_epi32(a);
__m128i b_lo = _mm_cvtepu16_epi32(b);
__m128i a_hi = _mm_unpackhi_epi16(a, _mm_setzero_si128()); // unpack(zero, a) would work equally well but compile less efficiently
__m128i b_hi = _mm_unpackhi_epi16(b, _mm_setzero_si128());

__m128i prod_lo = _mm_madd_epi16(a_lo, b_lo); // { a0*b0, a1*b1, a2*b2, a3*b3 }
__m128i prod_hi = _mm_madd_epi16(a_hi, b_hi);
```
I put them on Godbolt inside functions. That doesn't test how they'd inline into loops, e.g. the compilers can overwrite the constant they load. But even with that, the 4x shuffle way is fewer instructions. Clang "optimizes" the and/andnot to 2x `pblendw`, which is 1 uop for port 5 only on Intel CPUs before Ice Lake. That might still be better than loading a constant for one call, unlike in a loop.

But anyway, the 4x shuffle way has no extra `movdqa` register-copy instructions, so it costs less front-end bandwidth than the other way that distributes the real work across different ports.
@chtz suggested `prod_odd = _mm_sub_epi32(_mm_madd_epi16(a, b), prod_even);` instead of `pandn` before the multiplies. That has longer critical-path latency, but does save 2 `movdqa` register-copy instructions when we have to compile without AVX. https://godbolt.org/z/7Gxn67d8h
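As a sketch, that suggestion slots into the even/odd version like this (variable names reused from the snippets above):

```c
// @chtz's variant: derive the odd products by subtraction instead of a pandn mask.
__m128i even_mask = _mm_set1_epi32(0x0000FFFF);
__m128i prod_even = _mm_madd_epi16(_mm_and_si128(a, even_mask), b);  // { a0*b0, a2*b2, ... }
__m128i prod_all  = _mm_madd_epi16(a, b);                            // { a0*b0 + a1*b1, ... }
__m128i prod_odd  = _mm_sub_epi32(prod_all, prod_even);              // { a1*b1, a3*b3, ... }
```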