AVX512: Best way to combine horizontal sum and broadcast

Question
There is already a question about horizontal sums using AVX512. I'm trying to do something similar, but after the sum, I would like to broadcast the result to all 8 elements in a `__m512d` variable. So far, I have tried:
- Using the Intel-provided macros:

```cpp
double sum = _mm512_reduce_add_pd( mvx );
sumx = _mm512_set1_pd( sum );
```
- Using shuffle/permute, trying to avoid lane crossings as much as possible:
```cpp
// 7 shuffle/add steps: each step adds a freshly permuted copy of the input.
sumx = mvx;
mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);
sumx = _mm512_add_pd(mvx, sumx);
mvx = _mm512_permutex_pd(mvx, _MM_PERM_ABCD);
sumx = _mm512_add_pd(mvx, sumx);
mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);
sumx = _mm512_add_pd(mvx, sumx);
mvx = _mm512_shuffle_f64x2(mvx, mvx, _MM_SHUFFLE(1,0,3,2));
sumx = _mm512_add_pd(mvx, sumx);
mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);
sumx = _mm512_add_pd(mvx, sumx);
mvx = _mm512_permutex_pd(mvx, _MM_PERM_ABCD);
sumx = _mm512_add_pd(mvx, sumx);
mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);
sumx = _mm512_add_pd(mvx, sumx);
```
- Using the hint by @PeterCordes, reducing the add/shuffles to 3:
```cpp
// 3 shuffle/add steps: note each shuffle now reads the updated sumx.
sumx = mvx;
mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);
sumx = _mm512_add_pd(mvx, sumx);
mvx = _mm512_permutex_pd(sumx, _MM_PERM_ABCD);
sumx = _mm512_add_pd(mvx, sumx);
mvx = _mm512_shuffle_f64x2(sumx, sumx, _MM_SHUFFLE(1,0,3,2));
sumx = _mm512_add_pd(mvx, sumx);
```
In each case `mvx` is the `__m512d` input and `sumx` is the `__m512d` output.
I'm benchmarking on an Intel Skylake CPU using the Intel compiler:
- Version 1: 2.17s
- Version 2: 2.31s
- Version 3: 1.96s
Is this the best I can do or do you see another way to optimize this operation?
Answer 1 (score: 2)
Generally the best way is to swap halves instead of narrowing, so the same sum gets computed in both halves. (Especially if you don't care about Zen 4 or hypothetical future CPUs where there's a throughput advantage to narrowing to 256-bit.) It should only take 3 shuffle/add steps to handle 2^3 = 8 doubles in a `__m512d`, one of them being an in-lane shuffle that adds adjacent pairs.
Your third version is correctly doing that, and looks optimal on current CPUs (Intel and Zen 4).
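For reference, here's that 3-step sequence wrapped as a self-contained helper (a minimal sketch; the function name is mine, not from any library), with each step commented:

```cpp
#include <immintrin.h>

// Horizontal sum of all 8 doubles in mvx, broadcast to every element of the
// result (the question's version 3, just wrapped up).
static inline __m512d hsum_broadcast_pd(__m512d mvx) {
    __m512d sumx = mvx;
    // Step 1 (in-lane, 1c latency on Intel): swap odd/even neighbours so
    // every element holds the sum of its 128-bit pair.
    mvx  = _mm512_shuffle_pd(mvx, mvx, 0b01010101);
    sumx = _mm512_add_pd(mvx, sumx);
    // Step 2 (vpermpd-immediate, lane-crossing): reverse the qwords within
    // each 256-bit half; after the add, every element of a half holds that
    // half's total.
    mvx  = _mm512_permutex_pd(sumx, _MM_PERM_ABCD);
    sumx = _mm512_add_pd(mvx, sumx);
    // Step 3 (vshuff64x2): swap the 256-bit halves; the final add leaves the
    // full sum in all 8 elements.
    mvx  = _mm512_shuffle_f64x2(sumx, sumx, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm512_add_pd(mvx, sumx);
}
```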
Doing the lower-latency in-lane shuffle first, like you're doing, is good for out-of-order exec: it lets more uops execute and retire a couple of cycles earlier, making room in the scheduler and ROB for new, possibly independent work sooner.
On current Intel CPUs, all 512-bit lane-crossing shuffles of 32-bit granularity or wider have the same performance: 1 uop for port 5 with 3c latency. And in-lane 512-bit shuffles are 1 uop for port 5 with 1c latency.
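(Putting those numbers on the 3-step version: 1 + 3 + 3 = 7 cycles of shuffle latency on the critical path, plus three `vaddpd` latencies at 4 cycles each on Skylake, so roughly 19 cycles end to end, assuming no other bottlenecks.)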
On Zen 4, `vshufpd` and `vpermilpd` both have the same timings as each other, and so do `vpermpd` / `vshuff64x2` / `vshuff32x4` / `valignq`. (https://uops.info/) All these shuffles have immediate operands for their controls, so the compiler doesn't have to load a vector constant.
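For contrast, here's a sketch of what the immediate controls avoid: the variable-control form of `vpermpd` needs its indices in a register, which usually means a 64-byte constant load (the example indices here swap the 256-bit halves):

```cpp
// Vector-control permute: dst[i] = src[idx[i]]. Same half-swap as the
// immediate-controlled shuffles, but the index vector must be materialized.
__m512i idx   = _mm512_set_epi64(3, 2, 1, 0, 7, 6, 5, 4); // highest element first
__m512d hswap = _mm512_permutexvar_pd(idx, sumx);
```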
Any tweaks would just be based on guess-work about what might be faster on possible future CPUs, like a future Intel E-core with AVX-512 support, or a stripped-down AMD Zen 4 that they use as an E-core in their future CPUs, if they change the execution units at all instead of just cache. Or future big cores which might have room for multiple 512-bit shuffle units. This code has a serial dependency through all 6 operations, but being able to run on more ports might let out-of-order execution do a better job of running this and some independent surrounding code at the same time, or another logical core.
Using the widest-granularity shuffle available has historically been best. e.g. `vextractf64x4 ymm, zmm, imm` is faster on Zen 4 than `vextractf64x2 xmm, zmm, imm`, so prefer the former for extracting the third 128-bit chunk if you don't mind bringing high garbage along with it. Fewer, larger chunks mean fewer possible arrangements and shorter chains of multiplexing, so the shuffle might have lower latency or run on more execution units. But there is no `vshuff64x4`, only `vshuff64x2` with 128-bit chunks, so that's our only good option for swapping 256-bit halves.
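As an illustration (a hypothetical helper, not from the question's code): grabbing the 128-bit chunk holding elements 4..5 via the wide 256-bit extract compiles to a single `vextractf64x4`, with the cast costing nothing:

```cpp
// Extract elements 4..5 of v as a __m128d. The 256-bit extract is the
// faster instruction on Zen 4; elements 6..7 ride along as high garbage
// in the ymm and are simply ignored by the cast.
static inline __m128d extract_third_chunk(__m512d v) {
    __m256d hi = _mm512_extractf64x4_pd(v, 1);   // elements 4..7
    return _mm256_castpd256_pd128(hi);           // keep elements 4..5
}
```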
If tuning specifically for Zen 4 without caring about Intel, `vextractf64x4` + `vinsertf64x4` is lower total latency than `vshuff64x2 zmm`, although it costs 2 uops instead of 1 for the front-end. Other than insert/extract of 256-bit halves, 512-bit shuffles on Zen 4 occupy their execution unit for 2 cycles (actual throughput of 1/clock is half what you'd expect from being one uop for either of two ports, like how Zen 4 handles other 512-bit uops that don't need to move data between halves).
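A sketch of that Zen-4-leaning alternative for the final half-swap (2 front-end uops instead of 1, but lower latency there, per the above):

```cpp
// Swap 256-bit halves with vextractf64x4 + vinsertf64x4 instead of vshuff64x2.
__m256d lo = _mm512_castpd512_pd256(sumx);        // low half (no instruction)
__m256d hi = _mm512_extractf64x4_pd(sumx, 1);     // high half
__m512d swapped = _mm512_insertf64x4(_mm512_castpd256_pd512(hi), lo, 1);
sumx = _mm512_add_pd(swapped, sumx);              // full sum in all 8 elements
```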
For the middle shuffle, swapping 128-bit chunks within each 256-bit half, the choice is between `vshuff64x2 z,z,z,imm8` and `vpermpd z,z,imm8`. Both run identically on current CPUs including Zen 4. We might choose `vshuff64x2` based on the wider granularity (moving data around in 128-bit chunks instead of 64-bit), but there's another factor to consider: `vpermpd z,z,imm8` does independent 256-bit shuffles in each half, so it decomposes trivially for 256-bit execution units (unlike the vector-control version, which has eight 3-bit indices to pick from across the whole vector).
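In intrinsics, the two candidates look like this; in this reduction they produce identical results, since both elements of each 128-bit chunk are already equal after the first add:

```cpp
// vpermpd z,z,imm8: independent 256-bit shuffle per half, easy to split
// for 256-bit execution units.
mvx = _mm512_permutex_pd(sumx, _MM_PERM_ABCD);
// vshuff64x2 z,z,z,imm8: same effect here, expressed in 128-bit chunks
// (swap chunks 0<->1 and 2<->3).
mvx = _mm512_shuffle_f64x2(sumx, sumx, _MM_SHUFFLE(2, 3, 0, 1));
```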
Zen 4 has shuffle execution units that are essentially 512 bits wide, so 512-bit shuffles just cost some throughput and extra latency. But possible future Intel E-cores with AVX-512 might not do that, and might run `vshuff64x2 z,z,z,imm8` as slowly as Zen 1 ran `vperm2f128 y,y,y,imm8` (8 uops, although that seems excessive) or `vpermps y,y,y` (3 uops). Intel E-cores in Alder Lake manage to handle those as 2 uops each, so presumably an E-core supporting AVX-512 with 256-bit execution units could also handle `vshuff64x2 z,z,z,imm8` as 2 uops.
`vshuff64x2 z,z,z,imm8` takes 1024 bits of input (since it can take two different input vectors), but the first 2 output lanes are selected from the first input (so only 4 possible input bits have to route through muxes to each output bit), and same for the second two output lanes coming from the second source. So it could be decomposed into two separate 512-bit-input / 256-bit-output shuffles, like 256-bit `valignq ymm,ymm,ymm, imm` or like `vperm2f128 ymm, ymm, ymm, imm` but with each output lane being able to select any of the four. (`valignq zmm` is actually another possibility for the final shuffle, but less likely to be cheap.)
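For completeness, the `valignq zmm` form of the final half-swap would look like this (it's an integer shuffle, hence the casts, which generate no instructions):

```cpp
// Rotate the 8 qwords by 4 positions == swap the 256-bit halves.
__m512i bits  = _mm512_castpd_si512(sumx);
__m512d hswap = _mm512_castsi512_pd(_mm512_alignr_epi64(bits, bits, 4));
sumx = _mm512_add_pd(hswap, sumx);
```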
So `vshuff64x2 zmm` is actually designed in a way that probably makes it cheaper to implement with narrower execution units than you might think, much easier than `valignq` or `vpermt2ps` or other 2-input shuffles where each output can pick from anywhere in both 512-bit inputs.
One might guess that a one-input shuffle `_mm512_permute_pd(mvx, 0b01'01'01'01)` (aka `vpermilpd z,z,imm`) might be more efficient on some future CPU than your `vshufpd z,z,z,imm` with the same input twice. That's actually true on Knight's Landing (Xeon Phi), but I assume you don't care about that since it's been discontinued for a few years, and I didn't look at timings of `vpermpd` vs. `vshuff64x2` on it.
But on Ice Lake, the more common `vshufpd y,y,y,i` has 2/clock throughput vs. 1/clock for `vpermilpd y,y,i` (see footnote 1). So who can guess which shuffles will be faster on future E-cores with AVX-512, or on future big cores where there might be room for multiple 512-bit shuffle units.
Summary:

- `vshufpd` is fine for the first shuffle. Even if the vector started in memory, you wouldn't want a memory-source `vpermilpd`, since you need another copy of the vector as an input for `vaddpd`. It could go either way on future E-cores handling one or the other more cheaply; it's an in-lane shuffle, so it decomposes trivially into multiple narrower shuffles for E-cores.
- `vpermpd`-immediate is a good choice for the middle shuffle (swapping 128-bit pairs); it's likely that future E-cores can handle it efficiently (as two independent 256-bit halves). `vshuff64x2` can decompose into two separate 512-bit-input / 256-bit-output shuffles, though, so it's not bad either. `vpermpd` with a vector control operand doesn't decompose as easily, but it's a different opcode, so hopefully the immediate-control version would still be cheap even if the vector-control version is slower. And somehow Alder Lake E-cores do manage to run `vpermps ymm` as 2 uops.
- `vshuff64x2` or `valignq` are equally good for swapping 256-bit halves on Intel CPUs, and equal to each other on Zen 4. `vshuff64x2` is clearly easier for E-cores to implement efficiently: both have the same amount of input (1024 bits), but `vshuff64x2` has significantly fewer possible sources for any given bit of output (4 vs. 16, and with more restrictions on which source feeds which output if the two sources aren't the same register). Also, it's probably a more commonly used shuffle, so architects are more likely to spend transistors to make it not too slow. `vextractf64x4` + `vinsertf64x4` would be lower latency on Zen 4, which might or might not matter depending on surrounding code. But `vshuff64x2 zmm` is still single-uop on Zen 4 with only 4-cycle latency, like other 512-bit lane-crossing shuffles. Hypothetical smaller cores with AVX-512 might run it as 2 or more uops.
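To sanity-check whichever variant you settle on, here's a tiny harness (it assumes the `hsum_broadcast_pd` sketch from earlier in this answer):

```cpp
// build: g++ -O2 -mavx512f
#include <cstdio>
#include <immintrin.h>

int main() {
    __m512d v = _mm512_setr_pd(1, 2, 3, 4, 5, 6, 7, 8);
    double out[8];
    _mm512_storeu_pd(out, hsum_broadcast_pd(v));
    for (int i = 0; i < 8; i++)
        std::printf("%g ", out[i]);   // expect "36" eight times
    std::printf("\n");
}
```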
Footnote 1: IDK why Ice Lake / Alder Lake can't just decode `vpermilpd` with a register source and immediate control into a `vshufpd` uop that reads the same input twice, since the same immediate bits will produce the same shuffle in that case. Seems like a missed optimization, although maybe it would have a cost somewhere in the decoders for producing a uop with 1 input for the memory-source version vs. 2 inputs for a register-source version. So instead, change the shuffle execution unit to replicate one input in that case, as a way to have port 1 handle `vpermilpd` uops, making it not special to handle memory sources this way. At a cost of having to handle more different control inputs on the port 1 input of the shuffle unit?
On Ice Lake / Alder Lake, the port 1 execution unit that can handle some but not all 128-bit and 256-bit shuffles (when there are no 512-bit uops in flight) is probably just half of the 512-bit shuffle execution unit that's normally accessible from port 5. (Same way they handle 256-bit FP math instructions on port 0 or 1, but have it work as a single 512-bit FMA unit when port 1 is shut down.) So the lanes of the shuffle unit can handle `vpermilpd` when it's the upper half of a `vpermilpd zmm, zmm, imm8` on port 5, so it seems like it would require minimal extra logic to be able to do the same when accessed via port 1. (`vpermilpd zmm` and `vshufpd zmm` use the upper 4 bits of their immediates the same way as each other, and the same as how the low 4 bits work for the low half. Each 128-bit lane has 2 bits of control input.)
I wonder if it's intentional, to make sure `vpermilpd/ps` can't steal cycles from FP math ops (ports 0 and 1 for 256-bit). That could make sense, and is maybe even useful for people tuning a loop that bottlenecks on p01 throughput vs. shuffle throughput: they can use `vshufpd y, same,same, i` to let it run on port 1 or 5, or just for smaller machine-code size (2-byte VEX). Or `vpermilpd y, ymm/mem, i` to restrict it to port 5, at the cost of an extra byte of machine-code size if `vshufpd` didn't already need a 3-byte VEX. (Or a whole separate instruction if it was shuffling a memory source. But like many instructions with an immediate operand, Intel CPUs can't micro-fuse the load+ALU uop, so the cost in issue bandwidth is the same.)
That seems unlikely. Maybe they just analyzed existing code and found `shufpd` / `vshufpd` was more common and thus important; unsurprising since `shufpd` is SSE2 but `vpermilpd` didn't exist until AVX1. So that factor may be what affected this design, which is relevant for choosing YMM shuffles, even though both `vshufpd ymm` and `vpermilpd` were new with AVX1.
But guessing about the future: Intel Gracemont E-cores in Alder Lake have identical performance for `vpermilpd ymm, ymm, i8` vs. `vshufpd ymm, ymm, ymm, i8`.