Intel store instructions on deliberately overlapping memory regions


Question


I have to store the lower 3 doubles of a YMM register into an unaligned double array of size 3 (that is, I cannot write the 4th element). But being a bit naughty, I'm wondering if the AVX intrinsic _mm256_storeu2_m128d can do the trick. I had

    reg = _mm256_permute4x64_pd(reg, 0b10010100); // [0 1 1 2]
    _mm256_storeu2_m128d(vec, vec + 1, reg);

and compiling with clang gives

    vmovupd xmmword ptr [rsi + 8], xmm1 # reg in ymm1 after perm
    vextractf128    xmmword ptr [rsi], ymm0, 1

If storeu2 had semantics like memcpy then it would most definitely trigger undefined behavior. But with the generated instructions, would this be free of race conditions (or other potential problems)?

Other ways to store a YMM register into a size-3 array are welcome as well.

Answer 1

Score: 2


There isn't really a formal spec for Intel's intrinsics, AFAIK, other than what Intel has published as documentation, e.g. their intrinsics guide, plus examples from their whitepapers and so on. Examples that need to keep working are one way GCC/clang know they have to define __m128 with __attribute__((may_alias)).
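
For reference, the vector typedefs in GCC/clang's headers look roughly like this (paraphrased, not the exact header text):

    /* Roughly how xmmintrin.h declares __m128: a 16-byte float vector that
       is allowed to alias other types, so intrinsic loads/stores through it
       don't violate strict aliasing. */
    typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));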

It's all within one thread, fully synchronous, so definitely no "race condition". In your case it doesn't even matter which order the stores happen in (assuming they don't overlap with the __m256d reg object itself! That would be the equivalent of an overlapping-memcpy problem). What you're doing might be like two indeterminately sequenced memcpy calls to overlapping destinations: they definitely happen in one order or the other, and the compiler could pick either.

The observable difference for order of stores is performance: if you want to do a SIMD reload very soon after, then store forwarding will work better if the 16-byte reload takes its data from one 16-byte store, not the overlap of two stores.

In general, though, overlapping stores are fine for performance; the store buffer will absorb them. It does mean one of them is unaligned, and crossing a cache-line boundary would be more expensive.


However, that's all moot: Intel's intrinsics guide does list an "operation" section for that compound intrinsic:

> Operation
>
> MEM[loaddr+127:loaddr] := a[127:0]
> MEM[hiaddr+127:hiaddr] := a[255:128]

So it's strictly defined as the low-address store first (that's the second arg; I think you got the argument order backwards).
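
If you did want to keep the storeu2 approach after the [0 1 1 2] permute, the two pointer args just need to be swapped so the low half lands at vec and the high half at vec + 1. A minimal sketch, assuming Intel's documented argument order (hiaddr, loaddr, a):

    // reg = [0 1 1 2] after the permute in the question.
    // Low 128 bits {0,1} -> vec[0..1], high 128 bits {1,2} -> vec[1..2];
    // nothing past vec[2] is touched.
    _mm256_storeu2_m128d(vec + 1, vec, reg);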


And all of that is also moot because there's a more efficient way

Your way costs 1 lane-crossing shuffle + vmovups + vextractf128 [mem], ymm, 1. Depending on how it compiles, neither store can start until after the shuffle. (Although it looks like clang might have avoided that problem).

On Intel CPUs, vextractf128 [mem], ymm, imm costs 2 uops for the front-end, not micro-fused into one. (Also 2 uops on Zen for some reason.)

On AMD CPUs before Zen 2, lane-crossing shuffles are more than 1 uop, so _mm256_permute4x64_pd is more expensive than necessary.

You just want to store the low lane of the input vector, and the low element of the high lane. The cheapest shuffle is vextractf128 xmm, ymm, 1 - 1 uop / 1c latency on Zen (which splits YMM vectors into two 128-bit halves anyway). It's as cheap as any other lane-crossing shuffle on Intel.

The asm you want the compiler to make is probably this, which only requires AVX1. AVX2 doesn't have any useful instructions for this.

    vextractf128  xmm1, ymm0, 1            ; single uop everywhere
    vmovupd       [rdi], xmm0              ; single uop everywhere
    vmovsd        [rdi+2*8], xmm1          ; single uop everywhere

So you want something like this, which should compile efficiently.

    _mm_storeu_pd(vec, _mm256_castpd256_pd128(reg));  // low half, unaligned 16-byte store
    __m128d hi = _mm256_extractf128_pd(reg, 1);       // high half
    _mm_store_sd(vec+2, hi);                          // low element of the high half
    // or    vec[2] = _mm_cvtsd_f64(hi);

vmovlps (_mm_storel_pi) would also work, but with AVX VEX encoding it doesn't save any code size, and would require even more casting to keep compilers happy.
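
Reusing the hi variable from the snippet above, a sketch of what that alternative would look like (the casts are the noisy part, since _mm_storel_pi takes __m64* and __m128):

    // Same effect as the vmovsd store above, but compiles to a vmovlps store.
    // The casts only exist to satisfy the intrinsic's float-vector types.
    _mm_storel_pi((__m64 *)(vec + 2), _mm_castpd_ps(hi));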

There's unfortunately no vpextractq [mem], ymm; it only exists with an XMM source, so that doesn't help.


Masked store:

As discussed in comments, yes you could do vmaskmovps but it's unfortunately not as efficient as we might like on all CPUs. Until AVX512 makes masked loads/stores first-class citizens, it may be best to shuffle and do 2 stores. Or pad your array / struct so you can at least temporarily step on later stuff.
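
A minimal sketch of that option with the pd variant's intrinsic, _mm256_maskstore_pd (only elements whose mask has the sign bit set are written, so nothing past vec[2] is touched):

    // Store only the low 3 doubles of the original (unpermuted) vector;
    // no shuffle needed. The mask constant can be hoisted out of a loop and reused.
    const __m256i mask = _mm256_setr_epi64x(-1, -1, -1, 0);
    _mm256_maskstore_pd(vec, mask, reg);    // vmaskmovpd [mem], ymm, ymm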

Zen has 2-uop vmaskmovpd ymm loads, but very expensive vmaskmovpd stores (42 uops, 1 per 11 cycles for YMM). Or Zen+ and Zen2 are 18 or 19 uops, 6 cycle throughput. If you care at all about Zen, avoid vmaskmov.

On Intel Broadwell and earlier, vmaskmov stores are 4 uops according to Agner Fog's testing, so that's 1 more fused-domain uop than we get from shuffle + movups + movsd. But still, Haswell and later manage 1/clock throughput, so if that's a bottleneck then it beats the 2-cycle throughput of 2 stores. SnB/IvB of course take 2 cycles for a 256-bit store, even without masking.

On Skylake, vmaskmov mem, ymm, ymm is only 3 uops (Agner Fog lists 4, but his spreadsheets are hand-edited and have been wrong before; I think it's safe to assume uops.info's automated testing is right). And that makes sense: Skylake-client is basically the same core as Skylake-AVX512, just without AVX512 actually enabled, so they could implement vmaskmovpd by decoding it into a test into a mask register (1 uop) + a masked store (2 more uops without micro-fusion).

So if you only care about Skylake and later, and can amortize the cost of loading a mask into a vector register (reusable for loads and stores), vmaskmovpd is actually pretty good. Same front-end cost but cheaper in the back-end: only 1 store-address and 1 store-data uop, instead of 2 of each for 2 separate stores. Note the 1/clock throughput on Haswell and later vs. the 2-cycle throughput for doing 2 separate stores.

vmaskmovpd might even store-forward efficiently to a masked reload; I think Intel mentioned something about this in their optimization manual.
