Intel store instructions on deliberately overlapping memory regions


Question


I have to store the lower 3 doubles of a YMM register into an unaligned double array of size 3 (that is, I cannot write the 4th element). But being a bit naughty, I'm wondering if the AVX intrinsic _mm256_storeu2_m128d can do the trick. I had

    reg = _mm256_permute4x64_pd(reg, 0b10010100); // [0 1 1 2]
    _mm256_storeu2_m128d(vec, vec + 1, reg);

and compiling with clang gives

    vmovupd xmmword ptr [rsi + 8], xmm1 # reg in ymm1 after perm
    vextractf128    xmmword ptr [rsi], ymm0, 1

If storeu2 had semantics like memcpy then it would most definitely trigger undefined behavior. But with the generated instructions, would this be free of race conditions (or other potential problems)?

Other ways to store a YMM register into a size-3 array are welcome as well.

Answer 1

Score: 2


There isn't really a formal spec for Intel's intrinsics, AFAIK, other than what Intel has published as documentation, e.g. their intrinsics guide, plus examples from their whitepapers and so on. Examples that need to keep working are one way GCC/clang know they have to define __m128 with __attribute__((may_alias)).
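
For reference, the vector typedefs in GCC/clang's headers look roughly like this (paraphrased, not the exact header text):

    /* Roughly how xmmintrin.h declares __m128: a 16-byte float vector that
       is allowed to alias other types, so intrinsic loads/stores through it
       don't violate strict aliasing. */
    typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));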

It's all within one thread, fully synchronous, so definitely no "race condition". In your case it doesn't even matter which order the stores happen in (assuming they don't overlap with the __m256d reg object itself! That would be the equivalent of an overlapping-memcpy problem). What you're doing might be like two indeterminately sequenced memcpy calls to overlapping destinations: they definitely happen in one order or the other, and the compiler could pick either.

The observable difference for order of stores is performance: if you want to do a SIMD reload very soon after, then store forwarding will work better if the 16-byte reload takes its data from one 16-byte store, not the overlap of two stores.

In general, though, overlapping stores are fine for performance; the store buffer will absorb them. It does mean one of them is unaligned, and crossing a cache-line boundary would be more expensive.


However, that's all moot: Intel's intrinsics guide does list an "operation" section for that compound intrinsic:

> Operation
>
> MEM[loaddr+127:loaddr] := a[127:0]
> MEM[hiaddr+127:hiaddr] := a[255:128]

So it's strictly defined as the low-address store first (that's the second arg; I think you got the argument order backwards).
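
If you did want to keep the storeu2 approach after the [0 1 1 2] permute, the two pointer args just need to be swapped so the low half lands at vec and the high half at vec + 1. A minimal sketch, assuming Intel's documented argument order (hiaddr, loaddr, a):

    // reg = [0 1 1 2] after the permute in the question.
    // Low 128 bits {0,1} -> vec[0..1], high 128 bits {1,2} -> vec[1..2];
    // nothing past vec[2] is touched.
    _mm256_storeu2_m128d(vec + 1, vec, reg);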


And all of that is also moot because there's a more efficient way

Your way costs 1 lane-crossing shuffle + vmovups + vextractf128 [mem], ymm, 1. Depending on how it compiles, neither store can start until after the shuffle. (Although it looks like clang might have avoided that problem).

On Intel CPUs, vextractf128 [mem], ymm, imm costs 2 uops for the front-end, not micro-fused into one. (Also 2 uops on Zen for some reason.)

On AMD CPUs before Zen 2, lane-crossing shuffles are more than 1 uop, so _mm256_permute4x64_pd is more expensive than necessary.

You just want to store the low lane of the input vector, and the low element of the high lane. The cheapest shuffle is vextractf128 xmm, ymm, 1 - 1 uop / 1c latency on Zen (which splits YMM vectors into two 128-bit halves anyway). It's as cheap as any other lane-crossing shuffle on Intel.

The asm you want the compiler to make is probably this, which only requires AVX1. AVX2 doesn't have any useful instructions for this.

    vextractf128  xmm1, ymm0, 1            ; single uop everywhere
    vmovupd       [rdi], xmm0              ; single uop everywhere
    vmovsd        [rdi+2*8], xmm1          ; single uop everywhere

So you want something like this, which should compile efficiently.

    _mm_storeu_pd(vec, _mm256_castpd256_pd128(reg));  // low half, unaligned 16-byte store
    __m128d hi = _mm256_extractf128_pd(reg, 1);       // high half
    _mm_store_sd(vec+2, hi);                          // low element of the high half
    // or    vec[2] = _mm_cvtsd_f64(hi);

vmovlps (_mm_storel_pi) would also work, but with AVX VEX encoding it doesn't save any code size, and would require even more casting to keep compilers happy.
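
Reusing the hi variable from the snippet above, a sketch of what that alternative would look like (the casts are the noisy part, since _mm_storel_pi takes __m64* and __m128):

    // Same effect as the vmovsd store above, but compiles to a vmovlps store.
    // The casts only exist to satisfy the intrinsic's float-vector types.
    _mm_storel_pi((__m64 *)(vec + 2), _mm_castpd_ps(hi));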

There's unfortunately no vpextractq [mem], ymm; it only exists with an XMM source, so that doesn't help.


Masked store:

As discussed in comments, yes you could do vmaskmovps but it's unfortunately not as efficient as we might like on all CPUs. Until AVX512 makes masked loads/stores first-class citizens, it may be best to shuffle and do 2 stores. Or pad your array / struct so you can at least temporarily step on later stuff.
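
A minimal sketch of that option with the pd variant's intrinsic, _mm256_maskstore_pd (only elements whose mask has the sign bit set are written, so nothing past vec[2] is touched):

    // Store only the low 3 doubles of the original (unpermuted) vector;
    // no shuffle needed. The mask constant can be hoisted out of a loop and reused.
    const __m256i mask = _mm256_setr_epi64x(-1, -1, -1, 0);
    _mm256_maskstore_pd(vec, mask, reg);    // vmaskmovpd [mem], ymm, ymm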

Zen has 2-uop vmaskmovpd ymm loads, but very expensive vmaskmovpd stores (42 uops, 1 per 11 cycles for YMM). Or Zen+ and Zen2 are 18 or 19 uops, 6 cycle throughput. If you care at all about Zen, avoid vmaskmov.

On Intel Broadwell and earlier, vmaskmov stores are 4 uops according to Agner Fog's testing, so that's 1 more fused-domain uop than we get from shuffle + movups + movsd. But still, Haswell and later manage 1/clock throughput, so if that's a bottleneck then it beats the 2-cycle throughput of 2 stores. SnB/IvB of course take 2 cycles for a 256-bit store, even without masking.

On Skylake, vmaskmov mem, ymm, ymm is only 3 uops (Agner Fog lists 4, but his spreadsheets are hand-edited and have been wrong before; I think it's safe to assume uops.info's automated testing is right). And that makes sense: Skylake-client is basically the same core as Skylake-AVX512, just without AVX512 actually enabled, so they could implement vmaskmovpd by decoding it into a test into a mask register (1 uop) + a masked store (2 more uops without micro-fusion).

So if you only care about Skylake and later, and can amortize the cost of loading a mask into a vector register (reusable for loads and stores), vmaskmovpd is actually pretty good. Same front-end cost but cheaper in the back-end: only 1 store-address and 1 store-data uop, instead of 2 of each for 2 separate stores. Note the 1/clock throughput on Haswell and later vs. the 2-cycle throughput for doing 2 separate stores.

vmaskmovpd might even store-forward efficiently to a masked reload; I think Intel mentioned something about this in their optimization manual.
