Question

I plan to use _mm512_popcnt_epi64() to get an __m512i vector containing eight 64-bit values. I need to add those values in a pairwise fashion to get any of the following:

an __m512i vector containing four 128-bit values
an __m256i vector containing four 64-bit values
an __m128i vector containing four 32-bit values

Is there a good way to do this on Zen4?

Answer 1

Score: 1

__m128i _mm512_cvtepi64_epi16(__m512i a); (vpmovqw) will narrow 64-bit elements to 16-bit, which loses nothing here because each popcount is at most 64. From there you can horizontally add pairs with _mm_madd_epi16(v, _mm_set1_epi16(0x0001)) (pmaddwd), which leaves four 32-bit sums in an __m128i, or with shift / add / AND, or shift / zero-masked add.
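A minimal sketch of the vpmovqw + pmaddwd route, assuming GCC or Clang on x86-64. The function names, the array-based wrapper, the runtime CPU check, and the scalar fallback are all hypothetical additions so the snippet can run on machines without AVX-512; only the two intrinsics in pair_sums_epi32 come from the text above.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

/* Assumes every 64-bit element is a popcount (<= 64), so truncating
   to 16 bits is lossless and the signed multiply-add cannot overflow. */
__attribute__((target("avx512f")))
static __m128i pair_sums_epi32(__m512i counts)
{
    __m128i w = _mm512_cvtepi64_epi16(counts);    /* vpmovqw: 8 x u64 -> 8 x u16 */
    return _mm_madd_epi16(w, _mm_set1_epi16(1));  /* pmaddwd: w[2i] + w[2i+1] as 32-bit */
}

__attribute__((target("avx512f")))
static void pair_sums_u32_avx512(const uint64_t in[8], uint32_t out[4])
{
    __m128i sums = pair_sums_epi32(_mm512_loadu_si512(in));
    _mm_storeu_si128((__m128i *)out, sums);
}
#endif

/* Hypothetical wrapper: uses the AVX-512 path when the CPU supports it,
   otherwise a plain scalar loop that computes the same pairwise sums. */
void pair_sums_u32(const uint64_t in[8], uint32_t out[4])
{
    for (int i = 0; i < 8; i++)
        assert(in[i] <= 64);  /* debug check of the popcount-range assumption */

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512f")) {
        pair_sums_u32_avx512(in, out);
        return;
    }
#endif
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)(in[2 * i] + in[2 * i + 1]);
}
```

In real code you would feed pair_sums_epi32 the result of _mm512_popcnt_epi64() directly and keep everything in registers; the array wrapper is only there to make the sketch easy to try.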

Narrowing to less than 512-bit as a first step is good for Zen4, since most 512-bit operations take extra cycles in the execution units (worse throughput and latency).

If you actually wanted a __m512i, you'd just shuffle within lanes for a zero-masked vpaddq. A __m256i could instead start with vpmovqd to narrow only by half, setting up for _mm256_srli_epi64(v, 32) and _mm256_maskz_add_epi32(0x55, shifted, v).
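Both wider variants might look roughly like this, again assuming GCC or Clang on x86-64. The function names, the wrapper, and the scalar fallback are hypothetical. The in-lane shuffle chosen here (vpshufd with _MM_PERM_BADC, which swaps the two 64-bit halves of each 128-bit lane) is one possible choice, not necessarily the one the answer had in mind.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

/* Four 64-bit sums in a __m256i. Assumes each input fits in 32 bits
   (true for popcounts), since vpmovqd truncates. */
__attribute__((target("avx512f,avx512vl")))
static __m256i pair_sums_epi64_256(__m512i counts)
{
    __m256i v = _mm512_cvtepi64_epi32(counts);        /* vpmovqd: 8 x u64 -> 8 x u32 */
    __m256i shifted = _mm256_srli_epi64(v, 32);       /* odd dword moves down to even slot */
    return _mm256_maskz_add_epi32(0x55, shifted, v);  /* keep even dwords, zero odd ones */
}

/* Four 128-bit sums in a __m512i: swap qwords inside each 128-bit lane,
   then add with a zero-mask that keeps only the low qword of each lane. */
__attribute__((target("avx512f")))
static __m512i pair_sums_epi128_512(__m512i counts)
{
    __m512i swapped = _mm512_shuffle_epi32(counts, _MM_PERM_BADC);
    return _mm512_maskz_add_epi64(0x55, counts, swapped);
}

__attribute__((target("avx512f,avx512vl")))
static void pair_sums_wide_avx512(const uint64_t in[8],
                                  uint64_t out256[4], uint64_t out512[8])
{
    __m512i counts = _mm512_loadu_si512(in);
    _mm256_storeu_si256((__m256i *)out256, pair_sums_epi64_256(counts));
    _mm512_storeu_si512(out512, pair_sums_epi128_512(counts));
}
#endif

/* Hypothetical wrapper with a scalar fallback for CPUs without AVX-512. */
void pair_sums_wide(const uint64_t in[8], uint64_t out256[4], uint64_t out512[8])
{
    for (int i = 0; i < 8; i++)
        assert(in[i] <= 64);  /* debug check of the popcount-range assumption */

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512vl")) {
        pair_sums_wide_avx512(in, out256, out512);
        return;
    }
#endif
    for (int i = 0; i < 4; i++) {
        uint64_t s = in[2 * i] + in[2 * i + 1];
        out256[i] = s;
        out512[2 * i] = s;      /* low 64 bits of the 128-bit element */
        out512[2 * i + 1] = 0;  /* high 64 bits */
    }
}
```

The 0x55 mask means different things in the two functions: it selects the even 32-bit elements in the YMM version and the even 64-bit elements in the ZMM version.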

Mask register setup apparently sucks on Zen 4, with kmovb k, r32 costing 2 uops alone (https://uops.info), so if this isn't in a loop where the mask setup can be hoisted, you might want to just use a vector constant for vpand. Or shift left then right, like srli( add(v, slli(v, 32)), 32). But once you have a mask in a mask register, using it is fine: vpaddd with zero-masking is 4/clock throughput on XMM/YMM registers, with 1 cycle latency for zero-masking (or 2 cycles for one of the inputs in merge-masking).
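The two mask-free alternatives for the __m256i result could be sketched as below, assuming GCC or Clang on x86-64. The function names, the wrapper, and the scalar fallback are hypothetical. No performance claim is attached to either variant; measure on your own hardware.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

/* shift / add / AND: the vector constant replaces the mask register. */
__attribute__((target("avx512f")))
static __m256i pair_sums_and(__m512i counts)
{
    __m256i v = _mm512_cvtepi64_epi32(counts);                    /* vpmovqd */
    __m256i sum = _mm256_add_epi32(v, _mm256_srli_epi64(v, 32));  /* low dword = lo + hi */
    return _mm256_and_si256(sum, _mm256_set1_epi64x(0xFFFFFFFF)); /* clear high dword */
}

/* srli(add(v, slli(v, 32)), 32): no constant and no mask at all. */
__attribute__((target("avx512f")))
static __m256i pair_sums_shift(__m512i counts)
{
    __m256i v = _mm512_cvtepi64_epi32(counts);                    /* vpmovqd */
    __m256i sum = _mm256_add_epi32(v, _mm256_slli_epi64(v, 32));  /* high dword = lo + hi */
    return _mm256_srli_epi64(sum, 32);                            /* move it down, zero-extend */
}

__attribute__((target("avx512f")))
static void pair_sums_nomask_avx512(const uint64_t in[8],
                                    uint64_t out_and[4], uint64_t out_shift[4])
{
    __m512i counts = _mm512_loadu_si512(in);
    _mm256_storeu_si256((__m256i *)out_and, pair_sums_and(counts));
    _mm256_storeu_si256((__m256i *)out_shift, pair_sums_shift(counts));
}
#endif

/* Hypothetical wrapper with a scalar fallback for CPUs without AVX-512. */
void pair_sums_nomask(const uint64_t in[8], uint64_t out_and[4], uint64_t out_shift[4])
{
    for (int i = 0; i < 8; i++)
        assert(in[i] <= 64);  /* debug check of the popcount-range assumption */

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512f")) {
        pair_sums_nomask_avx512(in, out_and, out_shift);
        return;
    }
#endif
    for (int i = 0; i < 4; i++)
        out_and[i] = out_shift[i] = in[2 * i] + in[2 * i + 1];
}
```

Both functions return the same four 64-bit sums as the zero-masked vpaddd version, so they are drop-in replacements when setting up a mask register is not worth it.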
