How to do an AVX shuffle by a variable


Question


I want to implement a fixed lookup-table search using an instruction.

The instruction _mm_shuffle_epi32(table, index) suits my requirement, but it takes an immediate operand.

If I want to use a similar instruction, I would have to extend each int8_t index to an __m128i.

Is there a better way to handle this situation?

I have a lookup table and an index vector:

int32_t table[4] = {t0, t1, t2, t3};
int8_t vindex[100] = { ... };  /* index vector */

The C code is like this:

for (uint32_t i = 0; i < 100; i++) {
    pdata[i * 4] = table[vindex[i] & 3];
    pdata[i * 4 + 1] = table[vindex[i] >> 2];
    pdata[i * 4 + 2] = table[vindex[i] >> 4];
    pdata[i * 4 + 3] = table[vindex[i] >> 6];
}

This matches the behavior of the _mm_shuffle_epi32() instruction.
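
For reference, here is a minimal sketch of the immediate-only form (illustrative only; the function name pick_fixed is made up). The selector must be a compile-time constant, which is exactly the limitation described above:

#include <emmintrin.h>  /* SSE2 */

/* The second argument of _mm_shuffle_epi32 must be an immediate.
 * _MM_SHUFFLE(3, 2, 1, 0) keeps the elements in the order 0, 1, 2, 3;
 * a runtime value such as vindex[i] cannot be passed here. */
static __m128i pick_fixed(__m128i table)
{
    return _mm_shuffle_epi32(table, _MM_SHUFFLE(3, 2, 1, 0));
}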

Answer 1

Score: 3


First, your baseline has some bugs if I'm not mistaken. This is the version I use for benchmarking and validation:

void baseline(
        int* pdata, const int* table, const uint8_t* vindex, ptrdiff_t n)
{
    ptrdiff_t i;
    for(i = 0; i < n; ++i) {
        pdata[i * 4] = table[vindex[i] & 3];
        pdata[i * 4 + 1] = table[(vindex[i] >> 2) & 3];
        pdata[i * 4 + 2] = table[(vindex[i] >> 4) & 3];
        pdata[i * 4 + 3] = table[vindex[i] >> 6];
    }
}
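
For completeness, a small validation harness along these lines can be used to check each vectorized version against this baseline (a sketch; the validate and shuffle_fn names are made up, and it assumes the baseline above is in scope):

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void shuffle_fn(int* pdata, const int* table,
                        const uint8_t* vindex, ptrdiff_t n);

/* Run the scalar baseline and a candidate implementation on random
 * indices and compare the full output buffers. Returns 1 on a match. */
static int validate(shuffle_fn* candidate, ptrdiff_t n)
{
    int table[4] = {10, 20, 30, 40};   /* arbitrary test values */
    uint8_t* vindex = malloc((size_t) n);
    int* ref = malloc((size_t) n * 4 * sizeof *ref);
    int* out = malloc((size_t) n * 4 * sizeof *out);
    ptrdiff_t i;
    int ok;
    for(i = 0; i < n; ++i)
        vindex[i] = (uint8_t) rand();
    baseline(ref, table, vindex, n);
    candidate(out, table, vindex, n);
    ok = memcmp(ref, out, (size_t) n * 4 * sizeof *ref) == 0;
    free(vindex);
    free(ref);
    free(out);
    return ok;
}

For example, validate(shuffle_avx2, 100) should return 1 for each of the implementations below.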

If we only have AVX-1, we need to do some bit-twiddling tricks to get the lookup bits into shape. We can view a 64 bit integer as a packed 4x16 bit vector. What we want is the lowest 2 bits per 16 bit word to be the shuffle for that element. We can do this via a single multiplication because it can act like a broadcast with a variable shift as long as there is no overlap between elements.
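
To see the multiplication trick in isolation, here is a minimal scalar sketch (illustrative only; the constant is the same magic_mul used in the vector version below):

#include <stdint.h>
#include <stdio.h>

/* One multiply places each 2-bit field of the index byte at the bottom
 * of its own 16-bit word; the four shifted copies never overlap. */
int main(void)
{
    const uint64_t magic_mul = 1 | (1 << 14) | (1 << 28) | (1ull << 42);
    uint8_t index = 0xB1;          /* 2-bit fields, low to high: 1, 0, 3, 2 */
    uint64_t spread = index * magic_mul;
    int k;
    for(k = 0; k < 4; ++k)
        /* Only the low 2 bits of each 16-bit word matter to the shuffle */
        printf("element %d selects table[%d]\n", k,
               (int) (spread >> (16 * k) & 3));
    return 0;
}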

We move this to the vector register and spread it further into 128 bit. This will leave us with a couple of set bits in unexpected locations but the shuffle instructions ignore those (and it would be easy to mask them if they didn't).

void shuffle_avx1(
        int* pdata, const int* table, const uint8_t* vindex, ptrdiff_t n)
{
    const uint64_t magic_mul =
        1 | (1 << 14) | (1 << 28) | (1ull << 42);
    const __m128 vtable = _mm_castsi128_ps(
            _mm_loadu_si128((const __m128i*) table));
    ptrdiff_t i;
    for(i = 0; i < n; ++i) {
        __m128i permut = _mm_cvtsi64_si128(vindex[i] * magic_mul);
        __m128 result;
        permut = _mm_cvtepu16_epi32(permut);
        result = _mm_permutevar_ps(vtable, permut);
        _mm_storeu_ps((float*) pdata + i * 4, result);
    }
}

This is about a factor of 2.1 faster than the baseline. If we have AVX2, we can instead use variable shifts; in my tests that runs faster than the baseline by a factor of 2.9.

void shuffle_avx2(
        int* pdata, const int* table, const uint8_t* vindex, ptrdiff_t n)
{
    const __m128i shifts = _mm_set_epi32(6, 4, 2, 0);
    const __m128 vtable = _mm_castsi128_ps(
            _mm_loadu_si128((const __m128i*) table));
    ptrdiff_t i;
    for(i = 0; i < n; ++i) {
        __m128 result;
        __m128i permut = _mm_set1_epi32(vindex[i]);
        permut = _mm_srlv_epi32(permut, shifts);
        result = _mm_permutevar_ps(vtable, permut);
        _mm_storeu_ps((float*) pdata + i * 4, result);
    }
}

And while we are at it, we can do the same with a full 256-bit vector. My tests show that for this we need to take care to align the output to a 32-byte boundary when we can (meaning the output pointer was at least 16-byte aligned to begin with). This increases the speedup to a factor of 4.3. I'm not sure why it didn't scale better; I don't think my Tiger Lake i7-11800H should have any bottlenecks at 256 bit compared to 128 bit. Maybe I'm already memory-bound.

void shuffle_avx2_full(
        int* pdata, const int* table, const uint8_t* vindex, ptrdiff_t n)
{
    const __m256i shifts8 = _mm256_set_epi32(14, 12, 10, 8, 6, 4, 2, 0);
    const __m256 vtable8 = _mm256_castsi256_ps(
            _mm256_broadcastsi128_si256(
            _mm_loadu_si128((const __m128i*) table)));
    const int small_iter = n < 2 ? n : (((ptrdiff_t) pdata & 31) == 16);
    ptrdiff_t i = 0;
    if(small_iter) {
        /*
         * For n == 1, do full output.
         * For larger n, align output to 32 byte boundary if possible
         */
        const __m128i shifts4 = _mm256_castsi256_si128(shifts8);
        const __m128 vtable4 = _mm256_castps256_ps128(vtable8);
        __m128i permut = _mm_set1_epi32(*vindex);
        __m128 result;
        permut = _mm_srlv_epi32(permut, shifts4);
        result = _mm_permutevar_ps(vtable4, permut);
        _mm_storeu_ps((float*) pdata, result);
        i = 1;
    }
    for(; i + 2 <= n; i += 2) {
        __m256i permut = _mm256_set1_epi32(
            *((const uint16_t*) (vindex + i)));
        __m256 result;
        permut = _mm256_srlv_epi32(permut, shifts8);
        result = _mm256_permutevar_ps(vtable8, permut);
        /*
         * This still needs to be an unaligned store even though
         * we tried to align above, we cannot align if the output
         * address is not at least 16 byte-aligned
         */
        _mm256_storeu_ps((float*) pdata + i * 4, result);
    }
    if(i < n) {
        /* One misaligned iteration for tail */
        i = n - 2;
        __m256i permut = _mm256_set1_epi32(
            *((const uint16_t*) (vindex + i)));
        __m256 result;
        permut = _mm256_srlv_epi32(permut, shifts8);
        result = _mm256_permutevar_ps(vtable8, permut);
        _mm256_storeu_ps((float*) pdata + i * 4, result);
    }
}
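
One possible refinement (a side note, not from the answer above): the *((const uint16_t*) (vindex + i)) load is a type-punned, potentially unaligned read of the uint8_t array. A memcpy-based helper expresses the same two-byte load without the aliasing question and typically compiles to the same single instruction:

#include <stdint.h>
#include <string.h>

/* Load two consecutive index bytes as one 16-bit value without
 * type-punning the uint8_t array. */
static uint16_t load_u16(const uint8_t* p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}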

Scaling this up to AVX512 is simple but at least on my system it is slower. YMMV.

void shuffle_avx512(
        int* pdata, const int* table, const uint8_t* vindex, ptrdiff_t n)
{
    const __m512i shifts16 = _mm512_set_epi32(
            30, 28, 26, 24, 22, 20, 18, 16,
            14, 12, 10,  8,  6,  4,  2,  0);
    const __m128i shifts4 = _mm512_castsi512_si128(shifts16);
    const __m512 vtable16 = _mm512_castsi512_ps(
            _mm512_broadcast_i32x4(
            _mm_loadu_si128((const __m128i*) table)));
    const __m128 vtable4 = _mm512_castps512_ps128(vtable16);
    const int small_count = n < 4 ? n : -((ptrdiff_t) pdata >> 4) & 3;
    ptrdiff_t i;
    /* Large n: Align to 64 byte output. Small n: All iterations */
    for(i = 0; i &lt; small_count; ++i) {
        __m128i permut = _mm_set1_epi32(vindex[i]);
        __m128 result;
        permut = _mm_srlv_epi32(permut, shifts4);
        result = _mm_permutevar_ps(vtable4, permut);
        _mm_storeu_ps((float*) pdata + i * 4, result);
    }
    for(; i + 4 < n; i += 4) {
        __m512i permut = _mm512_set1_epi32(*((const int*) (vindex + i)));
        __m512 result;
        permut = _mm512_srlv_epi32(permut, shifts16);
        result = _mm512_permutevar_ps(vtable16, permut);
        /* Output will still be misaligned if pdata isn't 16 byte aligned */
        _mm512_storeu_ps((float*) pdata + i * 4, result);
    }
    if(i &lt; n) {
        /* One overlapping misaligned iteration for tail */
        i = n - 4;
        __m512i permut = _mm512_set1_epi32(*((const int*) (vindex + i)));
        __m512 result;
        permut = _mm512_srlv_epi32(permut, shifts16);
        result = _mm512_permutevar_ps(vtable16, permut);
        _mm512_storeu_ps((float*) pdata + i * 4, result);
    }
}

Side note: Funny enough, without a -march or -mtune setting, GCC decides that the multiplication trick is stupid and uses multiple shifts and ORs instead. That doesn't sound sensible to me though. Which machine has AVX but no fast multiplier?
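
For reference, the shift/OR form is arithmetically equivalent to the multiply, since the four shifted copies never overlap (a scalar sketch of the equivalence, not GCC's actual output):

#include <stdint.h>

/* Same result as index * magic_mul with
 * magic_mul = 1 | (1 << 14) | (1 << 28) | (1ull << 42) */
static uint64_t spread_shift_or(uint8_t index)
{
    uint64_t b = index;
    return b | (b << 14) | (b << 28) | (b << 42);
}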

Posted by huangapple on 2023-08-10 18:25:50. Original link: https://go.coder-hub.com/76874856.html