如何在C++中交错三个AVX寄存器的字节。

huangapple go评论80阅读模式
英文:

How to interleave the bytes of 3 avx registers in c++

问题

#include <immintrin.h>
#include <stdint.h>

int main() {
    uint8_t a[] = {0, 3, 6, 9, 12, 15, 18, 21};
    uint8_t b[] = {1, 4, 7, 10, 13, 16, 19, 22};
    uint8_t c[] = {2, 5, 8, 11, 14, 17, 20, 23};

    auto as = _mm256_loadu_epi8(a);
    auto bs = _mm256_loadu_epi8(b);
    auto cs = _mm256_loadu_epi8(c);
    // uint8_t result[32] = {0,1,2,3,4,5,6,7,8,9,10,11...}

    return 0;
}
英文:

Hi following snipped:

#include &lt;immintrin.h&gt;
#include &lt;stdint.h&gt;

int main() {
    uint8_t a[] = {0, 3, 6, 9, 12, 15, 18, 21};
    uint8_t b[] = {1, 4, 7, 10, 13, 16, 19, 22};
    uint8_t c[] = {2, 5, 8, 11, 14, 17, 20, 23};

    auto as = _mm256_loadu_epi8(a);
    auto bs = _mm256_loadu_epi8(b);
    auto cs = _mm256_loadu_epi8(c);
    // uint8_t result[32] = {0,1,2,3,4,5,6,7,8,9,10,11...}

    return 0;
}

I played a lot with pack/unpack/shuffle/permutate but did not get my desired result.
I want a result that contains all values in increasing order (0, 1, 2, 3...), so basically I want to interleave all three input arrays. Has anybody a solution for that?

答案1

得分: 2

这是基于_mm256_shuffle_epi8的版本。问题是你不能在128位通道之间进行混洗。因此,为了在两个通道中都有所需的信息,您必须执行广播操作。

在这里,我假设可以读取所需值的末尾(每个向量10字节)。根据需要,可以用其他序列替换以进行更谨慎的内存访问。

auto a_lo = _mm_loadu_si128((const __m128i*) a);
auto b_lo = _mm_loadu_si128((const __m128i*) b);
auto c_lo = _mm_loadu_si128((const __m128i*) c);

auto as = _mm256_broadcastsi128_si256(a_lo);
auto bs = _mm256_broadcastsi128_si256(b_lo);
auto cs = _mm256_broadcastsi128_si256(c_lo);

现在我们可以设置掩码,将字节放置在其各自的位置。其他值设置为零。请注意,最后两个字节始终为零,因为32不能被3整除。

auto mask_axx = _mm256_setr_epi8(
        /* 第一通道 */
        0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5,
        /* 第二通道 */
        -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, -1, -1);
auto mask_xbx = _mm256_setr_epi8(
        /* 第一通道 */
        -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1,
        /* 第二通道 */
        5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, -1);
auto mask_xxc = _mm256_setr_epi8(
        /* 第一通道 */
        -1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1,
        /* 第二通道 */
        -1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1);

现在剩下的就是应用混洗,然后将结果进行OR运算。

auto axx = _mm256_shuffle_epi8(as, mask_axx);
auto xbx = _mm256_shuffle_epi8(bs, mask_xbx);
auto xxc = _mm256_shuffle_epi8(cs, mask_xxc);
auto abc = _mm256_or_si256(axx, xbx);
abc = _mm256_or_si256(abc, xxc);

同样,最后两个字节为零。所以在循环中,这将在abc数组中前进10字节,然后在输出数组中前进30字节。

半宽版本,每15字节有5个输出像素的工作方式相同,只是mask_axx的最后一个位置为-1

完整循环

这是一个完整的转换函数。

在某一点上,我使用了从uint8_tuint16_t的隐式转换,因此我们必须小心使用无符号值来处理各个字节,以避免意外的符号扩展。

void interleave_abc(std::uint8_t* out, const std::uint8_t* a,
        const std::uint8_t* b, const  std::uint8_t* c,
        std::ptrdiff_t n)
{

由于我们将不得不处理尾元素,将主要计算包装在lambda中会有所帮助。Lambda会被强烈内联,因此不会有任何开销。

    auto pack256 = [=](std::ptrdiff_t i, __m256i mask_axx,
            __m256i mask_xbx, __m256i mask_xxc) {
        auto a_lo = _mm_loadu_si128((const __m128i*) (a + i));
        auto b_lo = _mm_loadu_si128((const __m128i*) (b + i));
        auto c_lo = _mm_loadu_si128((const __m128i*) (c + i));
        auto as = _mm256_broadcastsi128_si256(a_lo);
        auto bs = _mm256_broadcastsi128_si256(b_lo);
        auto cs = _mm256_broadcastsi128_si256(c_lo);
        auto axx = _mm256_shuffle_epi8(as, mask_axx);
        auto xbx = _mm256_shuffle_epi8(bs, mask_xbx);
        auto xxc = _mm256_shuffle_epi8(cs, mask_xxc);
        auto abc = _mm256_or_si256(axx, xbx);
        return _mm256_or_si256(abc, xxc);
    };

然后我们可以用于主循环。

    const auto mask_axx = _mm256_setr_epi8(
            /* 第一通道 */
            0, -1, -1, 1, -1, -1, 2, -1,
            -1, 3, -1, -1, 4, -1, -1, 5,
            /* 第二通道 */
            -1, -1, 6, -1, -1, 7, -1, -1,
            8, -1, -1, 9, -1, -1, -1, -1);
    const auto mask_xbx = _mm256_setr_epi8(
            /* 第一通道 */
            -1, 0, -1, -1, 1, -1, -1, 2,
            -1, -1, 3, -1, -1, 4, -1, -1,
            5, -1, -1, 6, -1, -1, 7, -1,
            -1, 8, -1,

<details>
<summary>英文:</summary>

Here is a version based on `_mm256_shuffle_epi8`. The issue is that you cannot shuffle across 128 bit lanes. Therefore, to have the required information in both lanes, you have to do a broadcast.

Here I assume that I can read beyond the end of the required values (10 byte per vector). Substitute with other sequences to achieve the same with more careful memory accesses as required.

```c++
auto a_lo = _mm_loadu_si128((const __m128i*) a);
auto b_lo = _mm_loadu_si128((const __m128i*) b);
auto c_lo = _mm_loadu_si128((const __m128i*) c);

auto as = _mm256_broadcastsi128_si256(a_lo);
auto bs = _mm256_broadcastsi128_si256(b_lo);
auto cs = _mm256_broadcastsi128_si256(c_lo);

Now we can set up masks to put bytes into their individual positions. Other values are set to zero. Note that the last two bytes are always zero since 32 isn't divisible by 3.

auto mask_axx = _mm256_setr_epi8(
        /* first lane */
        0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5,
        /* second lane */
        -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, -1, -1);
auto mask_xbx = _mm256_setr_epi8(
        /* first lane */
        -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1,
        /* second lane */
        5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, -1);
auto mask_xxc = _mm256_setr_epi8(
        /* first lane */
        -1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1,
        /* second lane */
        -1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1);

Now all that's left is to apply the shuffle, then OR the results.

auto axx = _mm256_shuffle_epi8(as, mask_axx);
auto xbx = _mm256_shuffle_epi8(bs, mask_xbx);
auto xxc = _mm256_shuffle_epi8(cs, mask_xxc);
auto abc = _mm256_or_si256(axx, xbx);
abc = _mm256_or_si256(abc, xxc);

Again, the last two bytes are zero. So in a loop this would advance 10 bytes in the a, b, and c arrays and then 30 byte in the output array.

A half-width version with 5 output pixels over 15 byte works the same except that the mask_axx is -1 in the last position of its single lane.

Full loop

Here is a full conversion function.

At one point I use an implicit cast from uint8_t to uint16_t, so we have to be careful to use unsigned values for the individual bytes to avoid accidental sign extension.

void interleave_abc(std::uint8_t* out, const std::uint8_t* a,
        const std::uint8_t* b, const  std::uint8_t* c,
        std::ptrdiff_t n)
{

Since we will have to deal with the tail elements, it helps to wrap the main computation in a lambda. Lambdas are very strongly inlined so this will not have any overhead.

    auto pack256 = [=](std::ptrdiff_t i, __m256i mask_axx,
            __m256i mask_xbx, __m256i mask_xxc) {
        auto a_lo = _mm_loadu_si128((const __m128i*) (a + i));
        auto b_lo = _mm_loadu_si128((const __m128i*) (b + i));
        auto c_lo = _mm_loadu_si128((const __m128i*) (c + i));
        auto as = _mm256_broadcastsi128_si256(a_lo);
        auto bs = _mm256_broadcastsi128_si256(b_lo);
        auto cs = _mm256_broadcastsi128_si256(c_lo);
        auto axx = _mm256_shuffle_epi8(as, mask_axx);
        auto xbx = _mm256_shuffle_epi8(bs, mask_xbx);
        auto xxc = _mm256_shuffle_epi8(cs, mask_xxc);
        auto abc = _mm256_or_si256(axx, xbx);
        return _mm256_or_si256(abc, xxc);
    };

Which we can then use for the main loop.

    const auto mask_axx = _mm256_setr_epi8(
            /* first lane */
            0, -1, -1, 1, -1, -1, 2, -1,
            -1, 3, -1, -1, 4, -1, -1, 5,
            /* second lane */
            -1, -1, 6, -1, -1, 7, -1, -1,
            8, -1, -1, 9, -1, -1, -1, -1);
    const auto mask_xbx = _mm256_setr_epi8(
            -1, 0, -1, -1, 1, -1, -1, 2,
            -1, -1, 3, -1, -1, 4, -1, -1,
            5, -1, -1, 6, -1, -1, 7, -1,
            -1, 8, -1, -1, 9, -1, -1, -1);
    const auto mask_xxc = _mm256_setr_epi8(
            -1, -1, 0, -1, -1, 1, -1, -1,
            2, -1, -1, 3, -1, -1, 4, -1,
            -1, 5, -1, -1, 6, -1, -1, 7,
            -1, -1, 8, -1, -1, 9, -1, -1);
    std::ptrdiff_t i;
    for(i = 0; i + 16 &lt;= n; i += 10) {
        auto abc = pack256(i, mask_axx, mask_xbx, mask_xxc);
        _mm256_storeu_si256((__m256i*) (out + 3 * i), abc);
    }

Handling the tail is … bothersome. Using partial loads and stores generates many different cases for what I assume are about 30 CPU cycles when handled in a trivial loop. That isn't worth it in my opinion unless short sequences are a common occurrence and the number of samples can be constant-propagated. Instead, we go for something different. First we start by getting the special case out of the way that we have fewer than 16 bytes per input channel.

    if(n &lt; 16) {
        for(; i &lt; n; ++i) {
            std::uint16_t ab = b[i];
            ab = ab &lt;&lt; 8 | a[i];
            *((std::uint16_t*) (out + 3 * i)) = ab;
            out[3 * i + 2] = c[i];
        }
        return;
    }

For cases where we have enough samples, we can do a similar loop to the main loop, but going backwards from the end and using the tail elements of the vectors. The results will partially overlap with the ones we already computed but the number of cycles used for this (and lines of code) is reasonably low. There are at most two iterations of this loop to deal with up to 15 tail elements.

    const auto tailmask_axx = _mm256_setr_epi8(
            -1, -1, 6, -1, -1, 7, -1, -1,
            8, -1, -1, 9, -1, -1, 10, -1,
            -1, 11, -1, -1, 12, -1, -1, 13,
            -1, -1, 14, -1, -1, 15, -1, -1);
    const auto tailmask_xbx = _mm256_setr_epi8(
            5, -1, -1, 6, -1, -1, 7, -1,
            -1, 8, -1, -1, 9, -1, -1, 10,
            -1, -1, 11, -1, -1, 12, -1, -1,
            13, -1, -1, 14, -1, -1, 15, -1);
    const auto tailmask_xxc = _mm256_setr_epi8(
            -1, 5, -1, -1, 6, -1, -1, 7,
            -1, -1, 8, -1, -1, 9, -1, -1,
            10, -1, -1, 11, -1, -1, 12, -1,
            -1, 13, -1, -1, 14, -1, -1, 15);
    for(std::ptrdiff_t j = n; j &gt;= i; j -= 10) {
        auto abc = pack256(j - 16, tailmask_axx, tailmask_xbx,
                tailmask_xxc);
        _mm256_storeu_si256((__m256i*) (out + 3 * j - 32), abc);
    }
}

huangapple
  • 本文由 发表于 2023年6月22日 18:33:17
  • 转载请务必保留本文链接:https://go.coder-hub.com/76530994.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定