SIMD位重新排列的12位整数数组

huangapple go评论84阅读模式
英文:

SIMD bit reordering of packed 12-bit integer array

问题

我有一个紧密排列的大型12位整数数组,遵循以下重复的位压缩模式:(其中A*n*/B*n*中的*n*表示位编号,A和B是数组中的前两个12位整数

| byte0 | byte1 | byte2 | 等等..
| A11 A10 A9 A8 A7 A6 A5 A4 | B11 B10 B9 B8 B7 B6 B5 B4 | B3 B2 B1 B0 A3 A2 A1 A0 | 等等..


我将其重新排列成以下模式:

| byte0 | byte1 | byte2 | 等等..
| A11 A10 A9 A8 A7 A6 A5 A4 | A3 A2 A1 A0 B11 B10 B9 B8 | B7 B6 B5 B4 B3 B2 B1 B0 | 等等..


我已经在每个3字节循环中使其工作,使用以下代码:
```c
void CSI2toBE12(uint8_t* pCSI2, uint8_t* pBE, uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        pBE[0] = pCSI2[0];
        pBE[1] = ((pCSI2[2] & 0xf) << 4) | (pCSI2[1] >> 4);
        pBE[2] = ((pCSI2[1] & 0xf) << 4) | (pCSI2[2] >> 4);
        
        // 前往下一个12位像素对(3字节)
        pCSI2 += 3;
        pBE += 3;
    }
}

但是以字节为粒度进行处理对性能来说并不是很理想。目标CPU是64位ARM Cortex-A72(树莓派计算模块4)。为了了解背景情况,此代码将MIPI CSI-2位压缩的原始图像数据转换为Adobe DNG的位压缩。

我希望可以通过使用SIMD内联汇编获得显著的性能提升,但我不太确定从哪里开始。我有SIMDe头文件来转换内联汇编,因此欢迎AVX/AVX2解决方案。


<details>
<summary>英文:</summary>

I&#39;ve got a large tightly packed array of 12-bit integers in the following repeating bit-packing pattern: (where *n* in A*n*/B*n* represents bit number and A and B are the first two 12-bit integers in the array)

| byte0 | byte1 | byte2 | etc..
| A11 A10 A9 A8 A7 A6 A5 A4 | B11 B10 B9 B8 B7 B6 B5 B4 | B3 B2 B1 B0 A3 A2 A1 A0 | etc..


which I&#39;m bit reordering into the following pattern:

| byte0 | byte1 | byte2 | etc..
| A11 A10 A9 A8 A7 A6 A5 A4 | A3 A2 A1 A0 B11 B10 B9 B8 | B7 B6 B5 B4 B3 B2 B1 B0 | etc..


I have got it working in a per 3-byte loop with the following code:

void CSI2toBE12(uint8_t* pCSI2, uint8_t* pBE, uint8_t* pCSI2LineEnd)
{
while (pCSI2 < pCSI2LineEnd) {
pBE[0] = pCSI2[0];
pBE[1] = ((pCSI2[2] & 0xf) << 4) | (pCSI2[1] >> 4);
pBE[2] = ((pCSI2[1] & 0xf) << 4) | (pCSI2[2] >> 4);

	// Go to next 12-bit pixel pair (3 bytes)
	pCSI2 += 3;
	pBE += 3;
}

}

but working with byte granularity isn&#39;t great for performance. The target CPU is a 64-bit ARM Cortex-A72 (Raspberry Pi Compute Module 4). For context, this code converts MIPI CSI-2 bit-packed raw image data to Adobe DNG&#39;s bit-packing.

I&#39;m hoping I can get a considerable performance improvement using SIMD intrinsics but I&#39;m not really sure where to start. I&#39;ve got the SIMDe header to translate intrinsics so AVX/AVX2 solutions are welcome.

</details>


# 答案1
**得分**: 4

NEON的`ld3`指令非常适合这个任务;它可以加载48字节并将它们解压成三个NEON寄存器。然后你只需要进行一些位移和或操作。

我提出了以下解决方案:

```c
void vectorized(const uint8_t* pCSI2, uint8_t* pBE, const uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        uint8x16x3_t in = vld3q_u8(pCSI2);
        uint8x16x3_t out;
        out.val[0] = in.val[0];
        out.val[1] = vorrq_u8(vshlq_n_u8(in.val[2], 4), vshrq_n_u8(in.val[1], 4));
        out.val[2] = vorrq_u8(vshlq_n_u8(in.val[1], 4), vshrq_n_u8(in.val[2], 4));
        vst3q_u8(pBE, out);
        pCSI2 += 48;
        pBE += 48;
    }
}

godbolt上试一试

使用gcc编译器生成的汇编代码看起来符合预期。(有一个mov指令,可以通过更好的寄存器分配来消除,但这相对不重要)。

不幸的是,clang似乎有一个奇怪的未优化之处,它将4位右移分成了3位和1位的移位操作。 我提了一个bug报告

原则上,我们可以通过使用sli(Shift Left and Insert)来稍微提升性能,以有效地将OR操作与其中一个移位合并:

out.val[1] = vsliq_n_u8(vshrq_n_u8(in.val[1], 4), in.val[2], 4);
out.val[2] = vsliq_n_u8(vshrq_n_u8(in.val[2], 4), in.val[1], 4);

但由于它会覆盖源操作数,我们需要额外付出一些mov指令。 godbolt上的示例。clang会更聪明地分配寄存器,只需要一个mov指令。

另一个选项,可能会稍微更快一些,是使用sra(Shift Right and Accumulate),它执行加法而不是插入。 由于相关位已经是零,这会产生相同的效果。 奇怪的是没有sla

out.val[1] = vsraq_n_u8(vshlq_n_u8(in.val[2], 4), in.val[1], 4);
out.val[2] = vsraq_n_u8(vshlq_n_u8(in.val[1], 4), in.val[2], 4);
英文:

The NEON ld3 instruction is ideal for this; it loads 48 bytes and unzips them into three NEON registers. Then you just need a couple of shifts and ORs.

I came up with the following:

void vectorized(const uint8_t* pCSI2, uint8_t* pBE, const uint8_t* pCSI2LineEnd)
{
    while (pCSI2 &lt; pCSI2LineEnd) {
        uint8x16x3_t in = vld3q_u8(pCSI2);
        uint8x16x3_t out;
        out.val[0] = in.val[0];
        out.val[1] = vorrq_u8(vshlq_n_u8(in.val[2], 4), vshrq_n_u8(in.val[1], 4));
        out.val[2] = vorrq_u8(vshlq_n_u8(in.val[1], 4), vshrq_n_u8(in.val[2], 4));
        vst3q_u8(pBE, out);
        pCSI2 += 48;
        pBE += 48;
    }
}

Try on godbolt.

With gcc, the generated assembly looks like what you would expect. (There is one mov that could be eliminated with better register allocation, but that's pretty minor.)

Unfortunately clang has what looks like a bizarre missed optimization, where it breaks the 4-bit right shift into a 3-bit and a 1-bit shift. I filed a bug.

In principle we can do a little better using sli, Shift Left and Insert, to effectively merge the OR with one of the shifts:

out.val[1] = vsliq_n_u8(vshrq_n_u8(in.val[1], 4), in.val[2], 4);
out.val[2] = vsliq_n_u8(vshrq_n_u8(in.val[2], 4), in.val[1], 4);

But since it overwrites its source operand, we pay for it with a couple extra movs. https://godbolt.org/z/TbzEEd1Pn. clang allocates registers more cleverly and only needs one mov.

Another option, which could be slightly faster, is to use sra, Shift Right and Accumulate, which does an add instead of an insert. Since the relevant bits are already zero here, this has the same effect. Oddly there is no sla.

out.val[1] = vsraq_n_u8(vshlq_n_u8(in.val[2], 4), in.val[1], 4);
out.val[2] = vsraq_n_u8(vshlq_n_u8(in.val[1], 4), in.val[2], 4);

答案2

得分: 0

我建议你从一个图表开始。

关于NEON我不能说,所以我会描述如何制作AVX2代码来实现你想要的功能(不过,你应该使用你的目标指令集来实现它;最好不要使用转换器,如果你的目标是创建新的代码)。x64指令集有很好的文档;这里有一个示例,我常用它。

AVX2寄存器有256位,或者32字节。也就是说,它包含了你的24位数据的10个单位。制作一个图表(最好在纸上),画出如果你从内存中读取它,一个256位寄存器将包含哪些位。然后画出你想要在经过转换后在其中得到哪些位。用线连接它们。识别出具有相同相对位置的位块。

然后编写代码来隔离相关的位块(_mm256_and_si256),将它们移动(_mm256_slli_si256,可能是_mm256_bslli_epi128或其他函数),然后将它们组合起来(_mm256_or_si256)。AVX2对于位移操作特别古怪,所以我相信NEON代码会更容易编写。

你的主循环应该包括读取、处理和写入3个寄存器,或者768位。如果你为第一个寄存器制作一个图表,你也许可以类似地实现另外两个。当然,你需要特殊处理循环剩余部分(最后几个数据元素)——对它们使用常规的C代码。

英文:

I suggest you start with a diagram.

I can't say about NEON, so I'll describe how I would make AVX2 code which does what you want (however, you should implement it with your target instruction set; better don't bother with converters, if your goal is to make new code). x64 intrinsics have great documentation; here is an example which I use.

AVX2 registers have 256 bits, or 32 bytes. That is, 10 units of your 24-bit data. Make a diagram (on paper would be best for me): draw which bits would a 256-bit register contain if you read it from memory. Then draw which bits you want to get in it after your transformation. Connect them with lines. Identify blocks of bits which have identical relative positions.

Then write code which isolates relevant blocks of bits (_mm256_and_si256), shifts them around (_mm256_slli_si256, possibly _mm256_bslli_epi128 or others) and combines them (_mm256_or_si256). AVX2 is particularly idiosyncratic about shifts, so I am sure NEON code will be easier to write.

Your main loop should probably contain reading, processing and writing 3 registers, or 768 bits. If you make a diagram for just the first one, you might be able to implement the other two similarly. Of course, you need special treatment for loop leftovers (the last few data elements) — use regular C code for them.

huangapple
  • 本文由 发表于 2023年7月23日 20:15:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76748183.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定