2023年6月1日 03:27:15go评论163阅读模式

英文:

Transpose 4x4 int32 matrix using NEON

问题

如何高效地转置一个以四个int32x4t值表示的矩阵？我不能使用ld4q_s32和st4q_s32。

英文:

How can I efficiently transpose a matrix represented as four int32x4t values? I cannot use ld4q_s32 and st4q_s32.

答案1

得分: 2

以下是您要翻译的内容：

Transposing is typically implemented in the recursive form:

   [A B]' == [A' C']
   [C D]     [B' D']

where each A,B,C,D can be though of being a square matrix (and A' == A, for 1x1 matrices).

Using transpose will unfortunately help you only on the level that needs to transpose the lowest 2x2 matrices in parallel -- the ARM64 instruction set lacks an instruction to transpose the highest level, i.e. 64 bits at a time, which was done on arm-v7 as swp d2, d0.

There's anyway another iterative formula for transpose: it's repeatedly unzipping or zipping the input.

in octave/matlab language we define the zip operator as one, that takes the first N/2 elements of an array a and interleaves them with the last N/2 elements of an array.

  zip = @(a) [a(1:end/2);a(end/2+1:end)](:)';
  zip(0:15) %--> 0  8  1  9  2 10  3 11  4 12  5 13  6 14  7 15
  zip(ans)  %--> 0  4  8 12  1  5  9 13  2  6 10 14  3  7 11 15

Thus the complete code using intrinsics could be something like:

inline int32x4x4_t zip(int32x4x4_t a) {
   return { vzip1q_s32(a.val[0], a.val[2]),
            vzip2q_s32(a.val[0], a.val[2]),
            vzip1q_s32(a.val[1], a.val[3]),
            vzip2q_s32(a.val[1], a.val[3]) };
}

int32x4x4_t transpose(int32x4x4_t a) {
    return zip(zip(a));
}

On some architectures one can also utilise vtbl, if one has 8 extra registers to spare (4 for the indices and 4 for the output)

uint8x16x4_t transpose_vtbl(uint8x16x4_t a) {
   return {
      vqtbl4q_u8(a, idx0),
      vqtbl4q_u8(a, idx1),
      vqtbl4q_u8(a, idx2),
      vqtbl4q_u8(a, idx3)};
}

Here the int32x4x4_t input needs to be cast element-wise with vreinterpretq_u8_s32(input.val[i]) and back, with idx0 having the values of 0 1 2 3 16 17 18 19 32 33 34 35 48 49 50 51, idx1 == idx0 + 4 and so on.

On M1 or M2 VTBL seems to instruction level parallelise, contrary to eg. a moderately recent cortex A75 (v8.2), where the vtbl takes 3N+1 clock cycles with no dual-issue capabilities.

英文:

Transposing is typically implemented in the recursive form:

   [A B]&#39; == [A&#39; C&#39;]
   [C D]     [B&#39; D&#39;]

where each A,B,C,D can be though of being a square matrix (and A' == A, for 1x1 matrices).

There's anyway another iterative formula for transpose: it's repeatedly unzipping or zipping the input.

in octave/matlab language we define the zip operator as one, that takes the first N/2 elements of an array a and interleaves them with the last N/2 elements of an array.

  zip = @(a) [a(1:end/2);a(end/2+1:end)](:)&#39;;
  zip(0:15) %--&gt; 0  8  1  9  2 10  3 11  4 12  5 13  6 14  7 15
  zip(ans)  %--&gt; 0  4  8 12  1  5  9 13  2  6 10 14  3  7 11 15

Thus the complete code using intrinsics could be something like:

inline int32x4x4_t zip(int32x4x4_t a) {
   return { vzip1q_s32(a.val[0], a.val[2]),
            vzip2q_s32(a.val[0], a.val[2]),
            vzip1q_s32(a.val[1], a.val[3]),
            vzip2q_s32(a.val[1], a.val[3]) };
}

int32x4x4_t transpose(int32x4x4_t a) {
    return zip(zip(a));
}

On some architectures one can also utilise vtbl, if one has 8 extra registers to spare (4 for the indices and 4 for the output)

uint8x16x4_t transpose_vtbl(uint8x16x4_t a) {
   return {
      vqtbl4q_u8(a, idx0),
      vqtbl4q_u8(a, idx1),
      vqtbl4q_u8(a, idx2),
      vqtbl4q_u8(a, idx3)};
}

On M1 or M2 VTBL seems to instruction level parallelise, contrary to eg. a moderately recent cortex A75 (v8.2), where the vtbl takes 3N+1 clock cycles with no dual-issue capabilities.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用NEON转置4×4的int32矩阵。

问题

答案1

为什么gcc在寄存器可用时在堆栈上使用变量？

如何在使用 YASM 汇编时更改 64 位 Linux 终端颜色？

在Mac上使用VS Code运行汇编代码。

使用汇编语言来确定当前存在哪种类型的 CPU。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论