How to multiply-accumulate unsigned bytes into 32-bit elements without overflow with RISC-V extension "V" SIMD vectors?

Question
I am writing vector code with RISC-V intrinsics for extension V vectors, but this question probably applies to vectorisation generally.
I need to multiply and accumulate lots of uint8 values. To do this I want to fill the vector registers with uint8s, multiply and accumulate (MAC) in a loop, done. However, to avoid overflow, the result of the accumulation would normally have to be stored in a larger type, e.g. uint32. How does this extend to vectors?
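To make the overflow concern concrete, here is a minimal scalar sketch (my own illustration, not code from the question) comparing a narrow and a wide accumulator:

```c
#include <stdint.h>
#include <stddef.h>

/* MAC with a deliberately narrow accumulator: a single uint8 product can
   be as large as 255 * 255 = 65025, so just two max-value products
   (130050) already wrap a 16-bit accumulator. */
uint16_t mac_u16(const uint8_t *a, const uint8_t *b, size_t n) {
    uint16_t s = 0;
    for (size_t i = 0; i < n; i++)
        s = (uint16_t)(s + a[i] * b[i]); /* wraps modulo 65536 */
    return s;
}

/* MAC with a 32-bit accumulator: exact for tens of thousands of
   max-value uint8 products. */
uint32_t mac_u32(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += (uint32_t)a[i] * b[i];
    return s;
}
```

For two 255*255 products, mac_u32 returns the exact 130050 while mac_u16 wraps to 64514.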
I imagine I have to split the vector registers into 32-bit lanes and accumulate into them, but writing vectorised code is new to me. Is there a way I can split the vector registers into 8-bit lanes for better parallelism, and still avoid the overflow?
A problem arises because I fill a vector register by providing a pointer to an array of uint8:

vuint8m1_t vec_u8s = __riscv_vle8_v_u8m1(ptr_a, vl);
but if I were to replace this with...
vuint32m1_t vec_u8s_in_32bit_lanes = __riscv_vle32_v_u32m1(ptr_a, vl);
It may read from my array as 32-bit values, reading 4 (uint8) elements into one (uint32) lane. Is my understanding correct? How should I avoid this?
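That understanding can be checked with a plain-C sketch (my own illustration) of what a single 32-bit lane would contain when 8-bit data is loaded at a 32-bit element width:

```c
#include <stdint.h>
#include <string.h>

/* What one 32-bit lane sees when byte data is loaded as 32-bit
   elements: four consecutive bytes packed into one value
   (little-endian byte order, as on RISC-V). */
uint32_t lane_from_bytes(const uint8_t *p) {
    uint32_t lane;
    memcpy(&lane, p, sizeof lane);
    return lane;
}
```

For the bytes {1, 2, 3, 4}, the lane holds 0x04030201 rather than four separate zero-extended elements.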
Is it ok because ptr_a is defined as uint8_t *ptr_a?
Edit:

Perhaps what I'm looking for is
vint32m1_t __riscv_vlse32_v_i32m1_m(vbool32_t mask, const int32_t *base, ptrdiff_t bstride, size_t vl);

where I can set the mask to 0xFF and the stride to 1 to read data at 1-byte increments?
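A note on that idea: the mask argument is a per-element vbool mask, not a byte pattern, and per the RVV spec a strided load reads element i as a full 32-bit value from base + i*bstride, so a 1-byte stride produces overlapping lanes rather than one zero-extended byte per lane. A plain-C simulation of that addressing (my own sketch, not the intrinsic itself):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Simulate the addressing of a 32-bit strided load: element i is a
   full 32-bit read from base + i*bstride.  With bstride = 1 the
   lanes overlap instead of holding one byte each. */
void simulate_vlse32(uint32_t *lanes, const uint8_t *base,
                     ptrdiff_t bstride, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        memcpy(&lanes[i], base + i * bstride, 4);
}
```

On the bytes {1, 2, 3, 4, 5, ...} with bstride = 1, lane 0 holds 0x04030201 and lane 1 holds 0x05040302 (little-endian), which is not the per-byte widening the edit is after.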
Answer 1

Score: 1
The answer was to extend the width of the vector elements using the appropriate v{s;z}ext intrinsic, then use a reinterpret intrinsic on the result to "cast" its values.
Below is an example of a function and its vectorized equivalent, accounting for width/type changes.
Big thanks to Peter Cordes for helping me figure it out!
int byte_mac(unsigned char a[], unsigned char b[], int len) {
    int sum = 0;
    for (int i = 0; i < len; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

int byte_mac_vec(unsigned char *a, unsigned char *b, int len) {
    size_t vlmax = __riscv_vsetvlmax_e8m1();
    vint32m4_t vec_s = __riscv_vmv_v_x_i32m4(0, vlmax);
    vint32m1_t vec_zero = __riscv_vmv_v_x_i32m1(0, vlmax);
    int k = len;
    for (size_t vl; k > 0; k -= vl, a += vl, b += vl) {
        vl = __riscv_vsetvl_e8m1(k);
        // Load a vl-element chunk of each byte array.
        vuint8m1_t a8s = __riscv_vle8_v_u8m1(a, vl);
        vuint8m1_t b8s = __riscv_vle8_v_u8m1(b, vl);
        // Zero-extend each u8 lane to a u32 lane (LMUL grows m1 -> m4).
        vuint32m4_t a8s_extended = __riscv_vzext_vf4_u32m4(a8s, vl);
        vuint32m4_t b8s_extended = __riscv_vzext_vf4_u32m4(b8s, vl);
        // Reinterpret as signed to match the i32 accumulator type.
        vint32m4_t a8s_as_i32 = __riscv_vreinterpret_v_u32m4_i32m4(a8s_extended);
        vint32m4_t b8s_as_i32 = __riscv_vreinterpret_v_u32m4_i32m4(b8s_extended);
        // Tail-undisturbed MAC so lanes beyond vl keep earlier partial sums.
        vec_s = __riscv_vmacc_vv_i32m4_tu(vec_s, a8s_as_i32, b8s_as_i32, vl);
    }
    // Horizontal reduction of all partial sums into element 0.
    vint32m1_t vec_sum = __riscv_vredsum_vs_i32m4_i32m1(vec_s, vec_zero, __riscv_vsetvl_e32m4(len));
    int sum = __riscv_vmv_x_s_i32m1_i32(vec_sum);
    return sum;
}