How to multiply-accumulate unsigned bytes into 32-bit elements without overflow with RISC-V extension "V" SIMD vectors?

Question

I am writing vector code with RISC-V intrinsics for extension V vectors, but this question probably applies to vectorisation generally.

I need to multiply and accumulate a large number of uint8 values. My plan is to fill the vector registers with uint8 elements and multiply-accumulate (MAC) in a loop. To avoid overflow, however, the accumulated result would normally have to be stored in a wider type, e.g. uint32. How does this extend to vectors?

I imagine I have to split the vector registers into 32-bit lanes and accumulate into them, but writing vectorised code is new to me. Is there a way I can split the vector registers into 8-bit lanes for better parallelism, and still avoid the overflow?

A problem arises because I fill a vector register by providing a pointer to an array of uint8:

vuint8m1_t vec_u8s = __riscv_vle8_v_u8m1(ptr_a, vl);

but if I were to replace this with...

vuint32m1_t vec_u8s_in_32bit_lanes = __riscv_vle32_v_u32m1(ptr_a, vl);

I suspect it would read my array as 32-bit values, packing four uint8 elements into each uint32 lane. Is my understanding correct? How should I avoid this?

Is it ok because ptr_a is defined as uint8_t * ptr_a ... ?

Edit:

Perhaps what I'm looking for is

vint32m1_t __riscv_vlse32_v_i32m1_m (vbool32_t mask, const int32_t *base, ptrdiff_t bstride, size_t vl);

where I could set the mask to 0xFF and the stride to 1 to read the data in 1-byte increments?

Answer 1

Score: 1

The answer was to widen the vector elements with the appropriate vsext or vzext intrinsic (sign- or zero-extension), then use a reinterpret intrinsic on the result to "cast" its values to the signed element type.
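As a rough scalar illustration of why the widening step matters, here is a small sketch in plain C. The helper names load_packed_lane and load_widened_lanes are made up for this example and are not part of any RVV API. A 32-bit load through a byte pointer packs four uint8 values into one lane, whereas an 8-bit load followed by zero-extension gives each value its own 32-bit lane.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: models a 32-bit element load from a byte buffer.
   Four consecutive bytes end up packed into ONE 32-bit lane. */
uint32_t load_packed_lane(const uint8_t *p) {
    uint32_t lane;
    memcpy(&lane, p, sizeof lane);  /* byte order follows the host endianness */
    return lane;
}

/* Hypothetical helper: models an 8-bit load followed by zero-extension
   (the job vzext.vf4 does). Each byte gets its OWN 32-bit lane, with the
   upper 24 bits cleared. */
void load_widened_lanes(const uint8_t *p, uint32_t *lanes, size_t n) {
    for (size_t i = 0; i < n; i++) {
        lanes[i] = (uint32_t)p[i];
    }
}
```

For the bytes {1, 2, 3, 4}, load_widened_lanes produces the four lanes 1, 2, 3, 4, while load_packed_lane returns a single value that mixes all four bytes, which is not what a per-element multiply-accumulate needs.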

Below is an example of a function and its vectorized equivalent, accounting for width/type changes.

Big thanks to Peter Cordes for helping me figure it out!

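#include <stddef.h>
#include <riscv_vector.h>

// Scalar reference: sum of the element-wise products of two byte arrays.
int byte_mac(unsigned char a[], unsigned char b[], int len) {
  int sum = 0;
  for (int i = 0; i < len; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Vectorized equivalent: load 8-bit elements, widen them to 32 bits, then accumulate.
int byte_mac_vec(unsigned char *a, unsigned char *b, int len) {
  // e8m1 and e32m4 hold the same number of elements, so one vlmax covers both.
  size_t vlmax = __riscv_vsetvlmax_e8m1();
  vint32m4_t vec_s = __riscv_vmv_v_x_i32m4(0, vlmax);     // per-lane accumulators
  vint32m1_t vec_zero = __riscv_vmv_v_x_i32m1(0, vlmax);  // seed for the final reduction
  int k = len;
  for (size_t vl; k > 0; k -= vl, a += vl, b += vl) {
    vl = __riscv_vsetvl_e8m1(k);

    // Load vl bytes from each array, one byte per 8-bit lane.
    vuint8m1_t a8s = __riscv_vle8_v_u8m1(a, vl);
    vuint8m1_t b8s = __riscv_vle8_v_u8m1(b, vl);
    // Zero-extend each byte to 32 bits (LMUL grows from 1 to 4).
    vuint32m4_t a8s_extended = __riscv_vzext_vf4_u32m4(a8s, vl);
    vuint32m4_t b8s_extended = __riscv_vzext_vf4_u32m4(b8s, vl);

    // Reinterpret as signed so the element type matches the int accumulator.
    vint32m4_t a8s_as_i32 = __riscv_vreinterpret_v_u32m4_i32m4(a8s_extended);
    vint32m4_t b8s_as_i32 = __riscv_vreinterpret_v_u32m4_i32m4(b8s_extended);

    // Tail-undisturbed MAC: lanes at or beyond vl keep their running sums.
    vec_s = __riscv_vmacc_vv_i32m4_tu(vec_s, a8s_as_i32, b8s_as_i32, vl);
  }

  // Sum every lane. Lanes that were never written are still zero, and using
  // vlmax avoids relying on the vl that vsetvl picks for an AVL between
  // VLMAX and 2*VLMAX, which may be smaller than VLMAX.
  vint32m1_t vec_sum = __riscv_vredsum_vs_i32m4_i32m1(vec_s, vec_zero, vlmax);
  int sum = __riscv_vmv_x_s_i32m1_i32(vec_sum);

  return sum;
}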
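A possible variation, offered only as a sketch: instead of zero-extending both inputs to 32 bits before multiplying, the multiply can stay narrow by using a widening multiply (u8 x u8 -> u16) followed by a widening add into 32-bit accumulators. The function name byte_mac_widening is hypothetical, the unsigned return type is my own choice, and the code assumes a toolchain that provides the `__riscv_`-prefixed intrinsics. A scalar fallback is included so the same file also builds on targets without the V extension.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Hypothetical name: sum of element-wise products of two byte arrays,
   accumulated in unsigned 32-bit lanes. */
uint32_t byte_mac_widening(const uint8_t *a, const uint8_t *b, size_t len) {
#if defined(__riscv_vector)
    size_t vlmax = __riscv_vsetvlmax_e8m1();            /* same count as e32m4 */
    vuint32m4_t acc = __riscv_vmv_v_x_u32m4(0, vlmax);  /* 32-bit accumulators */
    for (size_t vl; len > 0; len -= vl, a += vl, b += vl) {
        vl = __riscv_vsetvl_e8m1(len);
        vuint8m1_t va = __riscv_vle8_v_u8m1(a, vl);
        vuint8m1_t vb = __riscv_vle8_v_u8m1(b, vl);
        /* u8 x u8 -> u16: 255 * 255 = 65025 always fits in 16 bits. */
        vuint16m2_t prod = __riscv_vwmulu_vv_u16m2(va, vb, vl);
        /* u32 += u16, tail-undisturbed so lanes beyond vl keep their sums. */
        acc = __riscv_vwaddu_wv_u32m4_tu(acc, acc, prod, vl);
    }
    /* Reduce all lanes; lanes never written are still zero. */
    vuint32m1_t zero = __riscv_vmv_v_x_u32m1(0, __riscv_vsetvlmax_e32m1());
    vuint32m1_t red = __riscv_vredsum_vs_u32m4_u32m1(acc, zero, vlmax);
    return __riscv_vmv_x_s_u32m1_u32(red);
#else
    /* Scalar fallback that mirrors the same widening steps. */
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t prod = (uint16_t)(a[i] * b[i]);  /* at most 65025 */
        sum += prod;
    }
    return sum;
#endif
}
```

For a = {1, 2, 3, 4} and b = {5, 6, 7, 8} this returns 70. Whichever variant is used, the final total is still a single 32-bit value: in the worst case (every byte equal to 255) an unsigned 32-bit total stays in range for up to 66051 elements, and the signed int total for up to 33025, so longer inputs would need a 64-bit final sum.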

huangapple
  • Posted on April 4, 2023, 04:53:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75923685.html