How to multiply-accumulate unsigned bytes into 32-bit elements without overflow with RISC-V extension "V" SIMD vectors?

Question
I am writing vector code with RISC-V intrinsics for extension V vectors, but this question probably applies to vectorisation generally.
I need to multiply and accumulate lots of uint8 values. To do this I want to fill the vector registers with uint8s, multiply and accumulate (MAC) in a loop, done. However, to avoid overflow, the result of the accumulation would normally have to be stored in a larger type, e.g. uint32. How does this extend to vectors?
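To make the overflow concern concrete, here is a minimal scalar sketch (my own illustration, not code from the question) comparing a narrow and a wide accumulator:

```c
#include <stdint.h>
#include <stddef.h>

/* MAC with a deliberately narrow accumulator: a single uint8 product can
   be as large as 255 * 255 = 65025, so just two max-value products
   (130050) already wrap a 16-bit accumulator. */
uint16_t mac_u16(const uint8_t *a, const uint8_t *b, size_t n) {
    uint16_t s = 0;
    for (size_t i = 0; i < n; i++)
        s = (uint16_t)(s + a[i] * b[i]); /* wraps modulo 65536 */
    return s;
}

/* MAC with a 32-bit accumulator: exact for tens of thousands of
   max-value uint8 products. */
uint32_t mac_u32(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += (uint32_t)a[i] * b[i];
    return s;
}
```

For two 255*255 products, mac_u32 returns the exact 130050 while mac_u16 wraps to 64514.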
I imagine I have to split the vector registers into 32-bit lanes and accumulate into them, but writing vectorised code is new to me. Is there a way I can split the vector registers into 8-bit lanes for better parallelism, and still avoid the overflow?
A problem arises because I fill a vector register by providing a pointer to an array of uint8:

vuint8m1_t vec_u8s = __riscv_vle8_v_u8m1(ptr_a, vl);
but if I were to replace this with...
vuint32m1_t vec_u8s_in_32bit_lanes = __riscv_vle32_v_u32m1(ptr_a, vl);
It may read from my array as 32-bit values, reading 4 (uint8) elements into one (uint32) lane. Is my understanding correct? How should I avoid this?
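That understanding can be checked with a plain-C sketch (my own illustration) of what a single 32-bit lane would contain when 8-bit data is loaded at a 32-bit element width:

```c
#include <stdint.h>
#include <string.h>

/* What one 32-bit lane sees when byte data is loaded as 32-bit
   elements: four consecutive bytes packed into one value
   (little-endian byte order, as on RISC-V). */
uint32_t lane_from_bytes(const uint8_t *p) {
    uint32_t lane;
    memcpy(&lane, p, sizeof lane);
    return lane;
}
```

For the bytes {1, 2, 3, 4}, the lane holds 0x04030201 rather than four separate zero-extended elements.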
Is it ok because ptr_a is defined as uint8_t *ptr_a?
Edit:

Perhaps what I'm looking for is
vint32m1_t __riscv_vlse32_v_i32m1_m(vbool32_t mask, const int32_t *base, ptrdiff_t bstride, size_t vl);

where I can set the mask to 0xFF and the stride to 1 to read data at 1-byte increments?
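A note on that idea: the mask argument is a per-element vbool mask, not a byte pattern, and per the RVV spec a strided load reads element i as a full 32-bit value from base + i*bstride, so a 1-byte stride produces overlapping lanes rather than one zero-extended byte per lane. A plain-C simulation of that addressing (my own sketch, not the intrinsic itself):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Simulate the addressing of a 32-bit strided load: element i is a
   full 32-bit read from base + i*bstride.  With bstride = 1 the
   lanes overlap instead of holding one byte each. */
void simulate_vlse32(uint32_t *lanes, const uint8_t *base,
                     ptrdiff_t bstride, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        memcpy(&lanes[i], base + i * bstride, 4);
}
```

On the bytes {1, 2, 3, 4, 5, ...} with bstride = 1, lane 0 holds 0x04030201 and lane 1 holds 0x05040302 (little-endian), which is not the per-byte widening the edit is after.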
Answer 1

Score: 1
The answer was to extend the width of the vector elements using the appropriate v{s;z}ext intrinsic, then use a reinterpret intrinsic on the result to "cast" its values.
Below is an example of a function and its vectorized equivalent, accounting for width/type changes.
Big thanks to Peter Cordes for helping me figure it out!
int byte_mac(unsigned char a[], unsigned char b[], int len) {
    int sum = 0;
    for (int i = 0; i < len; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

int byte_mac_vec(unsigned char *a, unsigned char *b, int len) {
    size_t vlmax = __riscv_vsetvlmax_e8m1();
    vint32m4_t vec_s = __riscv_vmv_v_x_i32m4(0, vlmax);
    vint32m1_t vec_zero = __riscv_vmv_v_x_i32m1(0, vlmax);
    int k = len;
    for (size_t vl; k > 0; k -= vl, a += vl, b += vl) {
        vl = __riscv_vsetvl_e8m1(k);
        // Load a vl-element chunk of each byte array.
        vuint8m1_t a8s = __riscv_vle8_v_u8m1(a, vl);
        vuint8m1_t b8s = __riscv_vle8_v_u8m1(b, vl);
        // Zero-extend each u8 lane to a u32 lane (LMUL grows m1 -> m4).
        vuint32m4_t a8s_extended = __riscv_vzext_vf4_u32m4(a8s, vl);
        vuint32m4_t b8s_extended = __riscv_vzext_vf4_u32m4(b8s, vl);
        // Reinterpret as signed to match the i32 accumulator type.
        vint32m4_t a8s_as_i32 = __riscv_vreinterpret_v_u32m4_i32m4(a8s_extended);
        vint32m4_t b8s_as_i32 = __riscv_vreinterpret_v_u32m4_i32m4(b8s_extended);
        // Tail-undisturbed MAC so lanes beyond vl keep earlier partial sums.
        vec_s = __riscv_vmacc_vv_i32m4_tu(vec_s, a8s_as_i32, b8s_as_i32, vl);
    }
    // Horizontal reduction of all partial sums into element 0.
    vint32m1_t vec_sum = __riscv_vredsum_vs_i32m4_i32m1(vec_s, vec_zero, __riscv_vsetvl_e32m4(len));
    int sum = __riscv_vmv_x_s_i32m1_i32(vec_sum);
    return sum;
}