英文:
Faster bitwise AND operation on byte slices
问题
我想对一个字节矩阵的每一列执行按位与操作,该矩阵以 [][]byte
的形式存储在 golang 中。我创建了一个包含可运行测试代码的 repo。
可以简化为对两个等长字节切片进行按位与操作。最简单的方法是使用 for
循环处理每一对字节。
func and(x, y []byte) []byte {
z := make([]byte, len(x))
for i := 0; i < len(x); i++ {
z[i] = x[i] & y[i]
}
return z
}
然而,对于较长的切片,这种方法非常慢。更快的方法是展开 for
循环(查看基准测试结果)
BenchmarkLoop-16 14467 84265 ns/op
BenchmarkUnrollLoop-16 17668 67550 ns/op
还有更快的方法吗?使用 Go 汇编?
提前谢谢你。
英文:
I'd like to perform bitwise AND on every column of a byte matrix, which is stored in a [][]byte
in golang. I created a repo with runnable test code.
It can be simplified as a bitwise AND operation on two byte-slice of equal length. The simplest way is using for
loop to handle every pair of bytes.
func and(x, y []byte) []byte {
z := make([]byte, lenght(x))
for i:= 0; i < len(x); i++ {
z[i] = x[i] & y[i]
}
return z
}
However, it's very slow for long slices. A faster way is to unroll the for
loop (check the benchmark result)
BenchmarkLoop-16 14467 84265 ns/op
BenchmarkUnrollLoop-16 17668 67550 ns/op
Any faster way? Go assembly?
Thank you in advance.
答案1
得分: 2
我在学习Go汇编两天后,使用AVX2指令编写了一个go汇编实现。
性能很好,比简单循环版本快了10倍。但仍需要进行兼容性和性能优化。欢迎提出建议和PR。
注意:代码和基准测试结果已更新。
我感谢@PeterCordes提供的许多宝贵建议。
#include "textflag.h"
// func AND(x []byte, y []byte)
// Requires: AVX
TEXT ·AND(SB), NOSPLIT|NOPTR, $0-48
// x的指针
MOVQ x_base+0(FP), AX
// x的长度
MOVQ x_len+8(FP), CX
// y的指针
MOVQ y_base+24(FP), DX
// --------------------------------------------
// x的结束地址,不会改变:p + n
MOVQ AX, BX
ADDQ CX, BX
// 循环的结束地址
// n <= 8,跳转到尾部
CMPQ CX, $0x00000008
JLE tail
// n < 16,跳转到loop8
CMPQ CX, $0x00000010
JL loop8_start
// n < 32,跳转到loop16
CMPQ CX, $0x00000020
JL loop16_start
// --------------------------------------------
// 循环32的结束地址
MOVQ BX, CX
SUBQ $0x0000001f, CX
loop32:
// 计算x & y,并将值保存到x
VMOVDQU (AX), Y0
VANDPS (DX), Y0, Y0
VMOVDQU Y0, (AX)
// 移动指针
ADDQ $0x00000020, AX
ADDQ $0x00000020, DX
CMPQ AX, CX
JL loop32
// n <= 8,跳转到尾部
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// n < 16,跳转到loop8
CMPQ CX, $0x00000010
JL loop8_start
// --------------------------------------------
loop16_start:
// 循环16的结束地址
MOVQ BX, CX
SUBQ $0x0000000f, CX
loop16:
// 计算x & y,并将值保存到x
VMOVDQU (AX), X0
VANDPS (DX), X0, X0
VMOVDQU X0, (AX)
// 移动指针
ADDQ $0x00000010, AX
ADDQ $0x00000010, DX
CMPQ AX, CX
JL loop16
// n <= 8,跳转到尾部
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// --------------------------------------------
loop8_start:
// 循环8的结束地址
MOVQ BX, CX
SUBQ $0x00000007, CX
loop8:
// 计算x & y,并将值保存到x
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
// 移动指针
ADDQ $0x00000008, AX
ADDQ $0x00000008, DX
CMPQ AX, CX
JL loop8
// --------------------------------------------
tail:
// 剩余的元素(<=8)
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
RET
基准测试结果:
test 数据大小 时间
------------------- --------- -----------
BenchmarkGrailbio 8.00_B 4.654 ns/op
BenchmarkGoAsm 8.00_B 4.824 ns/op
BenchmarkUnrollLoop 8.00_B 6.851 ns/op
BenchmarkLoop 8.00_B 8.683 ns/op
BenchmarkGrailbio 16.00_B 5.363 ns/op
BenchmarkGoAsm 16.00_B 6.369 ns/op
BenchmarkUnrollLoop 16.00_B 10.47 ns/op
BenchmarkLoop 16.00_B 13.48 ns/op
BenchmarkGoAsm 32.00_B 6.079 ns/op
BenchmarkGrailbio 32.00_B 6.497 ns/op
BenchmarkUnrollLoop 32.00_B 17.46 ns/op
BenchmarkLoop 32.00_B 21.09 ns/op
BenchmarkGoAsm 128.00_B 10.52 ns/op
BenchmarkGrailbio 128.00_B 14.40 ns/op
BenchmarkUnrollLoop 128.00_B 56.97 ns/op
BenchmarkLoop 128.00_B 80.12 ns/op
BenchmarkGoAsm 256.00_B 15.48 ns/op
BenchmarkGrailbio 256.00_B 23.76 ns/op
BenchmarkUnrollLoop 256.00_B 110.8 ns/op
BenchmarkLoop 256.00_B 147.5 ns/op
BenchmarkGoAsm 1.00_KB 47.16 ns/op
BenchmarkGrailbio 1.00_KB 87.75 ns/op
BenchmarkUnrollLoop 1.00_KB 443.1 ns/op
BenchmarkLoop 1.00_KB 540.5 ns/op
BenchmarkGoAsm 16.00_KB 751.6 ns/op
BenchmarkGrailbio 16.00_KB 1342 ns/op
BenchmarkUnrollLoop 16.00_KB 7007 ns/op
BenchmarkLoop 16.00_KB 8623 ns/op
英文:
I write a go assembly implementation using AVX2 instructions after two days of (go) assembly learning.
The performance is good, 10X
of simple loop version. While optimizations for compatibility and performance are still needed. Suggestions and PRs are welcome.
Note: code and benchmark results are updated.
I appreciate @PeterCordes for many valuable suggestions.
#include "textflag.h"
// func AND(x []byte, y []byte)
// Requires: AVX
TEXT ·AND(SB), NOSPLIT|NOPTR, $0-48
// pointer of x
MOVQ x_base+0(FP), AX
// length of x
MOVQ x_len+8(FP), CX
// pointer of y
MOVQ y_base+24(FP), DX
// --------------------------------------------
// end address of x, will not change: p + n
MOVQ AX, BX
ADDQ CX, BX
// end address for loop
// n <= 8, jump to tail
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// n < 32, jump to loop16
CMPQ CX, $0x00000020
JL loop16_start
// --------------------------------------------
// end address for loop32
MOVQ BX, CX
SUBQ $0x0000001f, CX
loop32:
// compute x & y, and save value to x
VMOVDQU (AX), Y0
VANDPS (DX), Y0, Y0
VMOVDQU Y0, (AX)
// move pointer
ADDQ $0x00000020, AX
ADDQ $0x00000020, DX
CMPQ AX, CX
JL loop32
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// --------------------------------------------
loop16_start:
// end address for loop16
MOVQ BX, CX
SUBQ $0x0000000f, CX
loop16:
// compute x & y, and save value to x
VMOVDQU (AX), X0
VANDPS (DX), X0, X0
VMOVDQU X0, (AX)
// move pointer
ADDQ $0x00000010, AX
ADDQ $0x00000010, DX
CMPQ AX, CX
JL loop16
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// --------------------------------------------
loop8_start:
// end address for loop8
MOVQ BX, CX
SUBQ $0x00000007, CX
loop8:
// compute x & y, and save value to x
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
// move pointer
ADDQ $0x00000008, AX
ADDQ $0x00000008, DX
CMPQ AX, CX
JL loop8
// --------------------------------------------
tail:
// left elements (<=8)
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
RET
Benchmark result:
test data-size time
------------------- --------- -----------
BenchmarkGrailbio 8.00_B 4.654 ns/op
BenchmarkGoAsm 8.00_B 4.824 ns/op
BenchmarkUnrollLoop 8.00_B 6.851 ns/op
BenchmarkLoop 8.00_B 8.683 ns/op
BenchmarkGrailbio 16.00_B 5.363 ns/op
BenchmarkGoAsm 16.00_B 6.369 ns/op
BenchmarkUnrollLoop 16.00_B 10.47 ns/op
BenchmarkLoop 16.00_B 13.48 ns/op
BenchmarkGoAsm 32.00_B 6.079 ns/op
BenchmarkGrailbio 32.00_B 6.497 ns/op
BenchmarkUnrollLoop 32.00_B 17.46 ns/op
BenchmarkLoop 32.00_B 21.09 ns/op
BenchmarkGoAsm 128.00_B 10.52 ns/op
BenchmarkGrailbio 128.00_B 14.40 ns/op
BenchmarkUnrollLoop 128.00_B 56.97 ns/op
BenchmarkLoop 128.00_B 80.12 ns/op
BenchmarkGoAsm 256.00_B 15.48 ns/op
BenchmarkGrailbio 256.00_B 23.76 ns/op
BenchmarkUnrollLoop 256.00_B 110.8 ns/op
BenchmarkLoop 256.00_B 147.5 ns/op
BenchmarkGoAsm 1.00_KB 47.16 ns/op
BenchmarkGrailbio 1.00_KB 87.75 ns/op
BenchmarkUnrollLoop 1.00_KB 443.1 ns/op
BenchmarkLoop 1.00_KB 540.5 ns/op
BenchmarkGoAsm 16.00_KB 751.6 ns/op
BenchmarkGrailbio 16.00_KB 1342 ns/op
BenchmarkUnrollLoop 16.00_KB 7007 ns/op
BenchmarkLoop 16.00_KB 8623 ns/op
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论