在字节切片上更快的按位与操作

huangapple go评论130阅读模式
英文:

Faster bitwise AND operation on byte slices

问题

我想对一个字节矩阵的每一列执行按位与操作,该矩阵以 [][]byte 的形式存储在 golang 中。我创建了一个包含可运行测试代码的 repo

可以简化为对两个等长字节切片进行按位与操作。最简单的方法是使用 for 循环处理每一对字节。

func and(x, y []byte) []byte {
    z := make([]byte, len(x))
    for i := 0; i < len(x); i++ {
        z[i] = x[i] & y[i]
    }
    return z
}

然而,对于较长的切片,这种方法非常慢。更快的方法是展开 for 循环(查看基准测试结果

BenchmarkLoop-16                   14467             84265 ns/op
BenchmarkUnrollLoop-16             17668             67550 ns/op

还有更快的方法吗?使用 Go 汇编?

提前谢谢你。

英文:

I'd like to perform bitwise AND on every column of a byte matrix, which is stored in a [][]byte in golang. I created a repo with runnable test code.

It can be simplified as a bitwise AND operation on two byte-slice of equal length. The simplest way is using for loop to handle every pair of bytes.

func and(x, y []byte) []byte {
    z := make([]byte, lenght(x))
    for i:= 0; i &lt; len(x); i++ {
        z[i] = x[i] &amp; y[i]
    }
    return z
}

However, it's very slow for long slices. A faster way is to unroll the for loop (check the benchmark result)

BenchmarkLoop-16                   14467             84265 ns/op
BenchmarkUnrollLoop-16             17668             67550 ns/op

Any faster way? Go assembly?

Thank you in advance.

答案1

得分: 2

我在学习Go汇编两天后,使用AVX2指令编写了一个go汇编实现

性能很好,比简单循环版本快了10倍。但仍需要进行兼容性和性能优化。欢迎提出建议和PR。

注意:代码和基准测试结果已更新。

我感谢@PeterCordes提供的许多宝贵建议。

#include &quot;textflag.h&quot;

// func AND(x []byte, y []byte)
// Requires: AVX
TEXT &#183;AND(SB), NOSPLIT|NOPTR, $0-48
	// x的指针
	MOVQ x_base+0(FP), AX

	// x的长度
	MOVQ x_len+8(FP), CX

	// y的指针
	MOVQ y_base+24(FP), DX

	// --------------------------------------------
	// x的结束地址,不会改变:p + n
	MOVQ AX, BX
	ADDQ CX, BX

	// 循环的结束地址
	// n <= 8,跳转到尾部
	CMPQ CX, $0x00000008
	JLE  tail

	// n < 16,跳转到loop8
	CMPQ CX, $0x00000010
	JL   loop8_start

	// n < 32,跳转到loop16
	CMPQ CX, $0x00000020
	JL   loop16_start

	// --------------------------------------------
	// 循环32的结束地址
	MOVQ BX, CX
	SUBQ $0x0000001f, CX

loop32:
	// 计算x & y,并将值保存到x
	VMOVDQU (AX), Y0
	VANDPS  (DX), Y0, Y0
	VMOVDQU Y0, (AX)

	// 移动指针
	ADDQ $0x00000020, AX
	ADDQ $0x00000020, DX
	CMPQ AX, CX
	JL   loop32

	// n <= 8,跳转到尾部
	MOVQ BX, CX
	SUBQ AX, CX
	CMPQ CX, $0x00000008
	JLE  tail

	// n < 16,跳转到loop8
	CMPQ CX, $0x00000010
	JL   loop8_start

	// --------------------------------------------
loop16_start:
	// 循环16的结束地址
	MOVQ BX, CX
	SUBQ $0x0000000f, CX

loop16:
	// 计算x & y,并将值保存到x
	VMOVDQU (AX), X0
	VANDPS  (DX), X0, X0
	VMOVDQU X0, (AX)

	// 移动指针
	ADDQ $0x00000010, AX
	ADDQ $0x00000010, DX
	CMPQ AX, CX
	JL   loop16

	// n <= 8,跳转到尾部
	MOVQ BX, CX
	SUBQ AX, CX
	CMPQ CX, $0x00000008
	JLE  tail

	// --------------------------------------------
loop8_start:
	// 循环8的结束地址
	MOVQ BX, CX
	SUBQ $0x00000007, CX

loop8:
	// 计算x & y,并将值保存到x
	MOVQ (AX), BX
	ANDQ (DX), BX
	MOVQ BX, (AX)

	// 移动指针
	ADDQ $0x00000008, AX
	ADDQ $0x00000008, DX
	CMPQ AX, CX
	JL   loop8

	// --------------------------------------------
tail:
	// 剩余的元素(<=8)
	MOVQ (AX), BX
	ANDQ (DX), BX
	MOVQ BX, (AX)
	RET

基准测试结果:

test                       数据大小        时间
-------------------        ---------        -----------
BenchmarkGrailbio          8.00_B           4.654 ns/op
BenchmarkGoAsm             8.00_B           4.824 ns/op
BenchmarkUnrollLoop        8.00_B           6.851 ns/op
BenchmarkLoop              8.00_B           8.683 ns/op

BenchmarkGrailbio          16.00_B          5.363 ns/op
BenchmarkGoAsm             16.00_B          6.369 ns/op
BenchmarkUnrollLoop        16.00_B          10.47 ns/op
BenchmarkLoop              16.00_B          13.48 ns/op

BenchmarkGoAsm             32.00_B          6.079 ns/op
BenchmarkGrailbio          32.00_B          6.497 ns/op
BenchmarkUnrollLoop        32.00_B          17.46 ns/op
BenchmarkLoop              32.00_B          21.09 ns/op

BenchmarkGoAsm             128.00_B         10.52 ns/op
BenchmarkGrailbio          128.00_B         14.40 ns/op
BenchmarkUnrollLoop        128.00_B         56.97 ns/op
BenchmarkLoop              128.00_B         80.12 ns/op

BenchmarkGoAsm             256.00_B         15.48 ns/op
BenchmarkGrailbio          256.00_B         23.76 ns/op
BenchmarkUnrollLoop        256.00_B         110.8 ns/op
BenchmarkLoop              256.00_B         147.5 ns/op

BenchmarkGoAsm             1.00_KB          47.16 ns/op
BenchmarkGrailbio          1.00_KB          87.75 ns/op
BenchmarkUnrollLoop        1.00_KB          443.1 ns/op
BenchmarkLoop              1.00_KB          540.5 ns/op

BenchmarkGoAsm             16.00_KB         751.6 ns/op
BenchmarkGrailbio          16.00_KB         1342 ns/op
BenchmarkUnrollLoop        16.00_KB         7007 ns/op
BenchmarkLoop              16.00_KB         8623 ns/op
英文:

I write a go assembly implementation using AVX2 instructions after two days of (go) assembly learning.

The performance is good, 10X of simple loop version. While optimizations for compatibility and performance are still needed. Suggestions and PRs are welcome.

Note: code and benchmark results are updated.

I appreciate @PeterCordes for many valuable suggestions.

#include &quot;textflag.h&quot;
// func AND(x []byte, y []byte)
// Requires: AVX
TEXT &#183;AND(SB), NOSPLIT|NOPTR, $0-48
// pointer of x
MOVQ x_base+0(FP), AX
// length of x
MOVQ x_len+8(FP), CX
// pointer of y
MOVQ y_base+24(FP), DX
// --------------------------------------------
// end address of x, will not change: p + n
MOVQ AX, BX
ADDQ CX, BX
// end address for loop
// n &lt;= 8, jump to tail
CMPQ CX, $0x00000008
JLE  tail
// n &lt; 16, jump to loop8
CMPQ CX, $0x00000010
JL   loop8_start
// n &lt; 32, jump to loop16
CMPQ CX, $0x00000020
JL   loop16_start
// --------------------------------------------
// end address for loop32
MOVQ BX, CX
SUBQ $0x0000001f, CX
loop32:
// compute x &amp; y, and save value to x
VMOVDQU (AX), Y0
VANDPS  (DX), Y0, Y0
VMOVDQU Y0, (AX)
// move pointer
ADDQ $0x00000020, AX
ADDQ $0x00000020, DX
CMPQ AX, CX
JL   loop32
// n &lt;= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE  tail
// n &lt; 16, jump to loop8
CMPQ CX, $0x00000010
JL   loop8_start
// --------------------------------------------
loop16_start:
// end address for loop16
MOVQ BX, CX
SUBQ $0x0000000f, CX
loop16:
// compute x &amp; y, and save value to x
VMOVDQU (AX), X0
VANDPS  (DX), X0, X0
VMOVDQU X0, (AX)
// move pointer
ADDQ $0x00000010, AX
ADDQ $0x00000010, DX
CMPQ AX, CX
JL   loop16
// n &lt;= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE  tail
// --------------------------------------------
loop8_start:
// end address for loop8
MOVQ BX, CX
SUBQ $0x00000007, CX
loop8:
// compute x &amp; y, and save value to x
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
// move pointer
ADDQ $0x00000008, AX
ADDQ $0x00000008, DX
CMPQ AX, CX
JL   loop8
// --------------------------------------------
tail:
// left elements (&lt;=8)
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
RET

Benchmark result:

test                       data-size        time
-------------------        ---------        -----------
BenchmarkGrailbio          8.00_B           4.654 ns/op
BenchmarkGoAsm             8.00_B           4.824 ns/op
BenchmarkUnrollLoop        8.00_B           6.851 ns/op
BenchmarkLoop              8.00_B           8.683 ns/op
BenchmarkGrailbio          16.00_B          5.363 ns/op
BenchmarkGoAsm             16.00_B          6.369 ns/op
BenchmarkUnrollLoop        16.00_B          10.47 ns/op
BenchmarkLoop              16.00_B          13.48 ns/op
BenchmarkGoAsm             32.00_B          6.079 ns/op
BenchmarkGrailbio          32.00_B          6.497 ns/op
BenchmarkUnrollLoop        32.00_B          17.46 ns/op
BenchmarkLoop              32.00_B          21.09 ns/op
BenchmarkGoAsm             128.00_B         10.52 ns/op
BenchmarkGrailbio          128.00_B         14.40 ns/op
BenchmarkUnrollLoop        128.00_B         56.97 ns/op
BenchmarkLoop              128.00_B         80.12 ns/op
BenchmarkGoAsm             256.00_B         15.48 ns/op
BenchmarkGrailbio          256.00_B         23.76 ns/op
BenchmarkUnrollLoop        256.00_B         110.8 ns/op
BenchmarkLoop              256.00_B         147.5 ns/op
BenchmarkGoAsm             1.00_KB          47.16 ns/op
BenchmarkGrailbio          1.00_KB          87.75 ns/op
BenchmarkUnrollLoop        1.00_KB          443.1 ns/op
BenchmarkLoop              1.00_KB          540.5 ns/op
BenchmarkGoAsm             16.00_KB         751.6 ns/op
BenchmarkGrailbio          16.00_KB         1342 ns/op
BenchmarkUnrollLoop        16.00_KB         7007 ns/op
BenchmarkLoop              16.00_KB         8623 ns/op

huangapple
  • 本文由 发表于 2021年7月7日 14:08:52
  • 转载请务必保留本文链接:https://go.coder-hub.com/68280854.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定