英文:
Different Benchmark results for functions with same early return statement
问题
我已经为将输入数组向右旋转n次实现了两个函数。
两种实现方式都会在初始数组等于所有旋转后的结果数组时提前退出,即当n为0或数组长度的倍数时。
func rotateRight1(nums []int, n int) {
n = n % len(nums)
if n == 0 {
return
}
lastNDigits := make([]int, n)
copy(lastNDigits, nums[len(nums)-n:])
copy(nums[n:], nums[:len(nums)-n])
copy(nums[:n], lastNDigits)
}
和
func rotateRight2(nums []int, n int) {
n = n % len(nums)
if n == 0 {
return
}
i := 0
current := nums[i]
iAlreadySeen := i
for j := 0; j < len(nums); j++ {
nextI := (i + n) % len(nums)
nums[nextI], current = current, nums[nextI]
i = nextI
// 处理长度为偶数的数组,其中i+k可能等于已经遍历过的索引
if nextI == iAlreadySeen {
i = (i + 1) % len(nums)
iAlreadySeen = i
current = nums[i]
}
}
}
在进行基准测试时,当n等于0时,两个函数的速度差异让我感到惊讶,差异达到了20倍。
func BenchmarkRotateRight1(b *testing.B) {
nums := make([]int, 5_000)
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
rotateRight1(nums, 0)
}
}
func BenchmarkRotateRight2(b *testing.B) {
nums := make([]int, 5_000)
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
rotateRight2(nums, 0)
}
}
运行go test -bench=.
会得到如下结果:
cpu: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
BenchmarkRotateRight1-4 1000000000 0.4603 ns/op 0 B/op 0 allocs/op
BenchmarkRotateRight2-4 97236492 12.11 ns/op 0 B/op 0 allocs/op
PASS
我不理解这种性能差异,因为两个函数基本上都在做相同的事情,并且在if k == 0
条件下都会提前退出。
有人可以帮我理解这个问题吗?
go version go1.18 linux/amd64
英文:
I've implemented two functions for rotating an input array n
times to the right.
Both implementations exit early, without performing anything, if the initial array would be equal to the resulting array after all rotations. This happens anytime n
is 0 or a multiple of the length of the array.
func rotateRight1(nums []int, n int) {
n = n % len(nums)
if n == 0 {
return
}
lastNDigits := make([]int, n)
copy(lastNDigits, nums[len(nums)-n:])
copy(nums[n:], nums[:len(nums)-n])
copy(nums[:n], lastNDigits)
}
and
func rotateRight2(nums []int, n int) {
n = n % len(nums)
if n == 0 {
return
}
i := 0
current := nums[i]
iAlreadySeen := i
for j := 0; j < len(nums); j++ {
nextI := (i + n) % len(nums)
nums[nextI], current = current, nums[nextI]
i = nextI
// handle even length arrays where i+k might equal an already seen index
if nextI == iAlreadySeen {
i = (i + 1) % len(nums)
iAlreadySeen = i
current = nums[i]
}
}
}
When benchmarking, I was surprised to see a +20x difference in speed when n
equals 0 for both functions.
func BenchmarkRotateRight1(b *testing.B) {
nums := make([]int, 5_000)
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
rotateRight1(nums, 0)
}
}
func BenchmarkRotateRight2(b *testing.B) {
nums := make([]int, 5_000)
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
rotateRight2(nums, 0)
}
}
go test -bench=.
yields a result like this consistently:
cpu: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
BenchmarkRotateRight1-4 1000000000 0.4603 ns/op 0 B/op 0 allocs/op
BenchmarkRotateRight2-4 97236492 12.11 ns/op 0 B/op 0 allocs/op
PASS
I don't understand this performance difference as both functions are basically doing the same thing and exiting early in the if k == 0
condition.
Could someone help me understand this?
go version go1.18 linux/amd64
答案1
得分: 3
你所看到的是编译器优化的结果。编译器足够聪明,通过内联函数调用来加快应用程序的速度。有时,编译器会优化以消除函数调用,并人为降低基准测试的运行时间(这可能会有些棘手)。
在进行分析之后,我们可以注意到函数rotateRight1
在基准测试执行期间甚至没有被调用:
(pprof) list BenchmarkRotateRight1
Total: 260ms
ROUTINE ======================== main_test.BenchmarkRotateRight1 in (edited)
260ms 260ms (flat, cum) 100% of Total
. . 43:func BenchmarkRotateRight1(b *testing.B) {
. . 44: nums := make([]int, 5_000)
. . 45:
. . 46: b.ResetTimer()
. . 47: b.ReportAllocs()
260ms 260ms 48: for i := 0; i < b.N; i++ {
. . 49: rotateRight1(nums, 0)
. . 50: }
. . 51:}
另一方面,rotateRight2
被调用了,这就是你在基准测试的运行时间中看到的差异:
(pprof) list BenchmarkRotateRight2
Total: 1.89s
ROUTINE ======================== main_test.BenchmarkRotateRight2 in (edited)
180ms 1.89s (flat, cum) 100% of Total
. . 53:func BenchmarkRotateRight2(b *testing.B) {
. . 54: nums := make([]int, 5_000)
. . 55:
. . 56: b.ResetTimer()
. . 57: b.ReportAllocs()
130ms 130ms 58: for i := 0; i < b.N; i++ {
50ms 1.76s 59: rotateRight2(nums, 0)
. . 60: }
. . 61:}
如果你使用-gcflags '-m -m'
(双重-m
或-m=2
)运行基准测试,你将看到编译器优化代码的一些决策(参考文档:https://pkg.go.dev/cmd/compile):
$ go test -gcflags '-m -m' -benchmem -bench . main_test.go
# command-line-arguments_test [command-line-arguments.test]
./main_test.go:7:6: can inline rotateRight1 with cost 40 as: func([]int, int) { n = n % len(nums); if n == 0 { return }; lastNDigits := make([]int, n); copy(lastNDigits, nums[len(nums) - n:]); copy(nums[n:], nums[:len(nums) - n]); copy(nums[:n], lastNDigits) }
./main_test.go:20:6: cannot inline rotateRight2: function too complex: cost 83 exceeds budget 80
...
因此,基于复杂度阈值,编译器决定是否进行优化。
你可以在要禁用内联优化的函数之前使用//go:noinline
指令。这将覆盖编译器通常的优化规则:
//go:noinline
func rotateRight1(nums []int, n int) {
...
}
现在你会注意到基准测试非常相似:
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkRotateRight1
BenchmarkRotateRight1-16 135554571 8.886 ns/op 0 B/op 0 allocs/op
BenchmarkRotateRight2
BenchmarkRotateRight2-16 143716638 8.775 ns/op 0 B/op 0 allocs/op
PASS
英文:
What you see is the result of compiler optimization. The compiler is clever enough to make the applications faster by inlining function calls. Sometimes, the compiler optimizes to remove function calls and artificially lowers the run time of benchmarks (that can be tricky).
After profiling, we can notice the function rotateRight1
is not even being called during the benchmark execution:
(pprof) list BenchmarkRotateRight1
Total: 260ms
ROUTINE ======================== main_test.BenchmarkRotateRight1 in (edited)
260ms 260ms (flat, cum) 100% of Total
. . 43:func BenchmarkRotateRight1(b *testing.B) {
. . 44: nums := make([]int, 5_000)
. . 45:
. . 46: b.ResetTimer()
. . 47: b.ReportAllocs()
260ms 260ms 48: for i := 0; i < b.N; i++ {
. . 49: rotateRight1(nums, 0)
. . 50: }
. . 51:}
On the other hand, rotateRight2
is being called, and that's why you see that difference in the run time of the benchmarks:
(pprof) list BenchmarkRotateRight2
Total: 1.89s
ROUTINE ======================== main_test.BenchmarkRotateRight2 in (edited)
180ms 1.89s (flat, cum) 100% of Total
. . 53:func BenchmarkRotateRight2(b *testing.B) {
. . 54: nums := make([]int, 5_000)
. . 55:
. . 56: b.ResetTimer()
. . 57: b.ReportAllocs()
130ms 130ms 58: for i := 0; i < b.N; i++ {
50ms 1.76s 59: rotateRight2(nums, 0)
. . 60: }
. . 61:}
If you run your benchmark with -gcflags '-m -m'
(double -m
or -m=2
), you will see some of the compiler decisions to optimize the code (reference):
$ go test -gcflags '-m -m' -benchmem -bench . main_test.go
# command-line-arguments_test [command-line-arguments.test]
./main_test.go:7:6: can inline rotateRight1 with cost 40 as: func([]int, int) { n = n % len(nums); if n == 0 { return }; lastNDigits := make([]int, n); copy(lastNDigits, nums[len(nums) - n:]); copy(nums[n:], nums[:len(nums) - n]); copy(nums[:n], lastNDigits) }
./main_test.go:20:6: cannot inline rotateRight2: function too complex: cost 83 exceeds budget 80
...
So, based on a complexity threshold, the compiler decides whether to optimize or not.
You can use the//go:noinline
directive right before the function you want to disable inlining optimization though. That will override the compiler's usual optimization rules:
//go:noinline
func rotateRight1(nums []int, n int) {
...
}
Now you will notice the benchmarks are very similar:
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkRotateRight1
BenchmarkRotateRight1-16 135554571 8.886 ns/op 0 B/op 0 allocs/op
BenchmarkRotateRight2
BenchmarkRotateRight2-16 143716638 8.775 ns/op 0 B/op 0 allocs/op
PASS
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论