英文:
Could one explain significant performance difference in char-by-char iteration over j.l.String?
问题
以下是翻译好的部分:
我已经尝试了两种迭代java.lang.String的char-by-char方式,并发现它们令人困惑。基准测试总结如下:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
public class CharByCharIterationBenchmark {
@Benchmark
public void toCharArray(Data data, Blackhole b) {
char[] chars = data.string.toCharArray();
for (char ch : chars) {
b.consume(ch);
}
}
@Benchmark
public void charAt(Data data, Blackhole b) {
String string = data.string;
int length = string.length();
for (int i = 0; i < length; i++) {
b.consume(string.charAt(i));
}
}
@State(Scope.Thread)
public static class Data {
String string;
@Param({"true", "false"})
private boolean latin;
@Param({"5", "10", "50", "100"})
private int length;
@Setup
public void setup() {
String alphabet = latin
? "abcdefghijklmnopqrstuvwxyz" // English
: "абвгдеёжзиклмнопрстуфхцчшщьыъэюя"; // Russian
RandomStringGenerator generator = new RandomStringGenerator();
string = generator.randomString(alphabet, length);
}
}
}
直观上,toCharArray()
中描述的方法似乎不太有效,因为它在Java 8中分配了底层char[]
的副本,并在Java 9及更新版本中将byte[]
编码为char[]
。然而,实际上情况相反:toCharArray()
执行速度要快得多。
首先,我认为这里的原因与Nitsan Wakart的文章“Volatile read surprise”中描述的原因相同。然而,通过使用perfasm进行分析,我发现代码中最热的点与char[]
/byte[]
字段访问无关:
看起来最热门的地方是iinc
(循环索引的增量)和ishr
(算术右移)在StringUTF16.length()
中,这对我来说相当反直觉。
此外,使用perfnorm分析器,我发现toCharArray()
具有比charAt()
更少的周期、指令和加载缺失:
请有人帮助解释这个结果,并解释为什么会有如此显著的差异?
英文:
I've tried two ways to iterate char-by-char over java.lang.String and found them confusing. The benchmark summarizes it:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
public class CharByCharIterationBenchmark {
@Benchmark
public void toCharArray(Data data, Blackhole b) {
char[] chars = data.string.toCharArray();
for (char ch : chars) {
b.consume(ch);
}
}
@Benchmark
public void charAt(Data data, Blackhole b) {
String string = data.string;
int length = string.length();
for (int i = 0; i < length; i++) {
b.consume(string.charAt(i));
}
}
@State(Scope.Thread)
public static class Data {
String string;
@Param({"true", "false"})
private boolean latin;
@Param({"5", "10", "50", "100"})
private int length;
@Setup
public void setup() {
String alphabet = latin
? "abcdefghijklmnopqrstuvwxyz" // English
: "абвгдеёжзиклмнопрстуфхцчшщьыъэюя"; // Russian
RandomStringGenerator generator = new RandomStringGenerator();
string = generator.randomString(alphabet, length);
}
}
Intuitively the approach described in toCharArray()
seems to be less effective as it allocates a copy of underlying char[]
as of Java 8 and encodes byte[]
into char[]
as of Java 9 and newer. In practise however it's vice versa: toCharArray()
performs much faster:
Java 8
(latin) (length) Mode Score Error Units
charAt true 5 avgt 21.051 ± 0.796 ns/op
charAt true 10 avgt 44.002 ± 2.324 ns/op
charAt true 50 avgt 221.068 ± 7.422 ns/op
charAt true 100 avgt 410.162 ± 13.441 ns/op
toCharArray true 5 avgt 16.819 ± 0.662 ns/op
toCharArray true 10 avgt 28.364 ± 0.663 ns/op
toCharArray true 50 avgt 110.910 ± 1.144 ns/op
toCharArray true 100 avgt 205.694 ± 1.669 ns/op
charAt:·gc.alloc.rate.norm true 5 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm true 10 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm true 50 avgt ≈ 10⁻⁴ B/op
charAt:·gc.alloc.rate.norm true 100 avgt ≈ 10⁻⁴ B/op
toCharArray:·gc.alloc.rate.norm true 5 avgt 32.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 10 avgt 40.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 50 avgt 120.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 100 avgt 216.000 ± 0.001 B/op
charAt false 5 avgt 20.372 ± 0.406 ns/op
charAt false 10 avgt 39.962 ± 0.911 ns/op
charAt false 50 avgt 201.337 ± 3.752 ns/op
charAt false 100 avgt 410.530 ± 17.931 ns/op
toCharArray false 5 avgt 15.767 ± 0.606 ns/op
toCharArray false 10 avgt 26.258 ± 0.345 ns/op
toCharArray false 50 avgt 109.631 ± 1.064 ns/op
toCharArray false 100 avgt 205.815 ± 4.716 ns/op
charAt:·gc.alloc.rate.norm false 5 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm false 10 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm false 50 avgt ≈ 10⁻⁴ B/op
charAt:·gc.alloc.rate.norm false 100 avgt ≈ 10⁻⁴ B/op
toCharArray:·gc.alloc.rate.norm false 5 avgt 32.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 10 avgt 40.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 50 avgt 120.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 100 avgt 216.000 ± 0.001 B/op
Java 11
(latin) (length) Mode Score Error Units
charAt true 5 avgt 22.035 ± 1.557 ns/op
charAt true 10 avgt 41.800 ± 1.572 ns/op
charAt true 50 avgt 227.180 ± 18.655 ns/op
charAt true 100 avgt 474.719 ± 29.782 ns/op
toCharArray true 5 avgt 17.091 ± 0.662 ns/op
toCharArray true 10 avgt 26.167 ± 0.220 ns/op
toCharArray true 50 avgt 127.876 ± 2.106 ns/op
toCharArray true 100 avgt 244.449 ± 9.330 ns/op
charAt:·gc.alloc.rate.norm true 5 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm true 10 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm true 50 avgt ≈ 10⁻⁴ B/op
charAt:·gc.alloc.rate.norm true 100 avgt ≈ 10⁻⁴ B/op
toCharArray:·gc.alloc.rate.norm true 5 avgt 32.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 10 avgt 40.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 50 avgt 120.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 100 avgt 216.000 ± 0.001 B/op
charAt false 5 avgt 22.215 ± 2.064 ns/op
charAt false 10 avgt 45.606 ± 2.567 ns/op
charAt false 50 avgt 204.577 ± 18.302 ns/op
charAt false 100 avgt 404.056 ± 10.203 ns/op
toCharArray false 5 avgt 17.055 ± 0.556 ns/op
toCharArray false 10 avgt 29.254 ± 2.616 ns/op
toCharArray false 50 avgt 123.610 ± 5.033 ns/op
toCharArray false 100 avgt 226.174 ± 6.396 ns/op
charAt:·gc.alloc.rate.norm false 5 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm false 10 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm false 50 avgt ≈ 10⁻⁴ B/op
charAt:·gc.alloc.rate.norm false 100 avgt ≈ 10⁻⁴ B/op
toCharArray:·gc.alloc.rate.norm false 5 avgt 32.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 10 avgt 40.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 50 avgt 120.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 100 avgt 216.000 ± 0.001 B/op
First I was thinking that reason here is the same as descrbed in Nitsan Wakart's article "Volatile read surprise". However doing profiling with perfasm I see that the hottest spots in the code are not related to char[]
/byte[]
field access:
╭│ 0x00007fa638407dd9: jmp 0x00007fa638407e4c
││ 0x00007fa638407ddb: nopl 0x0(%rax,%rax,1)
4.96% ││ ↗ 0x00007fa638407de0: shl $0x3,%r11
0.01% ││ │ 0x00007fa638407de4: movzwl 0x10(%r11,%r13,2),%edx ;*invokevirtual charAt {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
││ │ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.58% ││ ↗│ 0x00007fa638407dea: mov %rsi,0x18(%rsp)
1.87% ││ ││ 0x00007fa638407def: mov %r8d,0x14(%rsp)
4.18% ││ ││ 0x00007fa638407df4: mov %edi,0x10(%rsp)
0.04% ││ ││ 0x00007fa638407df8: mov %rbx,0x8(%rsp)
1.29% ││ ││ 0x00007fa638407dfd: mov %r10,(%rsp)
1.83% ││ ││ 0x00007fa638407e01: mov %r9,0x70(%rsp)
4.32% ││ ││ 0x00007fa638407e06: mov %rax,0x60(%rsp)
0.05% ││ ││ 0x00007fa638407e0b: mov %r9,%rsi
1.27% ││ ││ 0x00007fa638407e0e: nop
1.88% ││ ││ 0x00007fa638407e0f: callq 0x00007fa630926e00 ; ImmutableOopMap{[96]=Oop [104]=Oop [112]=Oop [120]=Oop [0]=Oop [16]=NarrowOop [24]=Oop }
││ ││ ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
││ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@28 (line 35)
││ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
││ ││ ; {optimized virtual_call}
5.71% ││ ││ 0x00007fa638407e14: inc %ebp ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@31 (line 34)
││ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.05% ││ ││ 0x00007fa638407e16: cmp 0x14(%rsp),%ebp
0.00% │╰ ││ 0x00007fa638407e1a: jge 0x00007fa638407d87 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@18 (line 34)
│ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.05% │ ││ 0x00007fa638407e20: mov 0x10(%rsp),%edi
4.24% │ ││ 0x00007fa638407e24: movsbl 0x14(%r12,%rdi,8),%ecx ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::isLatin1@7 (line 3266)
│ ││ ; - java.lang.String::charAt@1 (line 692)
│ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.86% │ ││ 0x00007fa638407e2a: mov 0xc(%r12,%rdi,8),%r11d ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::charAt@8 (line 693)
│ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
1.67% │ ││ 0x00007fa638407e2f: mov 0x60(%rsp),%rax
1.71% │ ││ 0x00007fa638407e34: mov 0x70(%rsp),%r9
3.92% │ ││ 0x00007fa638407e39: mov (%rsp),%r10
0.20% │ ││ 0x00007fa638407e3d: mov 0x8(%rsp),%rbx
1.44% │ ││ 0x00007fa638407e42: mov 0x14(%rsp),%r8d
1.70% │ ││ 0x00007fa638407e47: mov 0x18(%rsp),%rsi ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@21 (line 35)
│ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.93% ↘ ││ 0x00007fa638407e4c: movslq %ebp,%r13 ;*invokestatic getChar {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.StringUTF16::charAt@7 (line 1268)
││ ; - java.lang.String::charAt@21 (line 695)
││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.23% ││ 0x00007fa638407e4f: test %ecx,%ecx
0.00% ╭││ 0x00007fa638407e51: jne 0x00007fa638407e6b ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
│││ ; - java.lang.String::charAt@4 (line 692)
│││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
│││ 0x00007fa638407e53: mov 0xc(%r12,%r11,8),%edx ; implicit exception: dispatches to 0x00007fa638407fbc
│││ 0x00007fa638407e58: cmp %edx,%ebp
│││ 0x00007fa638407e5a: jae 0x00007fa638407eb0
│││ 0x00007fa638407e5c: shl $0x3,%r11
│││ 0x00007fa638407e60: movzbl 0x10(%r11,%r13,1),%edx ;*iand {reexecute=0 rethrow=0 return_oop=0}
│││ ; - java.lang.StringLatin1::charAt@25 (line 49)
│││ ; - java.lang.String::charAt@12 (line 693)
│││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
│╰│ 0x00007fa638407e66: jmpq 0x00007fa638407dea
1.52% ↘ │ 0x00007fa638407e6b: mov 0xc(%r12,%r11,8),%ecx ; implicit exception: dispatches to 0x00007fa638407fb0
5.99% │ 0x00007fa638407e70: sar %ecx ;*ishr {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.StringUTF16::length@3 (line 74)
│ ; - java.lang.StringUTF16::checkIndex@2 (line 1470)
│ ; - java.lang.StringUTF16::charAt@2 (line 1267)
│ ; - java.lang.String::charAt@21 (line 695)
│ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
5.51% │ 0x00007fa638407e72: cmp %ecx,%ebp
0.01% ╰ 0x00007fa638407e74: jb 0x00007fa638407de0 ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
It looks like the hottest place is iinc
(increment of loop index) and ishr
arythmetic shift in StringUTF16.length()
which is rather conter-intuitive to me.
Also using perfnorm profiler I see that toCharArray()
has less cycles, instructions and load-misses comparing to chatAt()
:
Benchmark Mode Cnt Score Error Units
charAt:L1-dcache-loads avgt 2104.816 #/op
charAt:L1-dcache-stores avgt 1200.878 #/op
charAt:branches avgt 603.754 #/op
charAt:cycles avgt 1461.282 #/op
charAt:dTLB-loads avgt 2105.253 #/op
charAt:dTLB-stores avgt 1200.909 #/op
charAt:instructions avgt 4716.775 #/op
toCharArray:L1-dcache-loads avgt 1026.341 #/op
toCharArray:L1-dcache-stores avgt 416.997 #/op
toCharArray:branches avgt 419.265 #/op
toCharArray:cycles avgt 820.521 #/op
toCharArray:dTLB-loads avgt 1026.506 #/op
toCharArray:dTLB-stores avgt 417.591 #/op
toCharArray:instructions avgt 2409.806 #/op
Could someone help to interpret this and explain such significant difference?
答案1
得分: 1
正如@apangin在他的评论中提到的:
问题在于BlackHole.consume在循环内部被调用。作为一个不被内联的黑匣子方法,它阻止了对调用周围的代码进行优化,特别是对String字段进行缓存。
英文:
As @apangin mentioned in his comment
> The problem is that BlackHole.consume is called inside the loop. Being a non-inlined black box method, it prevents from optimizing the code around the call, in particular, caching String fields.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论