能否解释一下在逐字符迭代上与j.l.String相比存在的显著性能差异?

huangapple go评论75阅读模式
英文:

Could one explain significant performance difference in char-by-char iteration over j.l.String?

问题

以下是翻译好的部分:

我已经尝试了两种迭代java.lang.String的char-by-char方式,并发现它们令人困惑。基准测试总结如下:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
public class CharByCharIterationBenchmark {

  @Benchmark
  public void toCharArray(Data data, Blackhole b) {
    char[] chars = data.string.toCharArray();
    for (char ch : chars) {
      b.consume(ch);
    }
  }

  @Benchmark
  public void charAt(Data data, Blackhole b) {
    String string = data.string;
    int length = string.length();
    for (int i = 0; i < length; i++) {
      b.consume(string.charAt(i));
    }
  }

  @State(Scope.Thread)
  public static class Data {
    String string;

    @Param({"true", "false"})
    private boolean latin;

    @Param({"5", "10", "50", "100"})
    private int length;

    @Setup
    public void setup() {
      String alphabet = latin
        ? "abcdefghijklmnopqrstuvwxyz"        // English
        : "абвгдеёжзиклмнопрстуфхцчшщьыъэюя"; // Russian

      RandomStringGenerator generator = new RandomStringGenerator();

      string = generator.randomString(alphabet, length);
    }
  }
}

直观上,toCharArray() 中描述的方法似乎不太有效,因为它在Java 8中分配了底层char[]的副本,并在Java 9及更新版本中将byte[]编码为char[]。然而,实际上情况相反:toCharArray()执行速度要快得多。

首先,我认为这里的原因与Nitsan Wakart的文章“Volatile read surprise”中描述的原因相同。然而,通过使用perfasm进行分析,我发现代码中最热的点与char[]/byte[]字段访问无关:

看起来最热门的地方是iinc(循环索引的增量)和ishr(算术右移)在StringUTF16.length()中,这对我来说相当反直觉。

此外,使用perfnorm分析器,我发现toCharArray()具有比charAt()更少的周期、指令和加载缺失:

请有人帮助解释这个结果,并解释为什么会有如此显著的差异?

英文:

I've tried two ways to iterate char-by-char over java.lang.String and found them confusing. The benchmark summarizes it:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
public class CharByCharIterationBenchmark {

  @Benchmark
  public void toCharArray(Data data, Blackhole b) {
    char[] chars = data.string.toCharArray();
    for (char ch : chars) {
      b.consume(ch);
    }
  }

  @Benchmark
  public void charAt(Data data, Blackhole b) {
    String string = data.string;
    int length = string.length();
    for (int i = 0; i < length; i++) {
      b.consume(string.charAt(i));
    }
  }

  @State(Scope.Thread)
  public static class Data {
    String string;

    @Param({"true", "false"})
    private boolean latin;

    @Param({"5", "10", "50", "100"})
    private int length;

    @Setup
    public void setup() {
      String alphabet = latin
        ? "abcdefghijklmnopqrstuvwxyz"        // English
        : "абвгдеёжзиклмнопрстуфхцчшщьыъэюя"; // Russian

      RandomStringGenerator generator = new RandomStringGenerator();

      string = generator.randomString(alphabet, length);
    }
  }

Intuitively the approach described in toCharArray() seems to be less effective as it allocates a copy of underlying char[] as of Java 8 and encodes byte[] into char[] as of Java 9 and newer. In practise however it's vice versa: toCharArray() performs much faster:

Java 8
(latin)  (length)  Mode      Score     Error   Units
charAt                              true         5  avgt     21.051 ±   0.796   ns/op
charAt                              true        10  avgt     44.002 ±   2.324   ns/op
charAt                              true        50  avgt    221.068 ±   7.422   ns/op
charAt                              true       100  avgt    410.162 ±  13.441   ns/op
toCharArray                         true         5  avgt     16.819 ±   0.662   ns/op
toCharArray                         true        10  avgt     28.364 ±   0.663   ns/op
toCharArray                         true        50  avgt    110.910 ±   1.144   ns/op
toCharArray                         true       100  avgt    205.694 ±   1.669   ns/op
charAt:·gc.alloc.rate.norm          true         5  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          true        10  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          true        50  avgt     ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm          true       100  avgt     ≈ 10⁻⁴              B/op
toCharArray:·gc.alloc.rate.norm     true         5  avgt     32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true        10  avgt     40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true        50  avgt    120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true       100  avgt    216.000 ±   0.001    B/op
charAt                              false        5  avgt     20.372 ±   0.406   ns/op
charAt                              false       10  avgt     39.962 ±   0.911   ns/op
charAt                              false       50  avgt    201.337 ±   3.752   ns/op
charAt                              false      100  avgt    410.530 ±  17.931   ns/op
toCharArray                         false        5  avgt     15.767 ±   0.606   ns/op
toCharArray                         false       10  avgt     26.258 ±   0.345   ns/op
toCharArray                         false       50  avgt    109.631 ±   1.064   ns/op
toCharArray                         false      100  avgt    205.815 ±   4.716   ns/op
charAt:·gc.alloc.rate.norm          false        5  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false       10  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false       50  avgt     ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm          false      100  avgt     ≈ 10⁻⁴              B/op
toCharArray:·gc.alloc.rate.norm     false        5  avgt     32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       10  avgt     40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       50  avgt    120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false      100  avgt    216.000 ±   0.001    B/op
Java 11
(latin)  (length)  Mode     Score     Error   Units
charAt                               true         5  avgt    22.035 ±   1.557   ns/op
charAt                               true        10  avgt    41.800 ±   1.572   ns/op
charAt                               true        50  avgt   227.180 ±  18.655   ns/op
charAt                               true       100  avgt   474.719 ±  29.782   ns/op
toCharArray                          true         5  avgt    17.091 ±   0.662   ns/op
toCharArray                          true        10  avgt    26.167 ±   0.220   ns/op
toCharArray                          true        50  avgt   127.876 ±   2.106   ns/op
toCharArray                          true       100  avgt   244.449 ±   9.330   ns/op
charAt:·gc.alloc.rate.norm           true         5  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm           true        10  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm           true        50  avgt    ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm           true       100  avgt    ≈ 10⁻⁴              B/op
toCharArray:·gc.alloc.rate.norm      true         5  avgt    32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true        10  avgt    40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true        50  avgt   120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true       100  avgt   216.000 ±   0.001    B/op
charAt                              false         5  avgt    22.215 ±   2.064   ns/op
charAt                              false        10  avgt    45.606 ±   2.567   ns/op
charAt                              false        50  avgt   204.577 ±  18.302   ns/op
charAt                              false       100  avgt   404.056 ±  10.203   ns/op
toCharArray                         false         5  avgt    17.055 ±   0.556   ns/op
toCharArray                         false        10  avgt    29.254 ±   2.616   ns/op
toCharArray                         false        50  avgt   123.610 ±   5.033   ns/op
toCharArray                         false       100  avgt   226.174 ±   6.396   ns/op
charAt:·gc.alloc.rate.norm          false         5  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false        10  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false        50  avgt    ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm          false       100  avgt    ≈ 10⁻⁴              B/op
toCharArray:·gc.alloc.rate.norm     false         5  avgt    32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false        10  avgt    40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false        50  avgt   120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       100  avgt   216.000 ±   0.001    B/op

First I was thinking that reason here is the same as descrbed in Nitsan Wakart's article "Volatile read surprise". However doing profiling with perfasm I see that the hottest spots in the code are not related to char[]/byte[] field access:

           ╭│     0x00007fa638407dd9: jmp    0x00007fa638407e4c
││     0x00007fa638407ddb: nopl   0x0(%rax,%rax,1)
4.96%    ││  ↗  0x00007fa638407de0: shl    $0x3,%r11
0.01%    ││  │  0x00007fa638407de4: movzwl 0x10(%r11,%r13,2),%edx  ;*invokevirtual charAt {reexecute=0 rethrow=0 return_oop=0}
││  │                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
││  │                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.58%    ││ ↗│  0x00007fa638407dea: mov    %rsi,0x18(%rsp)
1.87%    ││ ││  0x00007fa638407def: mov    %r8d,0x14(%rsp)
4.18%    ││ ││  0x00007fa638407df4: mov    %edi,0x10(%rsp)
0.04%    ││ ││  0x00007fa638407df8: mov    %rbx,0x8(%rsp)
1.29%    ││ ││  0x00007fa638407dfd: mov    %r10,(%rsp)
1.83%    ││ ││  0x00007fa638407e01: mov    %r9,0x70(%rsp)
4.32%    ││ ││  0x00007fa638407e06: mov    %rax,0x60(%rsp)
0.05%    ││ ││  0x00007fa638407e0b: mov    %r9,%rsi
1.27%    ││ ││  0x00007fa638407e0e: nop
1.88%    ││ ││  0x00007fa638407e0f: callq  0x00007fa630926e00  ; ImmutableOopMap{[96]=Oop [104]=Oop [112]=Oop [120]=Oop [0]=Oop [16]=NarrowOop [24]=Oop }
││ ││                                                ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
││ ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@28 (line 35)
││ ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
││ ││                                                ;   {optimized virtual_call}
5.71%    ││ ││  0x00007fa638407e14: inc    %ebp               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@31 (line 34)
││ ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.05%    ││ ││  0x00007fa638407e16: cmp    0x14(%rsp),%ebp
0.00%    │╰ ││  0x00007fa638407e1a: jge    0x00007fa638407d87  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@18 (line 34)
│  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.05%    │  ││  0x00007fa638407e20: mov    0x10(%rsp),%edi
4.24%    │  ││  0x00007fa638407e24: movsbl 0x14(%r12,%rdi,8),%ecx  ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
│  ││                                                ; - java.lang.String::isLatin1@7 (line 3266)
│  ││                                                ; - java.lang.String::charAt@1 (line 692)
│  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.86%    │  ││  0x00007fa638407e2a: mov    0xc(%r12,%rdi,8),%r11d  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
│  ││                                                ; - java.lang.String::charAt@8 (line 693)
│  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
1.67%    │  ││  0x00007fa638407e2f: mov    0x60(%rsp),%rax
1.71%    │  ││  0x00007fa638407e34: mov    0x70(%rsp),%r9
3.92%    │  ││  0x00007fa638407e39: mov    (%rsp),%r10
0.20%    │  ││  0x00007fa638407e3d: mov    0x8(%rsp),%rbx
1.44%    │  ││  0x00007fa638407e42: mov    0x14(%rsp),%r8d
1.70%    │  ││  0x00007fa638407e47: mov    0x18(%rsp),%rsi    ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
│  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@21 (line 35)
│  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.93%    ↘  ││  0x00007fa638407e4c: movslq %ebp,%r13          ;*invokestatic getChar {reexecute=0 rethrow=0 return_oop=0}
││                                                ; - java.lang.StringUTF16::charAt@7 (line 1268)
││                                                ; - java.lang.String::charAt@21 (line 695)
││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.23%       ││  0x00007fa638407e4f: test   %ecx,%ecx
0.00%      ╭││  0x00007fa638407e51: jne    0x00007fa638407e6b  ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
│││                                                ; - java.lang.String::charAt@4 (line 692)
│││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
│││  0x00007fa638407e53: mov    0xc(%r12,%r11,8),%edx  ; implicit exception: dispatches to 0x00007fa638407fbc
│││  0x00007fa638407e58: cmp    %edx,%ebp
│││  0x00007fa638407e5a: jae    0x00007fa638407eb0
│││  0x00007fa638407e5c: shl    $0x3,%r11
│││  0x00007fa638407e60: movzbl 0x10(%r11,%r13,1),%edx  ;*iand {reexecute=0 rethrow=0 return_oop=0}
│││                                                ; - java.lang.StringLatin1::charAt@25 (line 49)
│││                                                ; - java.lang.String::charAt@12 (line 693)
│││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
│╰│  0x00007fa638407e66: jmpq   0x00007fa638407dea
1.52%      ↘ │  0x00007fa638407e6b: mov    0xc(%r12,%r11,8),%ecx  ; implicit exception: dispatches to 0x00007fa638407fb0
5.99%        │  0x00007fa638407e70: sar    %ecx               ;*ishr {reexecute=0 rethrow=0 return_oop=0}
│                                                ; - java.lang.StringUTF16::length@3 (line 74)
│                                                ; - java.lang.StringUTF16::checkIndex@2 (line 1470)
│                                                ; - java.lang.StringUTF16::charAt@2 (line 1267)
│                                                ; - java.lang.String::charAt@21 (line 695)
│                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
5.51%        │  0x00007fa638407e72: cmp    %ecx,%ebp
0.01%        ╰  0x00007fa638407e74: jb     0x00007fa638407de0  ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}

It looks like the hottest place is iinc (increment of loop index) and ishr arythmetic shift in StringUTF16.length() which is rather conter-intuitive to me.

Also using perfnorm profiler I see that toCharArray() has less cycles, instructions and load-misses comparing to chatAt():

Benchmark                          Mode  Cnt     Score   Error  Units
charAt:L1-dcache-loads             avgt       2104.816           #/op
charAt:L1-dcache-stores            avgt       1200.878           #/op
charAt:branches                    avgt        603.754           #/op
charAt:cycles                      avgt       1461.282           #/op
charAt:dTLB-loads                  avgt       2105.253           #/op
charAt:dTLB-stores                 avgt       1200.909           #/op
charAt:instructions                avgt       4716.775           #/op
toCharArray:L1-dcache-loads        avgt       1026.341           #/op
toCharArray:L1-dcache-stores       avgt        416.997           #/op
toCharArray:branches               avgt        419.265           #/op
toCharArray:cycles                 avgt        820.521           #/op
toCharArray:dTLB-loads             avgt       1026.506           #/op
toCharArray:dTLB-stores            avgt        417.591           #/op
toCharArray:instructions           avgt       2409.806           #/op

Could someone help to interpret this and explain such significant difference?

答案1

得分: 1

正如@apangin在他的评论中提到的:

问题在于BlackHole.consume在循环内部被调用。作为一个不被内联的黑匣子方法,它阻止了对调用周围的代码进行优化,特别是对String字段进行缓存。

英文:

As @apangin mentioned in his comment

> The problem is that BlackHole.consume is called inside the loop. Being a non-inlined black box method, it prevents from optimizing the code around the call, in particular, caching String fields.

huangapple
  • 本文由 发表于 2020年8月10日 18:59:00
  • 转载请务必保留本文链接:https://go.coder-hub.com/63338861.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定