2020年8月10日 18:59:00go评论75阅读模式

英文:

Could one explain significant performance difference in char-by-char iteration over j.l.String?

问题

以下是翻译好的部分：

我已经尝试了两种迭代java.lang.String的char-by-char方式，并发现它们令人困惑。基准测试总结如下：

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {&quot;-Xms2g&quot;, &quot;-Xmx2g&quot;})
public class CharByCharIterationBenchmark {

  @Benchmark
  public void toCharArray(Data data, Blackhole b) {
    char[] chars = data.string.toCharArray();
    for (char ch : chars) {
      b.consume(ch);
    }
  }

  @Benchmark
  public void charAt(Data data, Blackhole b) {
    String string = data.string;
    int length = string.length();
    for (int i = 0; i &lt; length; i++) {
      b.consume(string.charAt(i));
    }
  }

  @State(Scope.Thread)
  public static class Data {
    String string;

    @Param({&quot;true&quot;, &quot;false&quot;})
    private boolean latin;

    @Param({&quot;5&quot;, &quot;10&quot;, &quot;50&quot;, &quot;100&quot;})
    private int length;

    @Setup
    public void setup() {
      String alphabet = latin
        ? &quot;abcdefghijklmnopqrstuvwxyz&quot;        // English
        : &quot;абвгдеёжзиклмнопрстуфхцчшщьыъэюя&quot;; // Russian

      RandomStringGenerator generator = new RandomStringGenerator();

      string = generator.randomString(alphabet, length);
    }
  }
}

直观上，toCharArray() 中描述的方法似乎不太有效，因为它在Java 8中分配了底层char[]的副本，并在Java 9及更新版本中将byte[]编码为char[]。然而，实际上情况相反：toCharArray()执行速度要快得多。

首先，我认为这里的原因与Nitsan Wakart的文章“Volatile read surprise”中描述的原因相同。然而，通过使用perfasm进行分析，我发现代码中最热的点与char[]/byte[]字段访问无关：

看起来最热门的地方是iinc（循环索引的增量）和ishr（算术右移）在StringUTF16.length()中，这对我来说相当反直觉。

此外，使用perfnorm分析器，我发现toCharArray()具有比charAt()更少的周期、指令和加载缺失：

请有人帮助解释这个结果，并解释为什么会有如此显著的差异？

英文:

I've tried two ways to iterate char-by-char over java.lang.String and found them confusing. The benchmark summarizes it:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {&quot;-Xms2g&quot;, &quot;-Xmx2g&quot;})
public class CharByCharIterationBenchmark {

  @Benchmark
  public void toCharArray(Data data, Blackhole b) {
    char[] chars = data.string.toCharArray();
    for (char ch : chars) {
      b.consume(ch);
    }
  }

  @Benchmark
  public void charAt(Data data, Blackhole b) {
    String string = data.string;
    int length = string.length();
    for (int i = 0; i &lt; length; i++) {
      b.consume(string.charAt(i));
    }
  }

  @State(Scope.Thread)
  public static class Data {
    String string;

    @Param({&quot;true&quot;, &quot;false&quot;})
    private boolean latin;

    @Param({&quot;5&quot;, &quot;10&quot;, &quot;50&quot;, &quot;100&quot;})
    private int length;

    @Setup
    public void setup() {
      String alphabet = latin
        ? &quot;abcdefghijklmnopqrstuvwxyz&quot;        // English
        : &quot;абвгдеёжзиклмнопрстуфхцчшщьыъэюя&quot;; // Russian

      RandomStringGenerator generator = new RandomStringGenerator();

      string = generator.randomString(alphabet, length);
    }
  }

Intuitively the approach described in toCharArray() seems to be less effective as it allocates a copy of underlying char[] as of Java 8 and encodes byte[] into char[] as of Java 9 and newer. In practise however it's vice versa: toCharArray() performs much faster:

Java 8
(latin)  (length)  Mode      Score     Error   Units
charAt                              true         5  avgt     21.051 &#177;   0.796   ns/op
charAt                              true        10  avgt     44.002 &#177;   2.324   ns/op
charAt                              true        50  avgt    221.068 &#177;   7.422   ns/op
charAt                              true       100  avgt    410.162 &#177;  13.441   ns/op
toCharArray                         true         5  avgt     16.819 &#177;   0.662   ns/op
toCharArray                         true        10  avgt     28.364 &#177;   0.663   ns/op
toCharArray                         true        50  avgt    110.910 &#177;   1.144   ns/op
toCharArray                         true       100  avgt    205.694 &#177;   1.669   ns/op
charAt:&#183;gc.alloc.rate.norm          true         5  avgt     ≈ 10⁻⁵              B/op
charAt:&#183;gc.alloc.rate.norm          true        10  avgt     ≈ 10⁻⁵              B/op
charAt:&#183;gc.alloc.rate.norm          true        50  avgt     ≈ 10⁻⁴              B/op
charAt:&#183;gc.alloc.rate.norm          true       100  avgt     ≈ 10⁻⁴              B/op
toCharArray:&#183;gc.alloc.rate.norm     true         5  avgt     32.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm     true        10  avgt     40.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm     true        50  avgt    120.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm     true       100  avgt    216.000 &#177;   0.001    B/op
charAt                              false        5  avgt     20.372 &#177;   0.406   ns/op
charAt                              false       10  avgt     39.962 &#177;   0.911   ns/op
charAt                              false       50  avgt    201.337 &#177;   3.752   ns/op
charAt                              false      100  avgt    410.530 &#177;  17.931   ns/op
toCharArray                         false        5  avgt     15.767 &#177;   0.606   ns/op
toCharArray                         false       10  avgt     26.258 &#177;   0.345   ns/op
toCharArray                         false       50  avgt    109.631 &#177;   1.064   ns/op
toCharArray                         false      100  avgt    205.815 &#177;   4.716   ns/op
charAt:&#183;gc.alloc.rate.norm          false        5  avgt     ≈ 10⁻⁵              B/op
charAt:&#183;gc.alloc.rate.norm          false       10  avgt     ≈ 10⁻⁵              B/op
charAt:&#183;gc.alloc.rate.norm          false       50  avgt     ≈ 10⁻⁴              B/op
charAt:&#183;gc.alloc.rate.norm          false      100  avgt     ≈ 10⁻⁴              B/op
toCharArray:&#183;gc.alloc.rate.norm     false        5  avgt     32.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm     false       10  avgt     40.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm     false       50  avgt    120.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm     false      100  avgt    216.000 &#177;   0.001    B/op
Java 11
(latin)  (length)  Mode     Score     Error   Units
charAt                               true         5  avgt    22.035 &#177;   1.557   ns/op
charAt                               true        10  avgt    41.800 &#177;   1.572   ns/op
charAt                               true        50  avgt   227.180 &#177;  18.655   ns/op
charAt                               true       100  avgt   474.719 &#177;  29.782   ns/op
toCharArray                          true         5  avgt    17.091 &#177;   0.662   ns/op
toCharArray                          true        10  avgt    26.167 &#177;   0.220   ns/op
toCharArray                          true        50  avgt   127.876 &#177;   2.106   ns/op
toCharArray                          true       100  avgt   244.449 &#177;   9.330   ns/op
charAt:&#183;gc.alloc.rate.norm           true         5  avgt    ≈ 10⁻⁵              B/op
charAt:&#183;gc.alloc.rate.norm           true        10  avgt    ≈ 10⁻⁵              B/op
charAt:&#183;gc.alloc.rate.norm           true        50  avgt    ≈ 10⁻⁴              B/op
charAt:&#183;gc.alloc.rate.norm           true       100  avgt    ≈ 10⁻⁴              B/op
toCharArray:&#183;gc.alloc.rate.norm      true         5  avgt    32.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm      true        10  avgt    40.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm      true        50  avgt   120.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm      true       100  avgt   216.000 &#177;   0.001    B/op
charAt                              false         5  avgt    22.215 &#177;   2.064   ns/op
charAt                              false        10  avgt    45.606 &#177;   2.567   ns/op
charAt                              false        50  avgt   204.577 &#177;  18.302   ns/op
charAt                              false       100  avgt   404.056 &#177;  10.203   ns/op
toCharArray                         false         5  avgt    17.055 &#177;   0.556   ns/op
toCharArray                         false        10  avgt    29.254 &#177;   2.616   ns/op
toCharArray                         false        50  avgt   123.610 &#177;   5.033   ns/op
toCharArray                         false       100  avgt   226.174 &#177;   6.396   ns/op
charAt:&#183;gc.alloc.rate.norm          false         5  avgt    ≈ 10⁻⁵              B/op
charAt:&#183;gc.alloc.rate.norm          false        10  avgt    ≈ 10⁻⁵              B/op
charAt:&#183;gc.alloc.rate.norm          false        50  avgt    ≈ 10⁻⁴              B/op
charAt:&#183;gc.alloc.rate.norm          false       100  avgt    ≈ 10⁻⁴              B/op
toCharArray:&#183;gc.alloc.rate.norm     false         5  avgt    32.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm     false        10  avgt    40.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm     false        50  avgt   120.000 &#177;   0.001    B/op
toCharArray:&#183;gc.alloc.rate.norm     false       100  avgt   216.000 &#177;   0.001    B/op

First I was thinking that reason here is the same as descrbed in Nitsan Wakart's article "Volatile read surprise". However doing profiling with perfasm I see that the hottest spots in the code are not related to char[]/byte[] field access:

           ╭│     0x00007fa638407dd9: jmp    0x00007fa638407e4c
││     0x00007fa638407ddb: nopl   0x0(%rax,%rax,1)
4.96%    ││  ↗  0x00007fa638407de0: shl    $0x3,%r11
0.01%    ││  │  0x00007fa638407de4: movzwl 0x10(%r11,%r13,2),%edx  ;*invokevirtual charAt {reexecute=0 rethrow=0 return_oop=0}
││  │                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
││  │                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.58%    ││ ↗│  0x00007fa638407dea: mov    %rsi,0x18(%rsp)
1.87%    ││ ││  0x00007fa638407def: mov    %r8d,0x14(%rsp)
4.18%    ││ ││  0x00007fa638407df4: mov    %edi,0x10(%rsp)
0.04%    ││ ││  0x00007fa638407df8: mov    %rbx,0x8(%rsp)
1.29%    ││ ││  0x00007fa638407dfd: mov    %r10,(%rsp)
1.83%    ││ ││  0x00007fa638407e01: mov    %r9,0x70(%rsp)
4.32%    ││ ││  0x00007fa638407e06: mov    %rax,0x60(%rsp)
0.05%    ││ ││  0x00007fa638407e0b: mov    %r9,%rsi
1.27%    ││ ││  0x00007fa638407e0e: nop
1.88%    ││ ││  0x00007fa638407e0f: callq  0x00007fa630926e00  ; ImmutableOopMap{[96]=Oop [104]=Oop [112]=Oop [120]=Oop [0]=Oop [16]=NarrowOop [24]=Oop }
││ ││                                                ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
││ ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@28 (line 35)
││ ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
││ ││                                                ;   {optimized virtual_call}
5.71%    ││ ││  0x00007fa638407e14: inc    %ebp               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@31 (line 34)
││ ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.05%    ││ ││  0x00007fa638407e16: cmp    0x14(%rsp),%ebp
0.00%    │╰ ││  0x00007fa638407e1a: jge    0x00007fa638407d87  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@18 (line 34)
│  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.05%    │  ││  0x00007fa638407e20: mov    0x10(%rsp),%edi
4.24%    │  ││  0x00007fa638407e24: movsbl 0x14(%r12,%rdi,8),%ecx  ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
│  ││                                                ; - java.lang.String::isLatin1@7 (line 3266)
│  ││                                                ; - java.lang.String::charAt@1 (line 692)
│  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.86%    │  ││  0x00007fa638407e2a: mov    0xc(%r12,%rdi,8),%r11d  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
│  ││                                                ; - java.lang.String::charAt@8 (line 693)
│  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
1.67%    │  ││  0x00007fa638407e2f: mov    0x60(%rsp),%rax
1.71%    │  ││  0x00007fa638407e34: mov    0x70(%rsp),%r9
3.92%    │  ││  0x00007fa638407e39: mov    (%rsp),%r10
0.20%    │  ││  0x00007fa638407e3d: mov    0x8(%rsp),%rbx
1.44%    │  ││  0x00007fa638407e42: mov    0x14(%rsp),%r8d
1.70%    │  ││  0x00007fa638407e47: mov    0x18(%rsp),%rsi    ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
│  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@21 (line 35)
│  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.93%    ↘  ││  0x00007fa638407e4c: movslq %ebp,%r13          ;*invokestatic getChar {reexecute=0 rethrow=0 return_oop=0}
││                                                ; - java.lang.StringUTF16::charAt@7 (line 1268)
││                                                ; - java.lang.String::charAt@21 (line 695)
││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.23%       ││  0x00007fa638407e4f: test   %ecx,%ecx
0.00%      ╭││  0x00007fa638407e51: jne    0x00007fa638407e6b  ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
│││                                                ; - java.lang.String::charAt@4 (line 692)
│││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
│││  0x00007fa638407e53: mov    0xc(%r12,%r11,8),%edx  ; implicit exception: dispatches to 0x00007fa638407fbc
│││  0x00007fa638407e58: cmp    %edx,%ebp
│││  0x00007fa638407e5a: jae    0x00007fa638407eb0
│││  0x00007fa638407e5c: shl    $0x3,%r11
│││  0x00007fa638407e60: movzbl 0x10(%r11,%r13,1),%edx  ;*iand {reexecute=0 rethrow=0 return_oop=0}
│││                                                ; - java.lang.StringLatin1::charAt@25 (line 49)
│││                                                ; - java.lang.String::charAt@12 (line 693)
│││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
│╰│  0x00007fa638407e66: jmpq   0x00007fa638407dea
1.52%      ↘ │  0x00007fa638407e6b: mov    0xc(%r12,%r11,8),%ecx  ; implicit exception: dispatches to 0x00007fa638407fb0
5.99%        │  0x00007fa638407e70: sar    %ecx               ;*ishr {reexecute=0 rethrow=0 return_oop=0}
│                                                ; - java.lang.StringUTF16::length@3 (line 74)
│                                                ; - java.lang.StringUTF16::checkIndex@2 (line 1470)
│                                                ; - java.lang.StringUTF16::charAt@2 (line 1267)
│                                                ; - java.lang.String::charAt@21 (line 695)
│                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
5.51%        │  0x00007fa638407e72: cmp    %ecx,%ebp
0.01%        ╰  0x00007fa638407e74: jb     0x00007fa638407de0  ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}

It looks like the hottest place is iinc (increment of loop index) and ishr arythmetic shift in StringUTF16.length() which is rather conter-intuitive to me.

Also using perfnorm profiler I see that toCharArray() has less cycles, instructions and load-misses comparing to chatAt():

Benchmark                          Mode  Cnt     Score   Error  Units
charAt:L1-dcache-loads             avgt       2104.816           #/op
charAt:L1-dcache-stores            avgt       1200.878           #/op
charAt:branches                    avgt        603.754           #/op
charAt:cycles                      avgt       1461.282           #/op
charAt:dTLB-loads                  avgt       2105.253           #/op
charAt:dTLB-stores                 avgt       1200.909           #/op
charAt:instructions                avgt       4716.775           #/op
toCharArray:L1-dcache-loads        avgt       1026.341           #/op
toCharArray:L1-dcache-stores       avgt        416.997           #/op
toCharArray:branches               avgt        419.265           #/op
toCharArray:cycles                 avgt        820.521           #/op
toCharArray:dTLB-loads             avgt       1026.506           #/op
toCharArray:dTLB-stores            avgt        417.591           #/op
toCharArray:instructions           avgt       2409.806           #/op

Could someone help to interpret this and explain such significant difference?

答案1

得分: 1

正如@apangin在他的评论中提到的：

问题在于BlackHole.consume在循环内部被调用。作为一个不被内联的黑匣子方法，它阻止了对调用周围的代码进行优化，特别是对String字段进行缓存。

英文:

As @apangin mentioned in his comment

> The problem is that BlackHole.consume is called inside the loop. Being a non-inlined black box method, it prevents from optimizing the code around the call, in particular, caching String fields.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

能否解释一下在逐字符迭代上与j.l.String相比存在的显著性能差异？

问题

答案1

我不明白为什么 mkdir() 返回 false？

Is it possible to configure a PropertySourcesPlaceholderConfigurer that reads from DataBase which further depends on properties file?

strings.Contains in switch-case GoLang

为什么在Java中元素会自动添加xmlns = “”？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论