Decreasing performance of new JVMs

Question

Summing all elements of an array in a for loop has lower throughput on newer JVMs than on the JVM from JDK 1.8.0. I ran a JMH benchmark (plots below). Before each test, the sources were compiled with javac.exe and run with java.exe, both taken from the selected JDK. The tests were performed on Windows 10 and launched by a PowerShell script with no other programs running in the background (no other JVMs). The computer has 32 GB of RAM, so virtual memory on the HDD was not used.

10M elements in the array:
[Plot: benchmark results, 10M elements]

100M elements in the array:
[Plot: benchmark results, 100M elements]

Source code of my test:

@Param({"10000000", "100000000"})
public static int ELEMENTS;

public static void main(String[] args) throws RunnerException, IOException {
    File outputFile = new File(args[0]);

    int javaMajorVersion = Integer.parseInt(System.getProperty("java.version").split("\\.")[0]);

    ChainedOptionsBuilder builder = new OptionsBuilder()
            .include(IteratingBenchmark.class.getSimpleName())
            .mode(Mode.Throughput)
            .forks(2)
            .measurementTime(TimeValue.seconds(10))
            .measurementIterations(50)
            .warmupTime(TimeValue.seconds(2))
            .warmupIterations(10)
            .resultFormat(ResultFormatType.SCSV)
            .result(outputFile.getAbsolutePath());

    if (javaMajorVersion > 8) {
        builder = builder.jvmArgs("-Xms20g", "-Xmx20g", "--enable-preview");
    } else {
        builder = builder.jvmArgs("-Xms20g", "-Xmx20g");
    }

    new Runner(builder.build()).run();
}

@Benchmark
public static void cStyleForLoop(Blackhole bh, MockData data) {
    long sum = 0;
    for (int i = 0; i < data.randomInts.length; i++) {
        sum += data.randomInts[i];
    }

    bh.consume(sum);
}

@State(Scope.Thread)
public static class MockData {
    private int[] randomInts = new int[ELEMENTS];

    @Setup(Level.Iteration)
    public void setup() {
        Random r = new Random();
        this.randomInts = Stream.iterate(r.nextInt(), i -> i + r.nextInt(1022) + 1).mapToInt(Integer::intValue).limit(ELEMENTS).toArray();
    }
}

Raw data:

JDK 1.8.0_241:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;331,446104;5,563589;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;33,757268;0,431403;"ops/s";100000000

JDK 11.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;322,728461;4,823611;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,075948;0,062830;"ops/s";100000000

JDK 12.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;322,914782;4,450969;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,095232;0,075051;"ops/s";100000000

JDK 13.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;325,103055;4,933257;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,228403;0,067954;"ops/s";100000000

JDK 14.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;300,861148;0,443404;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;29,863602;0,035781;"ops/s";100000000

OpenJDK 14.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;300,781930;0,481579;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;29,873509;0,033055;"ops/s";100000000

OpenJDK 15:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;343,530895;0,445551;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;34,287083;0,035028;"ops/s";100000000

Is there a valid explanation for why newer versions of Java are slower than 1.8 (except OpenJDK 15)?
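
To put a number on the gap, the raw SCSV rows can be compared directly. The following is a minimal sketch, not part of the original benchmark: the class name `ScoreComparison` and both helper methods are hypothetical, and it assumes the score is always the fifth semicolon-separated field and uses a decimal comma, as in the rows above.

```java
import java.util.Locale;

/** Sketch: compares JMH SCSV scores that use a decimal comma (hypothetical helper). */
final class ScoreComparison {

    /** Extracts the "Score" column (fifth field) from one JMH SCSV result row. */
    static double parseScore(String scsvRow) {
        String[] fields = scsvRow.split(";");
        // The rows in the question use a decimal comma, e.g. 33,757268.
        return Double.parseDouble(fields[4].replace(',', '.'));
    }

    /** Relative change of candidate vs. baseline in percent; negative means slower. */
    static double relativeChangePercent(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // 100M-element rows for JDK 1.8.0_241 and JDK 11.0.2, copied from the raw data.
        String jdk8 = "\"benchmark.IteratingBenchmark.cStyleForLoop\";\"thrpt\";1;100;33,757268;0,431403;\"ops/s\";100000000";
        String jdk11 = "\"benchmark.IteratingBenchmark.cStyleForLoop\";\"thrpt\";1;100;31,075948;0,062830;\"ops/s\";100000000";

        double change = relativeChangePercent(parseScore(jdk8), parseScore(jdk11));
        System.out.printf(Locale.ROOT, "JDK 11 vs JDK 8: %.2f%%%n", change);
    }
}
```

For the two rows used here the change comes out at roughly -8%, which is well outside the reported score errors.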

UPDATE 1:

I ran the same tests with different Xmx/Xms values (Xmx == Xms in each test); results below:

[Plot: benchmark results for different Xmx/Xms values]

UPDATE 2:

  • First, I changed Level.Iteration to Level.Trial.
  • Second, I forced the G1 garbage collector.
  • Third, Xmx/Xms was set to 8 GB.

Results:
[Plot: benchmark results after the changes above]

Raw data:

JDK 1.8.0_241:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;33,760346;0,089646;"ops/s";100000000

JDK 11.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,075120;0,086171;"ops/s";100000000

JDK 12.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,173939;0,044176;"ops/s";100000000

JDK 13.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,219283;0,062329;"ops/s";100000000

JDK 14.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;29,808609;0,072664;"ops/s";100000000

OpenJDK 14.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;29,845817;0,074315;"ops/s";100000000

OpenJDK 15:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;34,310620;0,087412;"ops/s";100000000

UPDATE 3:
I created a GitHub repository containing the benchmark source code and a script that runs the benchmark with the JMH parameters I used and automatically generates plots in PNG format.
Additionally, I ran the benchmark on another machine (Linux). The results from the Linux machine look more optimistic:
[Plot: benchmark results on the Linux machine]

Unfortunately, on my Windows machine the results still show decreasing performance (excluding JDK 15).

UPDATE 4:
Results with -XX:-UseCountedLoopSafepoints:
[Plot: benchmark results with -XX:-UseCountedLoopSafepoints]

Answer 1

Score: 2


<details>
<summary>English:</summary>

Even after copying your benchmark verbatim from [GitHub](https://github.com/JakubBialy/javaloopbenchmark) and running it with the same parameters, I still cannot reproduce the results. In my environment, JDK 14 is as fast as JDK 8 (even a little faster). So, in this answer I'll analyze the difference between the two versions based on the disassembly of the compiled code.

First, let's take the most recent OpenJDK builds from the same vendor.  
Here I compare [Liberica JDK 8u265+1](https://bell-sw.com/pages/downloads/#/java-8-lts) and [Liberica JDK 14.0.2+13](https://bell-sw.com/pages/downloads/#/java-14-current) for Windows 64 bit.

JMH scores are the following:

    Benchmark                         (ELEMENTS)   Mode  Cnt    Score   Error  Units
    IteratingBenchmark.cStyleForLoop    10000000  thrpt   30  263,137 ± 0,484  ops/s  # JDK 8
    IteratingBenchmark.cStyleForLoop    10000000  thrpt   30  264,406 ± 0,788  ops/s  # JDK 14


Now let's run JMH with the built-in `-prof xperfasm` profiler to see the disassembly of the hottest part of the benchmark. As expected, ~99.5% of CPU time is spent in the C2-compiled `cStyleForLoop` method.

**Hottest region on JDK 8**

    ....[Hottest Region 1]..............................................................................
    C2, level 4, codes.dbg.IteratingBenchmark::cStyleForLoop, version 574 (71 bytes)

                 0x0000028c5607fc5f: add     r10d,0fffffff9h
                 0x0000028c5607fc63: lea     rax,[r12+rcx*8]
                 0x0000028c5607fc67: mov     ebx,80000000h
                 0x0000028c5607fc6c: cmp     r9d,r10d
                 0x0000028c5607fc6f: cmovl   r10d,ebx
                 0x0000028c5607fc73: mov     r9d,1h
                 0x0000028c5607fc79: cmp     r10d,1h
             ╭   0x0000028c5607fc7d: jle     28c5607fccch
             │   0x0000028c5607fc7f: nop                       ;*lload_2
             │                                                 ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
      0,07%  │↗  0x0000028c5607fc80: movsxd  rbx,dword ptr [rax+r9*4+10h]
      0,06%  ││  0x0000028c5607fc85: add     rbx,r8
      8,93%  ││  0x0000028c5607fc88: movsxd  rcx,r9d
      0,41%  ││  0x0000028c5607fc8b: movsxd  r8,dword ptr [rax+rcx*4+2ch]
     25,02%  ││  0x0000028c5607fc90: movsxd  rdi,dword ptr [rax+rcx*4+14h]
      0,10%  ││  0x0000028c5607fc95: movsxd  rsi,dword ptr [rax+rcx*4+18h]
      8,56%  ││  0x0000028c5607fc9a: movsxd  rbp,dword ptr [rax+rcx*4+28h]
      0,58%  ││  0x0000028c5607fc9f: movsxd  r13,dword ptr [rax+rcx*4+1ch]
      0,41%  ││  0x0000028c5607fca4: movsxd  r14,dword ptr [rax+rcx*4+20h]
      0,20%  ││  0x0000028c5607fca9: movsxd  rcx,dword ptr [rax+rcx*4+24h]
      8,85%  ││  0x0000028c5607fcae: add     rdi,rbx
      0,38%  ││  0x0000028c5607fcb1: add     rsi,rdi
      0,15%  ││  0x0000028c5607fcb4: add     r13,rsi
      8,57%  ││  0x0000028c5607fcb7: add     r14,r13
     13,76%  ││  0x0000028c5607fcba: add     rcx,r14
      5,51%  ││  0x0000028c5607fcbd: add     rbp,rcx
      8,50%  ││  0x0000028c5607fcc0: add     r8,rbp            ;*ladd
             ││                                                ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
      8,95%  ││  0x0000028c5607fcc3: add     r9d,8h            ;*iinc
             ││                                                ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
      0,40%  ││  0x0000028c5607fcc7: cmp     r9d,r10d
             │╰  0x0000028c5607fcca: jl      28c5607fc80h      ;*if_icmpge
             │                                                 ; - codes.dbg.IteratingBenchmark::cStyleForLoop@12 (line 24)
             ↘   0x0000028c5607fccc: cmp     r9d,edx
                 0x0000028c5607fccf: jnl     28c5607fce4h
                 0x0000028c5607fcd1: nop                       ;*lload_2
                                                               ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
                 0x0000028c5607fcd4: movsxd  r10,dword ptr [rax+r9*4+10h]
                 0x0000028c5607fcd9: add     r8,r10            ;*ladd
                                                               ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
    ....................................................................................................


**Hottest region on JDK 14**

    ....[Hottest Region 1]..............................................................................
    c2, level 4, codes.dbg.IteratingBenchmark::cStyleForLoop, version 622 (147 bytes)

                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@23 (line 25)
                   0x000001e844438f72:   mov     r11d,r10d
                   0x000001e844438f75:   add     r11d,0fffffff9h
                   0x000001e844438f79:   lea     rax,[r12+r9*8]
                   0x000001e844438f7d:   mov     ebx,1h
                   0x000001e844438f82:   cmp     r11d,1h
                   0x000001e844438f86:   jle     1e8444390c0h                ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
             ╭     0x000001e844438f8c:   jmp     1e844438ffah
             │     0x000001e844438f8e:   nop
      0,04%  │↗    0x000001e844438f90:   mov     rsi,r8                      ;*lload_2 {reexecute=0 rethrow=0 return_oop=0}
             ││                                                              ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
      0,04%  ││ ↗  0x000001e844438f93:   movsxd  rdx,dword ptr [rax+rbx*4+10h]
      8,41%  ││ │  0x000001e844438f98:   movsxd  rbp,dword ptr [rax+rbx*4+14h]
      1,23%  ││ │  0x000001e844438f9d:   movsxd  r13,dword ptr [rax+rbx*4+18h]
      0,03%  ││ │  0x000001e844438fa2:   movsxd  r8,dword ptr [rax+rbx*4+2ch]
     23,87%  ││ │  0x000001e844438fa7:   movsxd  r11,dword ptr [rax+rbx*4+28h]
      8,22%  ││ │  0x000001e844438fac:   movsxd  r9,dword ptr [rax+rbx*4+24h]
      1,25%  ││ │  0x000001e844438fb1:   movsxd  rcx,dword ptr [rax+rbx*4+20h]
      0,14%  ││ │  0x000001e844438fb6:   movsxd  r14,dword ptr [rax+rbx*4+1ch]
      0,28%  ││ │  0x000001e844438fbb:   add     rdx,rsi
      7,82%  ││ │  0x000001e844438fbe:   add     rbp,rdx
      1,14%  ││ │  0x000001e844438fc1:   add     r13,rbp
      0,17%  ││ │  0x000001e844438fc4:   add     r14,r13
     14,57%  ││ │  0x000001e844438fc7:   add     rcx,r14
     11,05%  ││ │  0x000001e844438fca:   add     r9,rcx
      5,26%  ││ │  0x000001e844438fcd:   add     r11,r9
      6,32%  ││ │  0x000001e844438fd0:   add     r8,r11                      ;*ladd {reexecute=0 rethrow=0 return_oop=0}
             ││ │                                                            ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
      8,45%  ││ │  0x000001e844438fd3:   add     ebx,8h                      ;*iinc {reexecute=0 rethrow=0 return_oop=0}
             ││ │                                                            ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
      1,15%  ││ │  0x000001e844438fd6:   cmp     ebx,edi
             │╰ │  0x000001e844438fd8:   jl      1e844438f90h                ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
             │  │                                                            ; - codes.dbg.IteratingBenchmark::cStyleForLoop@12 (line 24)
             │  │  0x000001e844438fda:   mov     r11,qword ptr [r15+110h]    ; ImmutableOopMap {rax=Oop xmm0=Oop xmm1=Oop }
             │  │                                                            ;*goto {reexecute=1 rethrow=0 return_oop=0}
             │  │                                                            ; - (reexecute) codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
      0,00%  │  │  0x000001e844438fe1:   test    dword ptr [r11],eax         ;*goto {reexecute=0 rethrow=0 return_oop=0}
             │  │                                                            ; - codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
             │  │                                                            ; {poll}
      0,02%  │  │  0x000001e844438fe4:   cmp     ebx,dword ptr [rsp]
             │ ╭│  0x000001e844438fe7:   jnl     1e844439028h
      0,00%  │ ││  0x000001e844438fe9:   mov     rsi,r8
             │ ││  0x000001e844438fec:   vmovq   r8,xmm0
             │ ││  0x000001e844438ff1:   vmovq   rdx,xmm1
      0,01%  │ ││  0x000001e844438ff6:   mov     r11d,dword ptr [rsp]
             ↘ ││  0x000001e844438ffa:   mov     ecx,r10d
               ││  0x000001e844438ffd:   sub     ecx,ebx
               ││  0x000001e844438fff:   add     ecx,0fffffff9h
      0,00%    ││  0x000001e844439002:   mov     r9d,1f40h
               ││  0x000001e844439008:   cmp     r9d,ecx
               ││  0x000001e84443900b:   mov     edi,1f40h
               ││  0x000001e844439010:   cmovnle edi,ecx
      0,02%    ││  0x000001e844439013:   add     edi,ebx
               ││  0x000001e844439015:   vmovq   xmm0,r8
               ││  0x000001e84443901a:   vmovq   xmm1,rdx
               ││  0x000001e84443901f:   mov     dword ptr [rsp],r11d
      0,01%    │╰  0x000001e844439023:   jmp     1e844438f93h
               ↘   0x000001e844439028:   vmovq   rdx,xmm1
                   0x000001e84443902d:   cmp     ebx,r10d
                   0x000001e844439030:   jnl     1e844439043h
                   0x000001e844439032:   nop                                 ;*lload_2 {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
                   0x000001e844439034:   movsxd  r11,dword ptr [rax+rbx*4+10h]
                   0x000001e844439039:   add     r8,r11                      ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
                   0x000001e84443903c:   inc     ebx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
    ....................................................................................................


As we can see, the loop body is compiled similarly on both JDKs:

 - 8 loop iterations are unrolled;
 - there are 8 loads from the array without bounds check, followed by 8 `add` instructions;
 - the order of loads is a bit different, but all the addresses share the same or the adjacent cache line anyway.

The key difference is that on JDK 14 the loop is split into two nested blocks. This is the result of the [Loop strip mining](https://bugs.openjdk.java.net/browse/JDK-8186027) optimization, which appeared in JDK 10. The idea of this optimization is to split the counted loop into a hot inner part without a safepoint poll and an outer part with a safepoint poll instruction.

The C2 JIT transforms the loop into something like this:

    for (int i = 0; i < array.length; i += 8000) {
        for (int j = 0; j < 8000; j += 8) {
            int ix = i + j;
            int v0 = array[ix];
            int v1 = array[ix + 1];
            ...
            int v7 = array[ix + 7];
            sum += v0 + v1 + ... + v7;
        }
        safepoint_poll();
    }
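
The same shape can be written by hand to check that it computes the same sum as the plain loop. This is only a sketch under assumptions: the class and method names are hypothetical, the inner loop is left un-unrolled for readability, and the strip length of 8000 is taken from the `1f40h` constant in the JDK 14 disassembly above. It illustrates the structure, not what HotSpot literally emits.

```java
import java.util.Random;

/** Hand-written analogue of C2's loop strip mining; illustrative only. */
final class StripMiningSketch {

    /** Elements per strip; 8000 (0x1f40) is the limit visible in the JDK 14 disassembly. */
    static final int STRIP = 8000;

    /** The loop as written in the benchmark. */
    static long plainSum(int[] array) {
        long sum = 0;
        for (int i = 0; i < array.length; i++) {
            sum += array[i];
        }
        return sum;
    }

    /** Same sum, split into an outer loop over strips and a poll-free inner loop. */
    static long stripMinedSum(int[] array) {
        long sum = 0;
        int i = 0;
        while (i < array.length) {
            // Clamp the strip to the remaining elements, like the cmovnle in the disassembly.
            int end = i + Math.min(STRIP, array.length - i);
            for (; i < end; i++) {
                sum += array[i];
            }
            // In JIT-compiled code the safepoint poll would sit here, once per strip.
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(20_001, -1000, 1000).toArray();
        System.out.println(plainSum(data) == stripMinedSum(data));
    }
}
```

The clamp matters because the array length is rarely a multiple of the strip length; without it the last strip would read past the end of the array.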

Note that the JDK 8 version does not have a safepoint poll inside the counted loop at all. On one hand, this can make the loop run faster. On the other hand, it is actually bad for low-latency applications, since the pause time may increase by the duration of the entire loop.

JDK 14 inserts a safepoint poll inside the loop. This *might* be a reason for the slowdown you observe, but I don't really believe it: thanks to loop strip mining, the safepoint poll is executed only once per 8000 iterations.

To verify this, you may disable safepoint polling with the `-XX:-UseCountedLoopSafepoints` JVM option. In this case, the code compiled by JDK 14 will look almost identical to the JDK 8 version, and so will the benchmark scores.
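
One way to wire the option into the benchmark runner from the question is to build the fork's argument list in one place. The sketch below is an assumption-laden illustration: `ForkJvmArgs` and `build` are hypothetical names, the heap and `--enable-preview` settings are copied from the question's `main`, and it assumes every JDK under test recognizes the flag.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds the JVM arguments for the forked benchmark JVM (hypothetical helper). */
final class ForkJvmArgs {

    /**
     * @param javaMajorVersion             major version as computed in the question's main()
     * @param disableCountedLoopSafepoints whether to add -XX:-UseCountedLoopSafepoints
     */
    static List<String> build(int javaMajorVersion, boolean disableCountedLoopSafepoints) {
        List<String> args = new ArrayList<>();
        // Same fixed heap as in the question's benchmark.
        args.add("-Xms20g");
        args.add("-Xmx20g");
        if (javaMajorVersion > 8) {
            args.add("--enable-preview");
        }
        if (disableCountedLoopSafepoints) {
            // Assumption: the flag is recognized by every JDK build under test.
            args.add("-XX:-UseCountedLoopSafepoints");
        }
        return args;
    }

    public static void main(String[] args) {
        System.out.println(build(14, true));
        System.out.println(build(1, false));
    }
}
```

In the question's `main`, the result could then be passed as `builder.jvmArgs(ForkJvmArgs.build(javaMajorVersion, true).toArray(new String[0]))`, so that runs with and without the flag differ only in that one argument.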

</details>


