英文:
Decreasing performance of new JVMs
问题
以下是你提供的内容的翻译:
性能(吞吐量)在 for 循环中对数组中的所有元素求和方面,在较新的 JVM 上比 Java 1.8.0 JDK 中的 JVM 要慢。我进行了 JHM 基准测试(如下图所示)。在每次测试之前,使用提供的 javac.exe 编译了源代码,并由选定的 JDK 提供的 java.exe 运行,测试在 Windows 10 上执行,由 PowerShell 脚本启动,在后台没有运行任何程序(没有其他 JVM)。计算机配备有32GB的RAM,因此在HDD上没有使用虚拟内存。
我的测试源代码:
@Param({"10000000", "100000000"})
public static int ELEMENTS;
public static void main(String[] args) throws RunnerException, IOException {
// ... (此处省略了一些代码)
}
@Benchmark
public static void cStyleForLoop(Blackhole bh, MockData data) {
// ... (此处省略了一些代码)
}
@State(Scope.Thread)
public static class MockData {
// ... (此处省略了一些代码)
}
原始数据:
JDK 1.8.0_241:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;331,446104;5,563589;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;33,757268;0,431403;"ops/s";100000000
JDK 11.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;322,728461;4,823611;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,075948;0,062830;"ops/s";100000000
(... 其他版本的JDK的数据,此处省略 ...)
OpenJDK 15:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;343,530895;0,445551;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;34,287083;0,035028;"ops/s";100000000
是否有合理的解释,为什么较新版本的Java比1.8慢(除了OpenJDK 15)?<br><br>
更新1:
我针对不同的Xmx/Xms值运行了相同的测试(每个测试中Xmx == Xms),结果如下图所示:
更新2:
- 首先,我将
Level.Iteration
更改为Level.Trial
。 - 其次,我强制使用了G1垃圾收集器。
- 第三,Xmx/Xms设置为8GB。
原始数据:
JDK 1.8.0_241:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;33,760346;0,089646;"ops/s";100000000
(... 其他版本的JDK的数据,此处省略 ...)
OpenJDK 15:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;34,310620;0,087412;"ops/s";100000000
更新3:<br>
我创建了一个GitHub仓库,其中包含基准测试源代码,以及使用我使用的JMH参数执行基准测试的脚本,该脚本会自动生成png格式的图表。
此外,我在另一台机器上(Linux)执行了基准测试。来自Linux机器的结果似乎更为乐观:
不幸的是,在我的Windows机器上,结果仍然显示性能下降(不包括JDK 15)。
更新4:
使用-XX:-UseCountedLoopSafepoints
的结果:
英文:
The performance (throughput) of summing all elements in the array in for loop is slower on newer JVMs, than on JVM from Java 1.8.0 JDK. I performed JHM benchmark (plots below). Before each test, sources were compiled by provided javac.exe and run by java.exe, both binaries provided by selected JDK. Tests were performed on Windows 10 and launched by powershell script without any programs running in the background (no other jvms). The computer was equipped with 32GB of RAM, so virtual memory on HDD was not used.
Source code of my test:
@Param({"10000000", "100000000"})
public static int ELEMENTS;
public static void main(String[] args) throws RunnerException, IOException {
File outputFile = new File(args[0]);
int javaMajorVersion = Integer.parseInt(System.getProperty("java.version").split("\\.")[0]);
ChainedOptionsBuilder builder = new OptionsBuilder()
.include(IteratingBenchmark.class.getSimpleName())
.mode(Mode.Throughput)
.forks(2)
.measurementTime(TimeValue.seconds(10))
.measurementIterations(50)
.warmupTime(TimeValue.seconds(2))
.warmupIterations(10)
.resultFormat(ResultFormatType.SCSV)
.result(outputFile.getAbsolutePath());
if (javaMajorVersion > 8) {
builder = builder.jvmArgs("-Xms20g", "-Xmx20g", "--enable-preview");
} else {
builder = builder.jvmArgs("-Xms20g", "-Xmx20g");
}
new Runner(builder.build()).run();
}
@Benchmark
public static void cStyleForLoop(Blackhole bh, MockData data) {
long sum = 0;
for (int i = 0; i < data.randomInts.length; i++) {
sum += data.randomInts[i];
}
bh.consume(sum);
}
@State(Scope.Thread)
public static class MockData {
private int[] randomInts = new int[ELEMENTS];
@Setup(Level.Iteration)
public void setup() {
Random r = new Random();
this.randomInts = Stream.iterate(r.nextInt(), i -> i + r.nextInt(1022) + 1).mapToInt(Integer::intValue).limit(ELEMENTS).toArray();
}
}
Raw data:
JDK 1.8.0_241:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;331,446104;5,563589;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;33,757268;0,431403;"ops/s";100000000
JDK 11.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;322,728461;4,823611;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,075948;0,062830;"ops/s";100000000
JDK 12.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;322,914782;4,450969;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,095232;0,075051;"ops/s";100000000
JDK 13.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;325,103055;4,933257;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,228403;0,067954;"ops/s";100000000
JDK 14.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;300,861148;0,443404;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;29,863602;0,035781;"ops/s";100000000
OpenJDK 14.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;300,781930;0,481579;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;29,873509;0,033055;"ops/s";100000000
OpenJDK 15:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;343,530895;0,445551;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;34,287083;0,035028;"ops/s";100000000
Is there any valid explanation, why newer versions of Java are slower than 1.8 (except OpenJDK 15)?<br><br>
UPDATE 1:
I run same tests for different Xmx/Xms values (for each test Xmx == Xms), results below:
UPDATE 2:
- Firstly, I changed
Level.Iteration
toLevel.Trial
. - Secondly, I forced G1 garbage collector.
- Thirdly, Xmx/Xms was set to 8GB
Raw data:
JDK 1.8.0_241:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;33,760346;0,089646;"ops/s";100000000
JDK 11.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,075120;0,086171;"ops/s";100000000
JDK 12.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,173939;0,044176;"ops/s";100000000
JDK 13.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,219283;0,062329;"ops/s";100000000
JDK 14.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;29,808609;0,072664;"ops/s";100000000
OpenJDK 14.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;29,845817;0,074315;"ops/s";100000000
OpenJDK 15:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;34,310620;0,087412;"ops/s";100000000
UPDATE 3:<br>
I made GitHub Repository containing benchmark source code, and script to perform benchmark with JMH parameters used by me, which automatically generates plots in png format.
Additionally I performed benchmark on other machine (Linux).<br> Results from Linux machine seems to be more optimistic:
Unfortunately, on my Windows machine, results still show decreasing performance (excluding JDK 15).
答案1
得分: 2
以下是您要翻译的内容:
即使我从 GitHub 上逐字复制您的基准测试,并使用相同的参数运行,我仍然无法复现结果。在我的环境中,JDK 14 的性能与 JDK 8 一样快(甚至稍微更快一些)。因此,在这个答案中,我将根据编译后的代码的反汇编分析两个版本之间的差异。
首先,让我们从同一供应商获取最新的 OpenJDK 构建版本。
在这里,我比较了 Liberica JDK 8u265+1 和 Liberica JDK 14.0.2+13 适用于 Windows 64 位系统。
JMH 得分如下:
Benchmark (ELEMENTS) Mode Cnt Score Error Units
IteratingBenchmark.cStyleForLoop 10000000 thrpt 30 263.137 ± 0.484 ops/s # JDK 8
IteratingBenchmark.cStyleForLoop 10000000 thrpt 30 264.406 ± 0.788 ops/s # JDK 14
现在,让我们使用内置的 -prof xperfasm
分析器运行 JMH,以查看基准测试中热点部分的反汇编结果。预计地,大约 99.5% 的 CPU 时间花在了 C2 编译的 cStyleForLoop
方法中。
JDK 8 上的热点区域
....[热点区域 1]..............................................................................
C2,级别 4,codes.dbg.IteratingBenchmark::cStyleForLoop,版本 574(71 字节)
0x0000028c5607fc5f: add r10d,0fffffff9h
0x0000028c5607fc63: lea rax,[r12+rcx*8]
0x0000028c5607fc67: mov ebx,80000000h
0x0000028c5607fc6c: cmp r9d,r10d
0x0000028c5607fc6f: cmovl r10d,ebx
0x0000028c5607fc73: mov r9d,1h
0x0000028c5607fc79: cmp r10d,1h
╭ 0x0000028c5607fc7d: jle 28c5607fccch
│ 0x0000028c5607fc7f: nop ;*lload_2
│ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
0.07% │↗ 0x0000028c5607fc80: movsxd rbx,dword ptr [rax+r9*4+10h]
0.06% ││ 0x0000028c5607fc85: add rbx,r8
8.93% ││ 0x0000028c5607fc88: movsxd rcx,r9d
0.41% ││ 0x0000028c5607fc8b: movsxd r8,dword ptr [rax+rcx*4+2ch]
25.02% ││ 0x0000028c5607fc90: movsxd rdi,dword ptr [rax+rcx*4+14h]
0.10% ││ 0x0000028c5607fc95: movsxd rsi,dword ptr [rax+rcx*4+18h]
8.56% ││ 0x0000028c5607fc9a: movsxd rbp,dword ptr [rax+rcx*4+28h]
0.58% ││ 0x0000028c5607fc9f: movsxd r13,dword ptr [rax+rcx*4+1ch]
0.41% ││ 0x0000028c5607fca4: movsxd r14,dword ptr [rax+rcx*4+20h]
0.20% ││ 0x0000028c5607fca9: movsxd rcx,dword ptr [rax+rcx*4+24h]
8.85% ││ 0x0000028c5607fcae: add rdi,rbx
0.38% ││ 0x0000028c5607fcb1: add rsi,rdi
0.15% ││ 0x0000028c5607fcb4: add r13,rsi
8.57% ││ 0x0000028c5607fcb7: add r14,r13
13.76% ││ 0x0000028c5607fcba: add rcx,r14
5.51% ││ 0x0000028c5607fcbd: add rbp,rcx
8.50% ││ 0x0000028c5607fcc0: add r8,rbp ;*ladd
││ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
8.95% ││ 0x0000028c5607fcc3: add r9d,8h ;*iinc
││ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
0.40% ││ 0x0000028c5607fcc7: cmp r9d,r10d
│╰ 0x0000028c5607fcca: jl 28c5607fc80h ;*if_icmpge
│ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@12 (line 24)
↘ 0x0000028c5607fccc: cmp
<details>
<summary>英文:</summary>
Even after copying your benchmark verbatim from [GitHub](https://github.com/JakubBialy/javaloopbenchmark) and running with the same parameters, I still cannot reproduce the results. In my environment, JDK 14 performs as fast as JDK 8 (even a little bit faster). So, in this answer I'll analyze the difference between both versions basing on the disassembly of the compiled code.
First, let's take the most recent OpenJDK builds from the same vendor.
Here I compare [Liberica JDK 8u265+1](https://bell-sw.com/pages/downloads/#/java-8-lts) and [Liberica JDK 14.0.2+13](https://bell-sw.com/pages/downloads/#/java-14-current) for Windows 64 bit.
JMH scores are the following:
Benchmark (ELEMENTS) Mode Cnt Score Error Units
IteratingBenchmark.cStyleForLoop 10000000 thrpt 30 263,137 ± 0,484 ops/s # JDK 8
IteratingBenchmark.cStyleForLoop 10000000 thrpt 30 264,406 ± 0,788 ops/s # JDK 14
Now let's run JMH with built-in `-prof xperfasm` profiler to see the disassembly of the hottest part of the benchmark. Expectedly, ~99.5% CPU time is spent in C2-compiled `cStyleForLoop` method.
**Hottest region on JDK 8**
....[Hottest Region 1]..............................................................................
C2, level 4, codes.dbg.IteratingBenchmark::cStyleForLoop, version 574 (71 bytes)
0x0000028c5607fc5f: add r10d,0fffffff9h
0x0000028c5607fc63: lea rax,[r12+rcx*8]
0x0000028c5607fc67: mov ebx,80000000h
0x0000028c5607fc6c: cmp r9d,r10d
0x0000028c5607fc6f: cmovl r10d,ebx
0x0000028c5607fc73: mov r9d,1h
0x0000028c5607fc79: cmp r10d,1h
╭ 0x0000028c5607fc7d: jle 28c5607fccch
│ 0x0000028c5607fc7f: nop ;*lload_2
│ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
0,07% │↗ 0x0000028c5607fc80: movsxd rbx,dword ptr [rax+r94+10h]
0,06% ││ 0x0000028c5607fc85: add rbx,r8
8,93% ││ 0x0000028c5607fc88: movsxd rcx,r9d
0,41% ││ 0x0000028c5607fc8b: movsxd r8,dword ptr [rax+rcx4+2ch]
25,02% ││ 0x0000028c5607fc90: movsxd rdi,dword ptr [rax+rcx4+14h]
0,10% ││ 0x0000028c5607fc95: movsxd rsi,dword ptr [rax+rcx4+18h]
8,56% ││ 0x0000028c5607fc9a: movsxd rbp,dword ptr [rax+rcx4+28h]
0,58% ││ 0x0000028c5607fc9f: movsxd r13,dword ptr [rax+rcx4+1ch]
0,41% ││ 0x0000028c5607fca4: movsxd r14,dword ptr [rax+rcx4+20h]
0,20% ││ 0x0000028c5607fca9: movsxd rcx,dword ptr [rax+rcx4+24h]
8,85% ││ 0x0000028c5607fcae: add rdi,rbx
0,38% ││ 0x0000028c5607fcb1: add rsi,rdi
0,15% ││ 0x0000028c5607fcb4: add r13,rsi
8,57% ││ 0x0000028c5607fcb7: add r14,r13
13,76% ││ 0x0000028c5607fcba: add rcx,r14
5,51% ││ 0x0000028c5607fcbd: add rbp,rcx
8,50% ││ 0x0000028c5607fcc0: add r8,rbp ;*ladd
││ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
8,95% ││ 0x0000028c5607fcc3: add r9d,8h ;*iinc
││ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
0,40% ││ 0x0000028c5607fcc7: cmp r9d,r10d
│╰ 0x0000028c5607fcca: jl 28c5607fc80h ;*if_icmpge
│ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@12 (line 24)
↘ 0x0000028c5607fccc: cmp r9d,edx
0x0000028c5607fccf: jnl 28c5607fce4h
0x0000028c5607fcd1: nop ;lload_2
; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
0x0000028c5607fcd4: movsxd r10,dword ptr [rax+r94+10h]
0x0000028c5607fcd9: add r8,r10 ;*ladd
; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
....................................................................................................
**Hottest region on JDK 14**
....[Hottest Region 1]..............................................................................
c2, level 4, codes.dbg.IteratingBenchmark::cStyleForLoop, version 622 (147 bytes)
; - codes.dbg.IteratingBenchmark::cStyleForLoop@23 (line 25)
0x000001e844438f72: mov r11d,r10d
0x000001e844438f75: add r11d,0fffffff9h
0x000001e844438f79: lea rax,[r12+r9*8]
0x000001e844438f7d: mov ebx,1h
0x000001e844438f82: cmp r11d,1h
0x000001e844438f86: jle 1e8444390c0h ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
╭ 0x000001e844438f8c: jmp 1e844438ffah
│ 0x000001e844438f8e: nop
0,04% │↗ 0x000001e844438f90: mov rsi,r8 ;lload_2 {reexecute=0 rethrow=0 return_oop=0}
││ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
0,04% ││ ↗ 0x000001e844438f93: movsxd rdx,dword ptr [rax+rbx4+10h]
8,41% ││ │ 0x000001e844438f98: movsxd rbp,dword ptr [rax+rbx4+14h]
1,23% ││ │ 0x000001e844438f9d: movsxd r13,dword ptr [rax+rbx4+18h]
0,03% ││ │ 0x000001e844438fa2: movsxd r8,dword ptr [rax+rbx4+2ch]
23,87% ││ │ 0x000001e844438fa7: movsxd r11,dword ptr [rax+rbx4+28h]
8,22% ││ │ 0x000001e844438fac: movsxd r9,dword ptr [rax+rbx4+24h]
1,25% ││ │ 0x000001e844438fb1: movsxd rcx,dword ptr [rax+rbx4+20h]
0,14% ││ │ 0x000001e844438fb6: movsxd r14,dword ptr [rax+rbx*4+1ch]
0,28% ││ │ 0x000001e844438fbb: add rdx,rsi
7,82% ││ │ 0x000001e844438fbe: add rbp,rdx
1,14% ││ │ 0x000001e844438fc1: add r13,rbp
0,17% ││ │ 0x000001e844438fc4: add r14,r13
14,57% ││ │ 0x000001e844438fc7: add rcx,r14
11,05% ││ │ 0x000001e844438fca: add r9,rcx
5,26% ││ │ 0x000001e844438fcd: add r11,r9
6,32% ││ │ 0x000001e844438fd0: add r8,r11 ;*ladd {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
8,45% ││ │ 0x000001e844438fd3: add ebx,8h ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
1,15% ││ │ 0x000001e844438fd6: cmp ebx,edi
│╰ │ 0x000001e844438fd8: jl 1e844438f90h ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@12 (line 24)
│ │ 0x000001e844438fda: mov r11,qword ptr [r15+110h] ; ImmutableOopMap {rax=Oop xmm0=Oop xmm1=Oop }
│ │ ;*goto {reexecute=1 rethrow=0 return_oop=0}
│ │ ; - (reexecute) codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
0,00% │ │ 0x000001e844438fe1: test dword ptr [r11],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
│ │ ; {poll}
0,02% │ │ 0x000001e844438fe4: cmp ebx,dword ptr [rsp]
│ ╭│ 0x000001e844438fe7: jnl 1e844439028h
0,00% │ ││ 0x000001e844438fe9: mov rsi,r8
│ ││ 0x000001e844438fec: vmovq r8,xmm0
│ ││ 0x000001e844438ff1: vmovq rdx,xmm1
0,01% │ ││ 0x000001e844438ff6: mov r11d,dword ptr [rsp]
↘ ││ 0x000001e844438ffa: mov ecx,r10d
││ 0x000001e844438ffd: sub ecx,ebx
││ 0x000001e844438fff: add ecx,0fffffff9h
0,00% ││ 0x000001e844439002: mov r9d,1f40h
││ 0x000001e844439008: cmp r9d,ecx
││ 0x000001e84443900b: mov edi,1f40h
││ 0x000001e844439010: cmovnle edi,ecx
0,02% ││ 0x000001e844439013: add edi,ebx
││ 0x000001e844439015: vmovq xmm0,r8
││ 0x000001e84443901a: vmovq xmm1,rdx
││ 0x000001e84443901f: mov dword ptr [rsp],r11d
0,01% │╰ 0x000001e844439023: jmp 1e844438f93h
↘ 0x000001e844439028: vmovq rdx,xmm1
0x000001e84443902d: cmp ebx,r10d
0x000001e844439030: jnl 1e844439043h
0x000001e844439032: nop ;lload_2 {reexecute=0 rethrow=0 return_oop=0}
; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
0x000001e844439034: movsxd r11,dword ptr [rax+rbx4+10h]
0x000001e844439039: add r8,r11 ;*ladd {reexecute=0 rethrow=0 return_oop=0}
; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
0x000001e84443903c: inc ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
....................................................................................................
As we can see, the loop body is compiled similarly on both JDKs:
- 8 loop iterations are unrolled;
- there are 8 loads from the array without bounds check, followed by 8 `add` instructions;
- the order of loads is a bit different, but all the addresses share the same or the adjacent cache line anyway.
The key difference is that on JDK 14 the loop iteration is split into two nested blocks. This is a result of [Loop strip mining](https://bugs.openjdk.java.net/browse/JDK-8186027) optimization appeared in JDK 10. The idea of this optimization is to split the counted loop into the hot inner part without a safepoint poll, and an outer part with a safepoint poll instruction.
C2 JIT transforms the loop into something like
for (int i = 0; i < array.length; i += 8000) {
for (int j = 0; j < 8000; j += 8) {
int ix = i + j;
int v0 = array[ix];
int v1 = array[ix + 1];
...
int v7 = array[ix + 7];
sum += v0 + v1 + ... + v7;
}
safepoint_poll();
}
Note that JDK 8 version does not have a safepoint poll inside the counted loop at all. On one hand, this can make the loop run faster. But on the other hand, this is actually bad for low latency applications, since the pause time may increase by the duration of the entire loop.
JDK 14 inserts a safepoint poll inside the loop. This *might* be a reason of a slow down you observe, but I don't really belive in this, since due to loop strip mining optimization, the safepoint polling is performed just once in 8000 iterations.
To verify this, you may disable safepoint polling with `-XX:-UseCountedLoopSafepoints` JVM option. In this case, JDK 14 compiled version will look almost identical to JDK 8 one. And so will the benchmark scores.
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论