ARM NEON: why is vector code slower than scalar?
Question
I am working with assembly for ARM NEON, and I came across a sequence of unrolled instructions that actually takes double the execution time of an equivalent scalar loop. I am writing intrinsics, but the code the compiler produces is shown below.
In this sequence, eight results are obtained.
...
vmov r10, r11, d18
vld1.32 {d21}, [r10]
vadd.i32 d21, d21, d20
vst1.32 {d21}, [r10]
vld1.32 {d21}, [r11]
vadd.i32 d21, d21, d20
vst1.32 {d21}, [r11]
vmov r10, r11, d19
...
vmov r10, r11, d16
...
vmov r10, r11, d17
...
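For context, the kind of per-lane intrinsic pattern that compiles into a sequence like the one above looks roughly like the following sketch (simplified, with made-up names; not my literal source):
#include <arm_neon.h>
#include <stdint.h>

/* Simplified sketch only: each pointer lane is extracted to a core register
 * (the vmov r10, r11, dN above), then updated with a 64-bit NEON
 * load/add/store, which is effectively a scalar memory access per lane. */
static inline void update_two_lanes(uint32x2_t addrs, uint32x2_t inc)
{
    uint32_t *p0 = (uint32_t *)(uintptr_t)vget_lane_u32(addrs, 0);
    uint32_t *p1 = (uint32_t *)(uintptr_t)vget_lane_u32(addrs, 1);

    uint32x2_t v0 = vld1_u32(p0);        /* vld1.32 {dN}, [r10] */
    vst1_u32(p0, vadd_u32(v0, inc));     /* vadd.i32 + vst1.32  */

    uint32x2_t v1 = vld1_u32(p1);        /* vld1.32 {dN}, [r11] */
    vst1_u32(p1, vadd_u32(v1, inc));
}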
The scalar loop is composed of six instructions with one result per iteration:
loop:
ldr.w r1, [r2], #4
ldr.w r3, [r4, r1, lsl #2]
adds r3, #1
str.w r3, [r4, r1, lsl #2]
cmp r0, r2
bhi.n 118 <loop>
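For reference, the scalar loop above corresponds to a plain indexed-increment (histogram-style) update along these lines (a rough C equivalent with placeholder names):
#include <stdint.h>

/* Rough C equivalent of the scalar loop above (placeholder names):
 * r2 walks the index array, r4 holds the table base, r0 marks the end. */
void scalar_update(uint32_t *table, const uint32_t *idx, const uint32_t *idx_end)
{
    while (idx < idx_end) {
        uint32_t i = *idx++;   /* ldr.w r1, [r2], #4            */
        table[i] += 1;         /* ldr.w / adds #1 / str.w on r3 */
    }
}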
According to my naive assumptions, the vector sequence spends roughly three instructions per result, whereas the scalar sequence spends six. That should lead to a 2X decrease in processing time, but instead I got a 2X increase. Even if I unroll the vector sequence four or eight times further, it is still twice as slow as the scalar loop.
I read through "DEN0018A_neon_programmers_guide" and a couple of TRMs for Cortex-A processors, and my guess is that three factors might be impacting the performance of the vector sequence:
- data is moved frequently between ARM and NEON registers
- the memory access pattern might cause cache problems that affect NEON more than ARM
- the load/store sequence might cause stalls in both the ARM and NEON pipelines
Since I have just started working with the NEON architecture, these guesses might be way off the real problem, so I would appreciate any pointers that help produce effective intrinsic code for NEON.
Thanks,
Julio de Melo
Tools: arm-linux-gnueabihf-gcc-8.2.1
Answer 1
Score: 1
Speeding up histograms with SIMD is very difficult (and application dependent).
With a very limited number of histogram bins, one can speed up the naive version by keeping multiple accumulators that are conditionally incremented based on the input data.
The loop below updates 8-bit counters for a histogram with just 32 bins.
auto a  = vdupq_n_u8(*input++);        // broadcast the next input byte to all 16 lanes
auto a0 = vceqq_u8(a, c0to15);         // lanes whose bin index matches become 0xFF (i.e. -1)
auto a1 = vceqq_u8(a, c16to31);
hist0to15  = vsubq_u8(hist0to15, a0);  // subtracting -1 increments the matching lane by one
hist16to31 = vsubq_u8(hist16to31, a1);
After 255 rounds of calling this method, one can then do a widening accumulation of uint8x16_t hist0to15, hist16to31 into uint16x8x4_t hist0to31, with c0to15 having the content {0,1,2,3,4,...,15}.
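One possible shape for that widening step looks like the sketch below (an illustration, not necessarily the exact code):
#include <arm_neon.h>

/* Illustration of the widening accumulation: fold the 8-bit per-bin counters
 * into 16-bit bins after (at most) 255 rounds, then reset the 8-bit counters. */
static inline void widen_accumulate(uint16x8x4_t *hist0to31,
                                    uint8x16_t hist0to15, uint8x16_t hist16to31)
{
    hist0to31->val[0] = vaddw_u8(hist0to31->val[0], vget_low_u8(hist0to15));   /* bins 0..7   */
    hist0to31->val[1] = vaddw_u8(hist0to31->val[1], vget_high_u8(hist0to15));  /* bins 8..15  */
    hist0to31->val[2] = vaddw_u8(hist0to31->val[2], vget_low_u8(hist16to31));  /* bins 16..23 */
    hist0to31->val[3] = vaddw_u8(hist0to31->val[3], vget_high_u8(hist16to31)); /* bins 24..31 */
}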
One can/should interleave a few of these blocks to mitigate the ~7 cycle latency in load instructions and the 2 cycle latency between the basic arithmetic operations.
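As a sketch of that interleaving (an illustration only, assuming the c0to15/c16to31 constants described above):
#include <arm_neon.h>
#include <stdint.h>

/* Two independent accumulator sets so the compare/subtract chains of
 * consecutive input bytes can overlap in the pipeline (illustration only). */
static void accumulate_block(const uint8_t *input,
                             uint8x16_t c0to15, uint8x16_t c16to31,
                             uint8x16_t *hist0to15, uint8x16_t *hist16to31)
{
    uint8x16_t a0 = vdupq_n_u8(0), a1 = vdupq_n_u8(0);   /* set A */
    uint8x16_t b0 = vdupq_n_u8(0), b1 = vdupq_n_u8(0);   /* set B */

    for (int round = 0; round < 254; round += 2) {
        uint8x16_t va = vdupq_n_u8(input[round]);
        uint8x16_t vb = vdupq_n_u8(input[round + 1]);
        a0 = vsubq_u8(a0, vceqq_u8(va, c0to15));
        a1 = vsubq_u8(a1, vceqq_u8(va, c16to31));
        b0 = vsubq_u8(b0, vceqq_u8(vb, c0to15));
        b1 = vsubq_u8(b1, vceqq_u8(vb, c16to31));
    }
    /* Merge B into A before the widening step: each set saw at most 127
     * matches per lane, so the 8-bit lanes cannot overflow (127 + 127 <= 255). */
    *hist0to15  = vaddq_u8(a0, b0);
    *hist16to31 = vaddq_u8(a1, b1);
}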
Depending on the application (like trying to find just the Nth percentile of a histogram), one can use this method with multiple passes, as in radix sort.
Scatter store/load instructions won't help in parallelising histogram accumulation either, unless the SIMD instruction set also has a quick method to remove/count the duplicate entries (and either compress that kind of input stream into tuples of (offset, increment), as in the example below, or mask off the work done on duplicate lanes). I've seen some papers/proposals for histogram-accelerating SIMD instructions, but I'm not aware of any implementations.
input = [123 | 54 | 0 | 32 | 78 | 32 | 0 | 0 ]
--> offset 123, increment = 1
--> offset 54, increment = 1
--> offset 0, increment = 3
--> offset 32, increment = 2
--> offset 78, increment = 1
Answer 2
Score: 0
Thanks to both Nate and artless for the comments. Yes, Nate, you're right, it's just a basic histogram loop. As you both mention, vectorizing it won't help, so case closed. Thanks again.