ARM NEON:为什么向量代码比标量代码慢?

huangapple go评论61阅读模式
英文:

ARM NEON: why is vector code slower than scalar?

问题

I am working with assembly for ARM NEON, and I came to a sequence of unrolled instructions that has actually double the execution time when compared to an equivalent scalar loop. I am actually working with intrinsics, but the code that is produced is the one shown below.

In this sequence, eight results are obtained.

	...
	vmov	  r10, r11, d18
	vld1.32	  {d21}, [r10]
	vadd.i32  d21, d21, d20
	vst1.32	  {d21}, [r10]
	vld1.32	  {d21}, [r11]
	vadd.i32  d21, d21, d20
	vst1.32	  {d21}, [r11]
	vmov	  r10, r11, d19
	...
	vmov	  r10, r11, d16
	...
	vmov	  r10, r11, d17
	...

The scalar loop is composed of six instructions with one result per iteration:

loop:
 	ldr.w	r1, [r2], #4
 	ldr.w	r3, [r4, r1, lsl #2]
      	adds	r3, #1
 	str.w	r3, [r4, r1, lsl #2]
      	cmp	r0, r2
      	bhi.n	118 <loop>

According to my naive assumptions, in the vector sequence I spend roughly three instructions per result, whereas in the scalar sequence, there are six instructions. That would lead to a 2X decrease in the processing time, but instead, I've got a 2X increase. Even if I make the vector sequence four or eight times higher, it is still twice slower than the scalar sequence.

I read through "DEN0018A_neon_programmers_guide" and a couple of TRMs for Cortex-A processors, and my guess is that three factors might be impacting the performance of the vector sequence:

  1. ARM and NEON registers are being moved frequently
  2. there might be cache problems with the pattern of memory accesses that affect more NEON than ARM
  3. there might be some problems with both ARM and NEON pipelines with the LOAD/STORE sequence

Since I have just started dealing with the NEON architecture, these might be way off the real problem. So, I would appreciate pointers that help to produce effective intrinsic code for NEON.

Thanks,
Julio de Melo

Tools: arm-linux-gnueabihf-gcc-8.2.1

英文:

I am working with assembly for ARM NEON, and I came to a sequence of unrolled instructions that has actually double the execution time when compared to an equivalent scalar loop. I am actually working with intrinsics, but the code that is produced is the one shown below.

In this sequence, eight results are obtained.

	...
	vmov	  r10, r11, d18
	vld1.32	  {d21}, [r10]
	vadd.i32  d21, d21, d20
	vst1.32	  {d21}, [r10]
	vld1.32	  {d21}, [r11]
	vadd.i32  d21, d21, d20
	vst1.32	  {d21}, [r11]
	vmov	  r10, r11, d19
	...
	vmov	  r10, r11, d16
	...
	vmov	  r10, r11, d17
	...

The scalar loop is composed of six instructions with one result per iteration:

loop:
 	ldr.w	r1, [r2], #4
 	ldr.w	r3, [r4, r1, lsl #2]
      	adds	r3, #1
 	str.w	r3, [r4, r1, lsl #2]
      	cmp	r0, r2
      	bhi.n	118 <loop>

According to my naïve assumptions, in the vector sequence I spend roughly three instructions per result, whereas in the scalar sequence there are six instructions. That would lead to a 2X decrease in the processing time, but instead I've got a 2X increase. Even if I make the vector sequence four or eight times higher, it is still twice slower than the scalar sequence.

I read through "DEN0018A_neon_programmers_guide" and a couple of TRMs for Cortex-A processors, and my guess is that three factors might be impacting the performance of the vector sequence:

  1. ARM and NEON registers are being moved frequently
  2. there might be cache problems with the pattern of memory accesses that affect more NEON than ARM
  3. there might be some problems with both ARM and NEON pipelines with the LOAD/STORE sequence

Since I have just started dealing with the NEON architecture, these might be way off the real problem. So, I would appreciate for pointers that help to producing effective intrinsic code for NEON.

Thanks,
Julio de Melo

Tools: arm-linux-gnueabihf-gcc-8.2.1

答案1

得分: 1

Speeding up histograms with SIMD is challenging and application-specific. On a small number of histogram bins, you can accelerate the naive version by using multiple accumulators, adjusting them based on input data. The provided code increases 8-bit histograms with just 32 elements. After 255 rounds of this method, you can widen the accumulation from uint8x16_t hist0to15 and uint8x16_t hist16to31 to uint16x8x4_t hist0to31, with c0to15 containing {0,1,2,3,4,...15}. To optimize, interleave these blocks to reduce load instruction latency and arithmetic operation latency. Depending on the application, you can use this method with multiple passes, similar to radix sort.

Scatter store/load instructions won't help parallelize histogram accumulation unless the SIMD instruction set offers a quick way to remove/count duplicate entries or compress them into a stream of tuples like offset, increment. There have been proposals for accelerating histogram operations with SIMD instructions, but I'm not aware of any implementations.

英文:

Speeding up histograms is very difficult (and application dependent) with SIMD.

On a very limited number of histogram bins one can speed up the naive version by creating multiple accumulators, which are conditionally increased based on input data.

This loop will increase the 8-bit histograms having just 32 elements.

    auto a = vdupq_n_u8(*input++);
    auto a0 = vceqq_u8(a, c0to15);
    auto a1 = vceqq_u8(a, c16to31);
    hist0to15 = vsubq_u8(hist0to15, a0);
    hist16to31 = vsubq_u8(hist16to31, a1);

After 255 rounds of calling this method, one can then do a widening accumulation of uint8x16_t hist0to15, hist16to31 to uint16x8x4_t hist0to31, with c0to15 having the content of {0,1,2,3,4,...15}.

One can/should interleave a few of these blocks to mitigate the ~7 cycle latency in load instructions and the 2 cycle latency between the basic arithmetic operations.

Depending on the application (like trying to find just Nth percentile of a histogram) one can use this method with multiple passes as in radix sort.

Scatter store/load instructions won't help in parallelising histogram accumulation either, unless the SIMD instruction set also has a quick method to remove/count the duplicate entries (and either compact/compress this kind of stream of tuples of offset, increment or mask off work done on duplicate lanes). I've seen some papers/proposals for histogram accelerating SIMD instructions, but I'm not aware of any implementation.

input     = [123 | 54 | 0 | 32 | 78 | 32 | 0 | 0 ]

-->         offset 123, increment = 1
-->         offset 54, increment = 1
-->         offset 0, increment = 3
-->         offset 32, increment = 2
-->         offset 78, increment = 1

答案2

得分: 0

感谢Nate和artless的评论。是的,Nate,你说得对,这只是一个基本的直方图循环。正如你们两位提到的,这无法进行向量化处理,问题解决。再次感谢。

英文:

Thanks, both Nate and artless for the comments. Yes, Nate, you´re right, it's just a basic histogram loop. As you both mention, it won't help vectorize it, case closed. Thanks, again.

huangapple
  • 本文由 发表于 2023年7月4日 22:50:07
  • 转载请务必保留本文链接:https://go.coder-hub.com/76613808.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定