2023年7月4日 22:50:07go评论66阅读模式

英文:

ARM NEON: why is vector code slower than scalar?

问题

I am working with assembly for ARM NEON, and I came to a sequence of unrolled instructions that has actually double the execution time when compared to an equivalent scalar loop. I am actually working with intrinsics, but the code that is produced is the one shown below.

In this sequence, eight results are obtained.

	...
	vmov	  r10, r11, d18
	vld1.32	  {d21}, [r10]
	vadd.i32  d21, d21, d20
	vst1.32	  {d21}, [r10]
	vld1.32	  {d21}, [r11]
	vadd.i32  d21, d21, d20
	vst1.32	  {d21}, [r11]
	vmov	  r10, r11, d19
	...
	vmov	  r10, r11, d16
	...
	vmov	  r10, r11, d17
	...

The scalar loop is composed of six instructions with one result per iteration:

loop:
 	ldr.w	r1, [r2], #4
 	ldr.w	r3, [r4, r1, lsl #2]
      	adds	r3, #1
 	str.w	r3, [r4, r1, lsl #2]
      	cmp	r0, r2
      	bhi.n	118 &lt;loop&gt;

According to my naive assumptions, in the vector sequence I spend roughly three instructions per result, whereas in the scalar sequence, there are six instructions. That would lead to a 2X decrease in the processing time, but instead, I've got a 2X increase. Even if I make the vector sequence four or eight times higher, it is still twice slower than the scalar sequence.

I read through "DEN0018A_neon_programmers_guide" and a couple of TRMs for Cortex-A processors, and my guess is that three factors might be impacting the performance of the vector sequence:

ARM and NEON registers are being moved frequently
there might be cache problems with the pattern of memory accesses that affect more NEON than ARM
there might be some problems with both ARM and NEON pipelines with the LOAD/STORE sequence

Since I have just started dealing with the NEON architecture, these might be way off the real problem. So, I would appreciate pointers that help to produce effective intrinsic code for NEON.

Thanks,
Julio de Melo

Tools: arm-linux-gnueabihf-gcc-8.2.1

英文:

In this sequence, eight results are obtained.

	...
	vmov	  r10, r11, d18
	vld1.32	  {d21}, [r10]
	vadd.i32  d21, d21, d20
	vst1.32	  {d21}, [r10]
	vld1.32	  {d21}, [r11]
	vadd.i32  d21, d21, d20
	vst1.32	  {d21}, [r11]
	vmov	  r10, r11, d19
	...
	vmov	  r10, r11, d16
	...
	vmov	  r10, r11, d17
	...

The scalar loop is composed of six instructions with one result per iteration:

loop:
 	ldr.w	r1, [r2], #4
 	ldr.w	r3, [r4, r1, lsl #2]
      	adds	r3, #1
 	str.w	r3, [r4, r1, lsl #2]
      	cmp	r0, r2
      	bhi.n	118 &lt;loop&gt;

According to my naïve assumptions, in the vector sequence I spend roughly three instructions per result, whereas in the scalar sequence there are six instructions. That would lead to a 2X decrease in the processing time, but instead I've got a 2X increase. Even if I make the vector sequence four or eight times higher, it is still twice slower than the scalar sequence.

I read through "DEN0018A_neon_programmers_guide" and a couple of TRMs for Cortex-A processors, and my guess is that three factors might be impacting the performance of the vector sequence:

ARM and NEON registers are being moved frequently
there might be cache problems with the pattern of memory accesses that affect more NEON than ARM
there might be some problems with both ARM and NEON pipelines with the LOAD/STORE sequence

Since I have just started dealing with the NEON architecture, these might be way off the real problem. So, I would appreciate for pointers that help to producing effective intrinsic code for NEON.

Thanks,
Julio de Melo

Tools: arm-linux-gnueabihf-gcc-8.2.1

答案1

得分: 1

Speeding up histograms with SIMD is challenging and application-specific. On a small number of histogram bins, you can accelerate the naive version by using multiple accumulators, adjusting them based on input data. The provided code increases 8-bit histograms with just 32 elements. After 255 rounds of this method, you can widen the accumulation from uint8x16_t hist0to15 and uint8x16_t hist16to31 to uint16x8x4_t hist0to31, with c0to15 containing {0,1,2,3,4,...15}. To optimize, interleave these blocks to reduce load instruction latency and arithmetic operation latency. Depending on the application, you can use this method with multiple passes, similar to radix sort.

Scatter store/load instructions won't help parallelize histogram accumulation unless the SIMD instruction set offers a quick way to remove/count duplicate entries or compress them into a stream of tuples like offset, increment. There have been proposals for accelerating histogram operations with SIMD instructions, but I'm not aware of any implementations.

英文:

Speeding up histograms is very difficult (and application dependent) with SIMD.

On a very limited number of histogram bins one can speed up the naive version by creating multiple accumulators, which are conditionally increased based on input data.

This loop will increase the 8-bit histograms having just 32 elements.

    auto a = vdupq_n_u8(*input++);
    auto a0 = vceqq_u8(a, c0to15);
    auto a1 = vceqq_u8(a, c16to31);
    hist0to15 = vsubq_u8(hist0to15, a0);
    hist16to31 = vsubq_u8(hist16to31, a1);

After 255 rounds of calling this method, one can then do a widening accumulation of uint8x16_t hist0to15, hist16to31 to uint16x8x4_t hist0to31, with c0to15 having the content of {0,1,2,3,4,...15}.

One can/should interleave a few of these blocks to mitigate the ~7 cycle latency in load instructions and the 2 cycle latency between the basic arithmetic operations.

Depending on the application (like trying to find just Nth percentile of a histogram) one can use this method with multiple passes as in radix sort.

Scatter store/load instructions won't help in parallelising histogram accumulation either, unless the SIMD instruction set also has a quick method to remove/count the duplicate entries (and either compact/compress this kind of stream of tuples of offset, increment or mask off work done on duplicate lanes). I've seen some papers/proposals for histogram accelerating SIMD instructions, but I'm not aware of any implementation.

input     = [123 | 54 | 0 | 32 | 78 | 32 | 0 | 0 ]

--&gt;         offset 123, increment = 1
--&gt;         offset 54, increment = 1
--&gt;         offset 0, increment = 3
--&gt;         offset 32, increment = 2
--&gt;         offset 78, increment = 1

答案2

得分: 0

感谢Nate和artless的评论。是的，Nate，你说得对，这只是一个基本的直方图循环。正如你们两位提到的，这无法进行向量化处理，问题解决。再次感谢。

英文:

Thanks, both Nate and artless for the comments. Yes, Nate, you´re right, it's just a basic histogram loop. As you both mention, it won't help vectorize it, case closed. Thanks, again.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

ARM NEON：为什么向量代码比标量代码慢？

问题

答案1

答案2

为什么gcc在条件乘法的std::vector向量化方面要比clang差得多？

问题在于在Mac上运行ScalaZ3时存在冲突的体系结构。

如何将数据框中的每一列分别乘以不同的值每列。

在多维数组上映射一个函数

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论