May 17, 2023, 20:07:11

Are these two for loops equivalent?

Question

I'm working with a convolution and, in particular, I'm trying to speed up its execution.
To get this speedup I'm using a SIMD instruction that performs two multiplications at the same time: one result is placed in the upper 32 bits of a 64-bit variable and the other in the lower 32 bits.
The problem is that the new code does not seem to behave like the old one.

The initial code contains this for loop:

 int32_t v32;
 int16_t arr_2[1024];
 int16_t data[96];
 int32_t accu;
  ...
        for(int j=0; j<INPUT_F; j++){
          v32 = arr_2[l*OUT_F+j]*data[k*K*INPUT_F+(l-i+K/2)*INPUT_F+j];
          accu += v32;
        }
  ...

The question is: apart from the multiplication function, are the other operations in the new loop below equivalent, or am I doing something wrong?

 uint64_t v64; 
 int16_t arr_2[1024];
 int16_t data[96];
 int32_t accu;
 ...
        for(int j=0; j<INPUT_F/2; j++){
          v64 = __mul(arr_2[l*OUT_F+2*j],data[k*K*INPUT_F+(l-i+K/2)*INPUT_F+2*j]); // use a SIMD instruction to multiply two consecutive values in the arrays
          accu += (int32_t)(v64 & 0xFFFFFFFF); // first value
          accu += (int32_t)((v64 >> 32) & 0xFFFFFFFF); // second value
        }
 ...

__mul() is defined as uint64_t __mul(uint32_t a, uint32_t b); even though its operands are uint32_t, it treats each operand as two int16_t values internally.

Answer 1

Score: 4

> [From a comment] I thought that, having declared a and b as uint32_t, when I pass the argument with that index it would take 32 consecutive bits (that's why I used 2*j)

Functions do not “take” things from the environment where they are called.

When a parameter has type uint32_t, that means an argument passed for that parameter will be converted to the type uint32_t. It does not mean 32 bits will be pulled from wherever the argument comes from.
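As a small illustration of that conversion, here is a sketch in which `show_param` and `pass_element` are hypothetical helpers (they are not part of the question's code). The function only ever sees the converted value of the single element named in the call:

```c
#include <stdint.h>

/* Hypothetical stand-in for any function with a uint32_t parameter:
   it returns what it received, so the conversion is visible. */
static uint32_t show_param(uint32_t a)
{
    return a;
}

/* Passes ONE int16_t element. Its value is converted to uint32_t
   (a negative value wraps modulo 2^32). The neighbouring element
   arr[index + 1] is never read. */
static uint32_t pass_element(const int16_t *arr, int index)
{
    return show_param(arr[index]);
}
```

With an array holding -1 followed by 7, `pass_element(arr, 0)` yields 0xFFFFFFFF, which is the converted value of -1 alone; the 7 plays no part.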

In C, expressions are formed from subexpressions and their operands, and each operand and subexpression is evaluated based on its type, not the type of the enclosing expression.

In __mul(array_2[l*OUT_FEA+2*j],weights[k*CONV_K*INPUT_FEA+(l-i+CONV_K/2)*INPUT_FEA+2*j]), array_2[l*OUT_FEA+2*j] has type int16_t because array_2 is declared as an array of int16_t elements. So the index l*OUT_FEA+2*j is calculated and used to look up an element in the array. That single int16_t element is taken and is passed for the a parameter of __mul. Since that parameter has type uint32_t, the single int16_t value is converted to the type uint32_t.

Nothing in this code causes two elements of array_2 to be fetched or used.

These are fundamental aspects of C, and it is futile to attempt SIMD programming in C without understanding them.

To pass to __mul a uint32_t value that contains the bits of two int16_t elements, you must fetch two int16_t elements. There are multiple ways to do this in C. One would be to fetch two elements (by writing them as separate operands in an expression) and combine them using conversions and bit-shifting. However, when we are trying to accelerate performance using SIMD, we generally want to avoid separate fetches of separate elements. (Optimization by the compiler might combine separate fetches into a single fetch, but relying on this requires additional knowledge and considerations beyond the scope of this answer.)
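The fetch-and-combine approach might look like the sketch below. Everything in it is an assumption made for illustration: `pack2_i16` places the first element in the low 16 bits (the real `__mul` may expect the opposite order), and `mul2_model` is a hypothetical portable model of a dual 16x16 multiply, not the actual intrinsic.

```c
#include <stdint.h>

/* Packs two int16_t values into one uint32_t.
   ASSUMPTION: first element in the low half, second in the high half. */
static uint32_t pack2_i16(int16_t first, int16_t second)
{
    return (uint32_t)(uint16_t)first | ((uint32_t)(uint16_t)second << 16);
}

/* Hypothetical software model of a dual signed 16x16 multiply:
   low halves multiplied into bits 0..31, high halves into bits 32..63.
   The casts to int16_t rely on two's complement wrap-around, which is
   implementation-defined before C23 but holds on common compilers. */
static uint64_t mul2_model(uint32_t a, uint32_t b)
{
    int32_t lo = (int32_t)(int16_t)(a & 0xFFFFu) * (int16_t)(b & 0xFFFFu);
    int32_t hi = (int32_t)(int16_t)(a >> 16) * (int16_t)(b >> 16);
    return (uint64_t)(uint32_t)lo | ((uint64_t)(uint32_t)hi << 32);
}

/* Dot product over n elements (n assumed even), two products per step.
   Both elements of each pair are fetched explicitly. */
static int32_t dot_pairs(const int16_t *x, const int16_t *y, int n)
{
    int32_t accu = 0;
    for (int j = 0; j < n / 2; j++) {
        uint64_t v64 = mul2_model(pack2_i16(x[2 * j], x[2 * j + 1]),
                                  pack2_i16(y[2 * j], y[2 * j + 1]));
        accu += (int32_t)(uint32_t)(v64 & 0xFFFFFFFFu); /* first product  */
        accu += (int32_t)(uint32_t)(v64 >> 32);         /* second product */
    }
    return accu;
}
```

Comparing `dot_pairs` against the original scalar loop on the same inputs is one way to check that the packing order matches what the multiply expects before swapping in the real instruction.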

To that end, it is common in SIMD code to access an array of int16_t elements using an lvalue of type uint32_t. However, this requires additional considerations of the rules of C, notably rules about aliasing types and about alignment. It is necessary to ensure that array_2 and weights are correctly aligned for the uint32_t type (or that we write code that adapts to whatever alignment they have) and that either we make arrangements to alias the array using the uint32_t in accordance with the rules of the C compiler or the compiler provides assurances beyond the C standard that it supports the aliasing.
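One arrangement that stays within the C standard is to copy the bytes with `memcpy` instead of dereferencing a `uint32_t` pointer. The sketch below uses a hypothetical helper name, `load2_i16`. Which element lands in the low half of the result depends on the target's byte order, and whether the copy compiles down to a single 32-bit load depends on the compiler and optimization level.

```c
#include <stdint.h>
#include <string.h>

/* Reads the 32 bits that hold p[0] and p[1] as one uint32_t.
   memcpy avoids both the aliasing problem and any alignment
   requirement of uint32_t. On a little-endian target p[0] ends up
   in the low 16 bits; on a big-endian target it is the high 16. */
static uint32_t load2_i16(const int16_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

A call such as `load2_i16(&array_2[l*OUT_FEA + 2*j])` would then produce the kind of value the question intended to pass, provided the index leaves at least two elements before the end of the array.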

Explaining these things goes beyond the scope of a simple Stack Overflow answer. They are prerequisites that should be learned when or before starting SIMD programming.
