2023年2月14日 00:46:21go评论80阅读模式

英文:

SIMD in AssemblyScript

问题

I can provide a translation of the text, excluding the code part, as requested:

"我在AssemblyScript中创建了一个Box Blur算法。
为了提高效率，我想要使用SIMD操作。
例如，我有以下代码：

如何使用SIMD使这个for循环更快？如果可能的话，也请给我一些解释 :)"

Please note that the code part has been excluded from the translation, as per your request. If you have any further questions or need explanations related to using SIMD in your code, feel free to ask.

英文:

Hey I created a Box Blur algorithm in AssemblyScript.

To make it more efficient, I would like to use SIMD Operations.

For example I have which looks like this:

for(let column: i16 = x + deviationBase + 1; column &lt; x + width - deviationBase; column++){
    r += load&lt;u8&gt;(rowPosition + (column + deviationBase) * 4    )
    g += load&lt;u8&gt;(rowPosition + (column + deviationBase) * 4 + 1) 
    b += load&lt;u8&gt;(rowPosition + (column + deviationBase) * 4 + 2)
    r -= load&lt;u8&gt;(rowPosition + (column - deviationBase) * 4    ) 
    g -= load&lt;u8&gt;(rowPosition + (column - deviationBase) * 4 + 1)
    b -= load&lt;u8&gt;(rowPosition + (column - deviationBase) * 4 + 2)
    
    store&lt;u8&gt;(rowPosition + column * 4    , (r / diameter) as u8)
    store&lt;u8&gt;(rowPosition + column * 4 + 1, (g / diameter) as u8)
    store&lt;u8&gt;(rowPosition + column * 4 + 2, (b / diameter) as u8)
  }

How can I make this for loop faster with SIMD? It would be nice if you could also give me some explanations

答案1

得分: 1

我试图解决这个问题，我的尝试在这里。

因此，处理simd的常规方式，包括AssemblyScript，如下：

从常规寄存器或内存加载数据到simd寄存器
计算矢量化的计算
将数据存储回常规寄存器或内存

AssemblyScript有一个simd数据类型：v128，它可以存储16个u8的值。我们将使用它来进行simd计算。

我使用了这个资源来构建一个框模糊示例。我做出了一些决策，以去除边缘情况，并仅关注simd部分：

我使用了rgba格式，而不是rgb格式。保持幂的方式有助于减少边缘情况，并更适合v128。
我用值为0的像素填充了要模糊的图像，以减少算法的复杂性。现在无需进行边界检查。

现在，关于实际的算法：我实现了三种算法。box_blur_naive、box_blur_improved和box_blur_simd。

在box_blur_naive中，对于每个像素及其邻居，有熟悉的加载和加法。例如，对于红色：

red += load<u8>(img + i * width * 4 + j * 4 + 0);
red += load<u8>(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load<u8>(img + (i + 1) * width * 4 + (j) * 4 + 0);
red += load<u8>(img + (i) * width * 4 + (j - 1) * 4 + 0);
red += load<u8>(img + (i) * width * 4 + (j + 1) * 4 + 0);

基准时间：29.430秒

要对此进行simd化，我们需要稍微重新排列事物。首先，垂直线（i、i - 1、i + 1）非常容易转换为simd指令。问题在于，使用v128很难将水平邻居相加，因为它们都在同一个寄存器中结束。

幸运的是，对于框模糊来说，有一种简单的方法可以将水平和垂直相加分离，借助一个转置后的图像，这就是box_blur_improved的作用：

red += load<u8>(img + i * width * 4 + j * 4 + 0);
red += load<u8>(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load<u8>(img + (i + 1) * width * 4 + (j) * 4 + 0);
...
red += load<u8>(transposed_img + (i - 1) * height * 4 + (j) * 4 + 0);
red += load<u8>(transposed_img + (i + 1) * height * 4 + (j) * 4 + 0);

基准时间：30.225秒

现在，我们只有垂直相加，所以我们终于可以开始使用v128。将数据加载到v128中：

line = v128.load(img + i * width * 4 + 4 + k * 16);
line_before = v128.load(img + (i - 1) * width * 4 + 4 + k * 16);
line_after = v128.load(img + (i + 1) * width * 4 + 4 + k * 16);

这与之前的load执行相同的操作，但一次加载16个u8值。现在我们可以执行加法操作：

let result = v128.add<u8>(line, v128.add<u8>(line_before, line_after));

这一行执行了之前的所有垂直相加。我们尚未添加转置的相加，因为我将很快解释一个未来的问题。

v128.store(blurred_img + i * width * 4 + 4 + k * 16, result);

这将结果存储在指定地址。这就是全部。

基准时间：17.637秒

simd解决方案似乎可以节省约一半的时间，但代码还没有完全simd化。

最后的问题是，使用v128（以及simd一般）很难进行整数除法。

考虑我们在第一个算法中将数据存储在内存中的方式：

(load<u8>(blurred_img + j * width * 4 + i * 4 + 0) + red) / 5) as u8

我们必须进行除以5的操作，这不是v128上的操作。可能有一些使用位移和减法进行除法的方法。从我们已经获得的速度增益来看，这可能值得一试。

英文:

I tried to tackle this problem, my attempt is here.

So, the usual way of dealing with simd, AssemblyScript being no different, is the following:

load the data into simd registers from regular registers or memory
compute the vectorized calculations
store the data back into regular registers or memory

AssemblyScript has one simd datatype: v128, which can store 16 u8's. We will use it in order to do simd calculations.

I built a box blur example using this resource. I took some decisions in order to remove edge-cases and focus only on the simd part:

I used the rgba format, as opposed to the rgb format. Keeping things as powers of two helps reduce edge-cases, and better fits for the v128.
I padded the image that I am blurring with pixels with values of 0, in order to reduce the complexity of the algorithm. There is no need for bounds checking now.

Now, for the actual algorithms: I implemented three algorithms. The box_blur_naive, box_blur_improved and box_blur_simd.

In the box_blur_naive there is the familiar loading and adding for every pixel and its neighbours. For example, for the red color:

red += load&lt;u8&gt;(img + i * width * 4 + j * 4 + 0);
red += load&lt;u8&gt;(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load&lt;u8&gt;(img + (i + 1) * width * 4 + (j) * 4 + 0);
red += load&lt;u8&gt;(img + (i) * width * 4 + (j - 1) * 4 + 0);
red += load&lt;u8&gt;(img + (i) * width * 4 + (j + 1) * 4 + 0);

Benchmarking time: 29.430s

In order to simdize this, we need to rearrange things a bit. First, the vertical lines (i, i - 1, i + 1) are very easily convertible to simd instructions. The problem is that there is no easy way to add the horizontal neighbours with v128, because they all end up in the same register.

Fortunately, there is an easy way for box blur to separate the horizontal and vertical additions, with the help of a transposed image, which is what box_blur_improved does:

red += load&lt;u8&gt;(img + i * width * 4 + j * 4 + 0);
red += load&lt;u8&gt;(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load&lt;u8&gt;(img + (i + 1) * width * 4 + (j) * 4 + 0);
...
red += load&lt;u8&gt;(transposed_img + (i - 1) * height * 4 + (j) * 4 + 0);
red += load&lt;u8&gt;(transposed_img + (i + 1) * height * 4 + (j) * 4 + 0);

Benchmarking time: 30.225s

Now, we only have vertical additions, so we can finally start bringing in v128. Getting the data into v128:

line = v128.load(img + i * width * 4 + 4 + k * 16);
line_before = v128.load(img + (i - 1) * width * 4 + 4 + k * 16);
line_after = v128.load(img + (i + 1) * width * 4 + 4 + k * 16);

This does the same thing as the loads before, but for 16 u8 values at a single time. We can now perform the additions:

let result = v128.add&lt;u8&gt;(line, v128.add&lt;u8&gt;(line_before, line_after));

This line performs all the vertical additions from before. We have not added the transposed additions, because of a future problem that I will explain shortly.

v128.store(blurred_img + i * width * 4 + 4 + k * 16, result);

This stores the result at the specified address. And that is all.

Benchmarking time: 17.637s

The simd solutions seems to save about half the time, and the code is not fully simdized.

The last problem is that there is no easy way to do integer division in an easy way with v128 (and simd in general).

Consider the way in which we store the data in memory in the first algorithms:

(load&lt;u8&gt;(blurred_img + j * width * 4 + i * 4 + 0) + red) / 5) as u8

We have to do division by 5, which is not an operation on v128. There may be some ways of doing the division by using bit shifts and subtractions. Judging from the speed increase we already obtained, it may be worthwhile to do it.

答案2

得分: 1

除了在 https://stackoverflow.com/a/75494927/11042887 中回答的内容之外，您可以通过标量来模拟i8x16除法，如下所示：

export function i8x16_fastdiv(x: v128, d: u8): v128 {
    let n = u8(23 - clz((u32(d) | 1) << 16)); // 找到移位因子
    // 设置8个“反向除法”单词：
    let inv = i16x8.splat((1 << n) / d + 1);
    // 拆分成两个16位值
    let lo = i16x8.extend_low_i8x16_u(x);
    let hi = i16x8.extend_high_i8x16_u(x);
    // 用“反向除法”和除法相乘
    lo = i16x8.shr_u(i16x8.mul(lo, inv), n);
    hi = i16x8.shr_u(i16x8.mul(hi, inv), n);
    // 重新打包
    return i8x16.narrow_i16x8_u(lo, hi);
}

但请注意，在Firefox上可能比使用提取+替换通道的标量除法版本更快，而在Chrome上可能更慢。在性能方面，Chrome在使用SIMD方面的表现仍然相当不可预测。

英文:

In addition to answered in https://stackoverflow.com/a/75494927/11042887 you can emulate i8x16 division by scalar as:

export function i8x16_fastdiv(x: v128, d: u8): v128 {
    let n = u8(23 - clz((u32(d) | 1) &lt;&lt; 16)); // find shift factor
    // set 8 words of &quot;inverse div&quot;:
    let inv = i16x8.splat((1 &lt;&lt; n) / d + 1);
    // unpack into two 16-bit values
    let lo = i16x8.extend_low_i8x16_u(x);
    let hi = i16x8.extend_high_i8x16_u(x);
    // mul with &quot;inverse div&quot; and div
    lo = i16x8.shr_u(i16x8.mul(lo, inv), n);
    hi = i16x8.shr_u(i16x8.mul(hi, inv), n);
    // repack
    return i8x16.narrow_i16x8_u(lo, hi);
}

But note it may be faster on Firefox while slower on Chrome than scalar division version with extract + replace lanes. Chrome still behaves rather unpredictably with SIMD in term of performance.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

SIMD在AssemblyScript中

问题

答案1

答案2

WASM（不使用Blazor）对于大型数字数组非常慢。

_mm_comieq_ss 在 Clang 和 GCC 之间的区别

如何检查寄存器是否包含零字节而不使用SIMD指令

The fastest way to convert a UInt64 hex string to a UInt32 value preserving as many leading digits as possible, i.e. truncation

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。