SIMD在AssemblyScript中

huangapple go评论48阅读模式
英文:

SIMD in AssemblyScript

问题

I can provide a translation of the text, excluding the code part, as requested:

"我在AssemblyScript中创建了一个Box Blur算法。
为了提高效率,我想要使用SIMD操作。
例如,我有以下代码:

如何使用SIMD使这个for循环更快?如果可能的话,也请给我一些解释 :)"

Please note that the code part has been excluded from the translation, as per your request. If you have any further questions or need explanations related to using SIMD in your code, feel free to ask.

英文:

Hey I created a Box Blur algorithm in AssemblyScript.

To make it more efficient, I would like to use SIMD Operations.

For example I have which looks like this:

for(let column: i16 = x + deviationBase + 1; column < x + width - deviationBase; column++){
    r += load<u8>(rowPosition + (column + deviationBase) * 4    )
    g += load<u8>(rowPosition + (column + deviationBase) * 4 + 1) 
    b += load<u8>(rowPosition + (column + deviationBase) * 4 + 2)
    r -= load<u8>(rowPosition + (column - deviationBase) * 4    ) 
    g -= load<u8>(rowPosition + (column - deviationBase) * 4 + 1)
    b -= load<u8>(rowPosition + (column - deviationBase) * 4 + 2)
    
    store<u8>(rowPosition + column * 4    , (r / diameter) as u8)
    store<u8>(rowPosition + column * 4 + 1, (g / diameter) as u8)
    store<u8>(rowPosition + column * 4 + 2, (b / diameter) as u8)
  }

How can I make this for loop faster with SIMD? It would be nice if you could also give me some explanations SIMD在AssemblyScript中

答案1

得分: 1

我试图解决这个问题,我的尝试在这里

因此,处理simd的常规方式,包括AssemblyScript,如下:

  1. 从常规寄存器或内存加载数据到simd寄存器
  2. 计算矢量化的计算
  3. 将数据存储回常规寄存器或内存

AssemblyScript有一个simd数据类型:v128,它可以存储16个u8的值。我们将使用它来进行simd计算。

我使用了这个资源来构建一个框模糊示例。我做出了一些决策,以去除边缘情况,并仅关注simd部分:

  1. 我使用了rgba格式,而不是rgb格式。保持幂的方式有助于减少边缘情况,并更适合v128
  2. 我用值为0的像素填充了要模糊的图像,以减少算法的复杂性。现在无需进行边界检查。

现在,关于实际的算法:我实现了三种算法。box_blur_naive、box_blur_improved和box_blur_simd。

在box_blur_naive中,对于每个像素及其邻居,有熟悉的加载和加法。例如,对于红色:

red += load<u8>(img + i * width * 4 + j * 4 + 0);
red += load<u8>(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load<u8>(img + (i + 1) * width * 4 + (j) * 4 + 0);
red += load<u8>(img + (i) * width * 4 + (j - 1) * 4 + 0);
red += load<u8>(img + (i) * width * 4 + (j + 1) * 4 + 0);

基准时间:29.430秒

要对此进行simd化,我们需要稍微重新排列事物。首先,垂直线(i、i - 1、i + 1)非常容易转换为simd指令。问题在于,使用v128很难将水平邻居相加,因为它们都在同一个寄存器中结束。

幸运的是,对于框模糊来说,有一种简单的方法可以将水平和垂直相加分离,借助一个转置后的图像,这就是box_blur_improved的作用:

red += load<u8>(img + i * width * 4 + j * 4 + 0);
red += load<u8>(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load<u8>(img + (i + 1) * width * 4 + (j) * 4 + 0);
...
red += load<u8>(transposed_img + (i - 1) * height * 4 + (j) * 4 + 0);
red += load<u8>(transposed_img + (i + 1) * height * 4 + (j) * 4 + 0);

基准时间:30.225秒

现在,我们只有垂直相加,所以我们终于可以开始使用v128。将数据加载到v128中:

line = v128.load(img + i * width * 4 + 4 + k * 16);
line_before = v128.load(img + (i - 1) * width * 4 + 4 + k * 16);
line_after = v128.load(img + (i + 1) * width * 4 + 4 + k * 16);

这与之前的load执行相同的操作,但一次加载16个u8值。现在我们可以执行加法操作:

let result = v128.add<u8>(line, v128.add<u8>(line_before, line_after));

这一行执行了之前的所有垂直相加。我们尚未添加转置的相加,因为我将很快解释一个未来的问题。

v128.store(blurred_img + i * width * 4 + 4 + k * 16, result);

这将结果存储在指定地址。这就是全部。

基准时间:17.637秒

simd解决方案似乎可以节省约一半的时间,但代码还没有完全simd化。

最后的问题是,使用v128(以及simd一般)很难进行整数除法。

考虑我们在第一个算法中将数据存储在内存中的方式:

(load<u8>(blurred_img + j * width * 4 + i * 4 + 0) + red) / 5) as u8

我们必须进行除以5的操作,这不是v128上的操作。可能有一些使用位移和减法进行除法的方法。从我们已经获得的速度增益来看,这可能值得一试。

英文:

I tried to tackle this problem, my attempt is here.

So, the usual way of dealing with simd, AssemblyScript being no different, is the following:

  1. load the data into simd registers from regular registers or memory
  2. compute the vectorized calculations
  3. store the data back into regular registers or memory

AssemblyScript has one simd datatype: v128, which can store 16 u8's. We will use it in order to do simd calculations.

I built a box blur example using this resource. I took some decisions in order to remove edge-cases and focus only on the simd part:

  1. I used the rgba format, as opposed to the rgb format. Keeping things as powers of two helps reduce edge-cases, and better fits for the v128.
  2. I padded the image that I am blurring with pixels with values of 0, in order to reduce the complexity of the algorithm. There is no need for bounds checking now.

Now, for the actual algorithms: I implemented three algorithms. The box_blur_naive, box_blur_improved and box_blur_simd.

In the box_blur_naive there is the familiar loading and adding for every pixel and its neighbours. For example, for the red color:

red += load&lt;u8&gt;(img + i * width * 4 + j * 4 + 0);
red += load&lt;u8&gt;(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load&lt;u8&gt;(img + (i + 1) * width * 4 + (j) * 4 + 0);
red += load&lt;u8&gt;(img + (i) * width * 4 + (j - 1) * 4 + 0);
red += load&lt;u8&gt;(img + (i) * width * 4 + (j + 1) * 4 + 0);

Benchmarking time: 29.430s

In order to simdize this, we need to rearrange things a bit. First, the vertical lines (i, i - 1, i + 1) are very easily convertible to simd instructions. The problem is that there is no easy way to add the horizontal neighbours with v128, because they all end up in the same register.

Fortunately, there is an easy way for box blur to separate the horizontal and vertical additions, with the help of a transposed image, which is what box_blur_improved does:

red += load&lt;u8&gt;(img + i * width * 4 + j * 4 + 0);
red += load&lt;u8&gt;(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load&lt;u8&gt;(img + (i + 1) * width * 4 + (j) * 4 + 0);
...
red += load&lt;u8&gt;(transposed_img + (i - 1) * height * 4 + (j) * 4 + 0);
red += load&lt;u8&gt;(transposed_img + (i + 1) * height * 4 + (j) * 4 + 0);

Benchmarking time: 30.225s

Now, we only have vertical additions, so we can finally start bringing in v128. Getting the data into v128:

line = v128.load(img + i * width * 4 + 4 + k * 16);
line_before = v128.load(img + (i - 1) * width * 4 + 4 + k * 16);
line_after = v128.load(img + (i + 1) * width * 4 + 4 + k * 16);

This does the same thing as the loads before, but for 16 u8 values at a single time. We can now perform the additions:

let result = v128.add&lt;u8&gt;(line, v128.add&lt;u8&gt;(line_before, line_after));

This line performs all the vertical additions from before. We have not added the transposed additions, because of a future problem that I will explain shortly.

v128.store(blurred_img + i * width * 4 + 4 + k * 16, result);

This stores the result at the specified address. And that is all.

Benchmarking time: 17.637s

The simd solutions seems to save about half the time, and the code is not fully simdized.

The last problem is that there is no easy way to do integer division in an easy way with v128 (and simd in general).

Consider the way in which we store the data in memory in the first algorithms:

(load&lt;u8&gt;(blurred_img + j * width * 4 + i * 4 + 0) + red) / 5) as u8

We have to do division by 5, which is not an operation on v128. There may be some ways of doing the division by using bit shifts and subtractions. Judging from the speed increase we already obtained, it may be worthwhile to do it.

答案2

得分: 1

除了在 https://stackoverflow.com/a/75494927/11042887 中回答的内容之外,您可以通过标量来模拟i8x16除法,如下所示:

export function i8x16_fastdiv(x: v128, d: u8): v128 {
    let n = u8(23 - clz((u32(d) | 1) << 16)); // 找到移位因子
    // 设置8个“反向除法”单词:
    let inv = i16x8.splat((1 << n) / d + 1);
    // 拆分成两个16位值
    let lo = i16x8.extend_low_i8x16_u(x);
    let hi = i16x8.extend_high_i8x16_u(x);
    // 用“反向除法”和除法相乘
    lo = i16x8.shr_u(i16x8.mul(lo, inv), n);
    hi = i16x8.shr_u(i16x8.mul(hi, inv), n);
    // 重新打包
    return i8x16.narrow_i16x8_u(lo, hi);
}

但请注意,在Firefox上可能比使用提取+替换通道的标量除法版本更快,而在Chrome上可能更慢。在性能方面,Chrome在使用SIMD方面的表现仍然相当不可预测。

英文:

In addition to answered in https://stackoverflow.com/a/75494927/11042887 you can emulate i8x16 division by scalar as:

export function i8x16_fastdiv(x: v128, d: u8): v128 {
    let n = u8(23 - clz((u32(d) | 1) &lt;&lt; 16)); // find shift factor
    // set 8 words of &quot;inverse div&quot;:
    let inv = i16x8.splat((1 &lt;&lt; n) / d + 1);
    // unpack into two 16-bit values
    let lo = i16x8.extend_low_i8x16_u(x);
    let hi = i16x8.extend_high_i8x16_u(x);
    // mul with &quot;inverse div&quot; and div
    lo = i16x8.shr_u(i16x8.mul(lo, inv), n);
    hi = i16x8.shr_u(i16x8.mul(hi, inv), n);
    // repack
    return i8x16.narrow_i16x8_u(lo, hi);
}

But note it may be faster on Firefox while slower on Chrome than scalar division version with extract + replace lanes. Chrome still behaves rather unpredictably with SIMD in term of performance.

huangapple
  • 本文由 发表于 2023年2月14日 00:46:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/75438836.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定