April 19, 2023, 21:23:12

Why can't the Rust compiler auto-vectorize this FP dot product implementation?

Question

Let's consider a simple reduction, such as a dot product:

pub fn add(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).fold(0.0, |c, (x, y)| c + x * y)
}

Using rustc 1.68 with -C opt-level=3 -C target-feature=+avx2,+fma, I get:

.LBB0_5:
        vmovss  xmm1, dword ptr [rdi + 4*rsi]
        vmulss  xmm1, xmm1, dword ptr [rdx + 4*rsi]
        vmovss  xmm2, dword ptr [rdi + 4*rsi + 4]
        vaddss  xmm0, xmm0, xmm1
        vmulss  xmm1, xmm2, dword ptr [rdx + 4*rsi + 4]
        vaddss  xmm0, xmm0, xmm1
        vmovss  xmm1, dword ptr [rdi + 4*rsi + 8]
        vmulss  xmm1, xmm1, dword ptr [rdx + 4*rsi + 8]
        vaddss  xmm0, xmm0, xmm1
        vmovss  xmm1, dword ptr [rdi + 4*rsi + 12]
        vmulss  xmm1, xmm1, dword ptr [rdx + 4*rsi + 12]
        lea     rax, [rsi + 4]
        vaddss  xmm0, xmm0, xmm1
        mov     rsi, rax
        cmp     rcx, rax
        jne     .LBB0_5

This is a scalar implementation with loop unrolling; it does not even contract the multiply and add into FMAs. Going from this code to SIMD code should be easy, so why does rustc not perform this optimization?

If I replace f32 with i32 I get the desired auto-vectorization:

.LBB0_5:
        vmovdqu ymm4, ymmword ptr [rdx + 4*rax]
        vmovdqu ymm5, ymmword ptr [rdx + 4*rax + 32]
        vmovdqu ymm6, ymmword ptr [rdx + 4*rax + 64]
        vmovdqu ymm7, ymmword ptr [rdx + 4*rax + 96]
        vpmulld ymm4, ymm4, ymmword ptr [rdi + 4*rax]
        vpaddd  ymm0, ymm4, ymm0
        vpmulld ymm4, ymm5, ymmword ptr [rdi + 4*rax + 32]
        vpaddd  ymm1, ymm4, ymm1
        vpmulld ymm4, ymm6, ymmword ptr [rdi + 4*rax + 64]
        vpmulld ymm5, ymm7, ymmword ptr [rdi + 4*rax + 96]
        vpaddd  ymm2, ymm4, ymm2
        vpaddd  ymm3, ymm5, ymm3
        add     rax, 32
        cmp     r8, rax
        jne     .LBB0_5

Answer 1

Score: 5

This is because floating-point addition is not associative, meaning that in general a+(b+c) != (a+b)+c. Summing floating-point numbers therefore becomes a serial task, because the compiler will not reorder ((a+b)+c)+d into (a+b)+(c+d). The latter can be vectorized, the former cannot.
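To make the non-associativity concrete, here is a minimal sketch. The function names sum_left and sum_right are made up for this illustration; they only fix the grouping of the additions.

```rust
/// Adds three values grouped as (a + b) + c.
pub fn sum_left(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

/// Adds the same three values grouped as a + (b + c).
pub fn sum_right(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}

/// Illustrative inputs: 1.0 is far below the precision of an f32 near 1e20,
/// so adding it to -1e20 rounds back to -1e20 and the 1.0 is lost.
pub fn associativity_demo() -> (f32, f32) {
    let (a, b, c) = (1.0e20_f32, -1.0e20_f32, 1.0_f32);
    (sum_left(a, b, c), sum_right(a, b, c))
}
```

With these inputs the first grouping yields 1.0 and the second yields 0.0, which is why the compiler may not silently change the order of a floating-point sum.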

In most cases the programmer does not care about the differences in summing order.

gcc and clang provide the -fassociative-math flag which will allow the compiler to reorder floating point operations for performance.

rustc does not provide this, and as far as I know LLVM does not accept flags that change this behavior either.

In nightly Rust you can use #![feature(core_intrinsics)] to get the optimization:

#![feature(core_intrinsics)]
pub fn add(a: &[f32], b: &[f32]) -> f32 {
    unsafe {
        a.iter().zip(b.iter()).fold(0.0, |c, (x, y)| std::intrinsics::fadd_fast(c, x * y))
    }
}

This does not use FMA. To get FMA, the multiplication has to be a fast intrinsic as well:

#![feature(core_intrinsics)]
pub fn add(a: &[f32], b: &[f32]) -> f32 {
    unsafe {
        a.iter().zip(b.iter()).fold(0.0, |c, (&x, &y)| std::intrinsics::fadd_fast(c, std::intrinsics::fmul_fast(x, y)))
    }
}

I am not aware of a stable Rust solution that does not involve explicit SIMD intrinsics.
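One workaround sometimes used on stable Rust, which is not part of the original answer, is to spell out the reordering by hand: keep several independent partial sums so that the source itself no longer demands a strictly serial order. The sketch below assumes a hypothetical function name dot_chunked and a hypothetical lane count LANES = 8 (eight f32 values fit in one AVX2 register). Whether the optimizer actually turns the inner loop into SIMD instructions depends on the compiler version and flags, so the generated assembly should be checked.

```rust
/// Number of independent partial sums (an assumption for this sketch).
const LANES: usize = 8;

/// Dot product that accumulates into LANES independent partial sums.
/// The summation order differs from a plain left-to-right fold, so the
/// result may differ from it in the last bits.
pub fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    // Like zip, only consider the common prefix of both slices.
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);

    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // Elements that do not fill a whole chunk are handled separately.
    let a_rem = a_chunks.remainder();
    let b_rem = b_chunks.remainder();

    let mut acc = [0.0_f32; LANES];
    for (ca, cb) in a_chunks.zip(b_chunks) {
        // Each lane has its own dependency chain.
        for i in 0..LANES {
            acc[i] += ca[i] * cb[i];
        }
    }

    // Combine the partial sums, then add the leftover elements.
    let mut sum: f32 = acc.iter().sum();
    for (x, y) in a_rem.iter().zip(b_rem) {
        sum += x * y;
    }
    sum
}
```

For example, dot_chunked(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]) returns 32.0. The design trades exact agreement with the sequential sum for loop iterations that are independent across lanes.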
