How to get _mm256_rcp_pd in AVX2?

Question

For some reason, _mm256_rcp_pd does not exist in AVX or AVX2.

AVX-512 (AVX512F together with AVX512VL) provides _mm256_rcp14_pd.

Is there a way to get a fast approximate double-precision reciprocal on AVX2? Are we supposed to convert to single precision and then back?

Answer 1

Score: 2

With some integer-cast hacking and one Newton–Raphson refinement step, you can get a somewhat accurate approximation in 3 uops (one integer subtract, one FMA, one multiply). Latency is probably not great, since this mixes integer and floating-point operations, but it should still be much better than divpd.
This solution also assumes that all inputs are normalized doubles.

#include <immintrin.h>  // AVX2 and FMA intrinsics (e.g. build with -mavx2 -mfma)

__m256d fastinv(__m256d y)
{
    // 2046 << 52, i.e. twice the bit pattern of 1.0: gives exact results for powers of two.
    // (The ' digit separators need C++14 or later.)
    __m256i const magic = _mm256_set1_epi64x(0x7fe0'0000'0000'0000);
    // Bit magic: for powers of two this just negates the exponent,
    // and values in between are linearly interpolated.
    __m256d x = _mm256_castsi256_pd(_mm256_sub_epi64(magic, _mm256_castpd_si256(y)));

    // Newton-Raphson refinement: x = x*(2.0 - x*y)
    x = _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, _mm256_set1_pd(2.0)));

    return x;
}

With the constants above, the inverse is exact for powers of two, but has an error of ~1.44% near sqrt(2).
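To make the accuracy claim easier to check, here is a minimal scalar sketch of what a single lane of fastinv computes, written in portable C++ so it runs without AVX2 hardware. The names fastinv_scalar and max_rel_error are hypothetical helpers introduced for illustration, not part of the original answer, and the sketch assumes std::fma behaves like the fused _mm256_fnmadd_pd. By my own derivation (not from the answer), the relative error after one step peaks at about 1/64, roughly 1.56%, for inputs of the form 1.5 × 2^k, which is close to the figure quoted above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar model of one lane of fastinv (hypothetical helper for illustration).
inline double fastinv_scalar(double y)
{
    assert(std::isnormal(y));                  // the trick assumes normalized inputs
    std::uint64_t bits;
    std::memcpy(&bits, &y, sizeof bits);       // stands in for _mm256_castpd_si256
    bits = 0x7fe0000000000000ULL - bits;       // stands in for _mm256_sub_epi64(magic, ...)
    double x;
    std::memcpy(&x, &bits, sizeof x);          // stands in for _mm256_castsi256_pd
    return x * std::fma(-x, y, 2.0);           // one step: x = x*(2.0 - x*y)
}

// Largest relative error |x*y - 1| over `steps` evenly spaced inputs in [1, 2).
inline double max_rel_error(int steps)
{
    double worst = 0.0;
    for (int i = 0; i < steps; ++i) {
        double y = 1.0 + static_cast<double>(i) / steps;
        double err = std::fabs(fastinv_scalar(y) * y - 1.0);
        if (err > worst) worst = err;
    }
    return worst;
}
```

Calling max_rel_error(1024) scans one binade; because the bit trick treats every power-of-two interval the same way, one binade should be representative of all normalized inputs.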

If you fine-tune the magic constant as well as the 2.0 constant or add another NR-step, you can increase the accuracy.
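As a rough illustration of the "another NR step" option, the scalar sketch below applies a configurable number of refinement steps to the same initial guess. The names recip_guess, recip_refined and rel_error are hypothetical, and the code is a portable stand-in for the intrinsics, not the answer's own implementation. Each step roughly squares the relative error: at y = 1.5 it goes from 12.5% (no step) to about 1.56% (one step) to about 0.024% (two steps). In the vector version, an extra step would presumably mean repeating the _mm256_mul_pd / _mm256_fnmadd_pd line, at the cost of two more uops.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Initial guess from the integer subtraction (scalar stand-in for the
// _mm256_sub_epi64 trick; function names here are hypothetical).
inline double recip_guess(double y)
{
    std::uint64_t bits;
    std::memcpy(&bits, &y, sizeof bits);
    bits = 0x7fe0000000000000ULL - bits;
    double x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Apply `steps` Newton-Raphson iterations x = x*(2.0 - x*y).
inline double recip_refined(double y, int steps)
{
    double x = recip_guess(y);
    for (int i = 0; i < steps; ++i)
        x *= std::fma(-x, y, 2.0);
    return x;
}

// Relative error |x*y - 1| for a given input and step count.
inline double rel_error(double y, int steps)
{
    return std::fabs(recip_refined(y, steps) * y - 1.0);
}
```

Tuning the magic constant or the 2.0 constant, as the answer suggests, trades exactness at powers of two for a smaller worst-case error; that variant is not shown here because the answer gives no tuned values.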

Godbolt link: https://godbolt.org/z/f7YhnhT96

huangapple
  • Posted on 2023-03-03 22:35:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75628394.html