Question

For some reason _mm256_rcp_pd is not in AVX or AVX2.

AVX-512 (AVX512F together with AVX512VL) added _mm256_rcp14_pd.

Is there a way to get a fast approximate reciprocal in double precision on AVX2? Are we supposed to convert to single precision and then back?

Answer 1

Score: 2

With some integer-cast hacking and a Newton–Raphson refinement step, you can get a somewhat accurate approximation with 3 uops (an integer subtract, an FMA, and a multiply). Latency is probably not too good, since this involves mixing integer and double operations. But it should be much better than divpd.
This solution also assumes that all inputs are normal doubles (finite, nonzero, and not denormal).

#include <immintrin.h>

__m256d fastinv(__m256d y)
{
    // This constant gives exact results for powers of two
    __m256i const magic = _mm256_set1_epi64x(0x7fe0'0000'0000'0000);
    // Bit-magic: For powers of two this just inverts the exponent,
    // and values between that are linearly interpolated
    __m256d x = _mm256_castsi256_pd(_mm256_sub_epi64(magic, _mm256_castpd_si256(y)));

    // Newton-Raphson refinement: x = x*(2.0 - x*y):
    x = _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, _mm256_set1_pd(2.0)));

    return x;
}

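With the constants above, the inverse is exact for powers of two. In between, the initial guess x0 is up to 12.5% too large, and one NR step turns a relative error e into -e^2, so the result is always slightly too small: the error is about 1.47% near sqrt(2) and peaks at about 1.56% for inputs of the form 1.5 * 2^k.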
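The error figure is easy to check with a scalar model of a single lane. The sketch below is only an illustration: fastinv_scalar and rel_error are made-up names, and std::fma stands in for the fused _mm256_fnmadd_pd so that the rounding matches the vector code.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar model of one lane of fastinv(); illustrative only.
inline double fastinv_scalar(double y)
{
    std::uint64_t bits;
    std::memcpy(&bits, &y, sizeof bits);       // reinterpret the double as an integer
    bits = 0x7fe0000000000000ULL - bits;       // same magic constant as the vector code
    double x;
    std::memcpy(&x, &bits, sizeof x);          // initial guess x0
    return x * std::fma(-x, y, 2.0);           // one NR step: x0 * (2 - x0*y)
}

// Relative error of the approximation compared with true division.
inline double rel_error(double y)
{
    return std::fabs(fastinv_scalar(y) - 1.0 / y) * std::fabs(y);
}
```

For y = 1.5 the initial guess is 0.75, the NR step yields 0.65625, and the relative error against 1/1.5 is 0.015625, i.e. 0.125 squared.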

If you fine-tune the magic constant as well as the 2.0 constant or add another NR-step, you can increase the accuracy.

Godbolt link: https://godbolt.org/z/f7YhnhT96
