How to implement vectorized "exp" and "log" base-2 functions using AVX-512
Question
For a game that I'm working on, I need the ability to run millions of "exp" calls on vectors. Basically:
void vector_exp(const double *x, double *result, int n)
{
    for (int i = 0; i < n; i++) result[i] = exp(x[i]);
}
For my specific case, inputs are all in -50..+50. I need double precision, matching the current 'exp' to 8 decimal places, to pass test cases.
I have the same challenge with 'log'; its input range is 1e-7 to 1e7.
I would like to use the AVX-512 instructions, which should (in theory) be able to process 8 doubles at a time. I've retrieved the glibc code (both the "C" version and the ".S" version built for AVX), but I'm not sure how to move forward from here.
https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu
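To show the overall loop structure (this sketch is not from the original post): each iteration handles 8 doubles, with an AVX-512 mask covering the final partial vector; vexp_pd is a placeholder for whichever 8-wide exp implementation you end up with, e.g. one of those in the answer below.

#include <immintrin.h>

__m512d vexp_pd(__m512d x);   // placeholder: any 8-wide exp implementation

void vector_exp(const double *x, double *result, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8)                 // full 8-element vectors
        _mm512_storeu_pd(result + i, vexp_pd(_mm512_loadu_pd(x + i)));
    if (i < n) {                               // 1..7 leftover elements
        __mmask8 k = (__mmask8)((1u << (n - i)) - 1);
        __m512d v = _mm512_maskz_loadu_pd(k, x + i);
        _mm512_mask_storeu_pd(result + i, k, vexp_pd(v));
    }
}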
Answer 1
Score: 3
I'm sure the other answers are better than mine - running a very quick and dirty polynomial approximation, I end up with these.
#include <immintrin.h>

inline __m512d exp2(const __m512d x) {
    // 2^x = 2^floor(x) * 2^frac(x); approximate 2^frac with a degree-6 polynomial over [0,1).
    const __m512d a = _mm512_set1_pd(0.000217549227054);
    const __m512d b = _mm512_set1_pd(0.00124218531444);
    const __m512d c = _mm512_set1_pd(0.00968102455999);
    const __m512d d = _mm512_set1_pd(0.0554821818101);
    const __m512d e = _mm512_set1_pd(0.240230073528);
    const __m512d f = _mm512_set1_pd(0.693146979806);
    const __m512d g = _mm512_set1_pd(1.0);

    const __m512d fx = _mm512_floor_pd(x);   // integer part
    const __m512d X  = _mm512_sub_pd(x, fx); // fractional part, in [0,1)

    __m512d y = _mm512_fmadd_pd(a, X, b);    // Horner evaluation
    y = _mm512_fmadd_pd(y, X, c);
    y = _mm512_fmadd_pd(y, X, d);
    y = _mm512_fmadd_pd(y, X, e);
    y = _mm512_fmadd_pd(y, X, f);
    y = _mm512_fmadd_pd(y, X, g);            // polynomial approximation of 2^X over [0,1)
    return _mm512_scalef_pd(y, fx);          // scale by 2^integer
}

inline __m512d exp(const __m512d x) {
    // e^x = 2^(x * log2(e))
    return exp2(_mm512_mul_pd(x, _mm512_set1_pd(1.442695040888963387)));
}

inline __m512d log2(const __m512d x) {
    // Split x into mantissa in [1,2) and exponent; approximate log2(mantissa) with a degree-7 polynomial.
    const __m512d m = _mm512_getmant_pd(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_zero);
    const __m512d a = _mm512_set1_pd(0.0146498917256);
    const __m512d b = _mm512_set1_pd(-0.178725976271);
    const __m512d c = _mm512_set1_pd(0.953841083567);
    const __m512d d = _mm512_set1_pd(-2.92298892586);
    const __m512d e = _mm512_set1_pd(5.68725545823);
    const __m512d f = _mm512_set1_pd(-7.4092580291);
    const __m512d g = _mm512_set1_pd(7.09194627711);
    const __m512d h = _mm512_set1_pd(-3.23671917705);

    __m512d y = _mm512_fmadd_pd(a, m, b);    // Horner evaluation
    y = _mm512_fmadd_pd(y, m, c);
    y = _mm512_fmadd_pd(y, m, d);
    y = _mm512_fmadd_pd(y, m, e);
    y = _mm512_fmadd_pd(y, m, f);
    y = _mm512_fmadd_pd(y, m, g);
    y = _mm512_fmadd_pd(y, m, h);            // polynomial approximation over the [1,2) mantissa
    return _mm512_add_pd(y, _mm512_getexp_pd(x));  // add the exponent back in
}

inline __m512d log(const __m512d x) {
    // ln(x) = log2(x) * ln(2)
    return _mm512_mul_pd(log2(x), _mm512_set1_pd(0.693147180559945286));
}
Out-of-order execution across independent exp2() or log2() operations can handle the FMA dependency chains from using Horner's rule for these 6th- and 7th-order polynomials.
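Since the question asks for 8-decimal agreement with the current exp, it's worth measuring this approximation before relying on it. A minimal check harness (not part of the original answer; it assumes the vector exp() above is in scope and the file is compiled with AVX-512 enabled, e.g. -mavx512f):

#include <immintrin.h>
#include <cmath>
#include <cstdio>

int main()
{
    double in[8], out[8], max_rel = 0.0;
    for (double v = -50.0; v <= 50.0; v += 1e-3) {           // sweep the stated input range
        for (int i = 0; i < 8; i++) in[i] = v + i * 1.25e-4; // 8 nearby sample points
        _mm512_storeu_pd(out, exp(_mm512_loadu_pd(in)));     // vector approximation above
        for (int i = 0; i < 8; i++) {
            double ref = std::exp(in[i]);                    // libm reference
            double rel = std::fabs(out[i] - ref) / ref;
            if (rel > max_rel) max_rel = rel;
        }
    }
    std::printf("max relative error vs libm exp: %g\n", max_rel);
    return 0;
}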
See also Agner Fog's VCL implementations, which aim for high precision, close to the full precision of double (a minimal usage sketch follows this list):

- The double-precision exp_d template, supporting a few bases including 2.0, and exp(x)-1 vs. exp(x). (See the exp2 caller for the right template params.) It uses a 13th-order Taylor series, which comments in the code indicate is faster than the alternate version using a Pade expansion: a ratio of two polynomials. One division per many FMAs isn't a disaster for throughput, especially if you have lots of surrounding code that also does some non-division work with each result, but doing just this might be too much division per FMA.
- The double-precision log_d template. This does use a ratio of 5th-order polynomials for the mantissa. Template params support log(x) vs. log(x+1) to avoid losing precision. It only does natural log (base e), so you'll need to scale the result by 1/ln(2).
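For reference, here is roughly what the VCL route could look like; this sketch assumes VCL 2.x, where vectorclass.h provides Vec8d and vectormath_exp.h provides exp() and log() for it (the wrapper name vector_exp_vcl is mine, not from VCL):

#include "vectorclass.h"       // Agner Fog's VCL: Vec8d
#include "vectormath_exp.h"    // exp(), log() for VCL vector types
#include <cmath>

void vector_exp_vcl(const double *x, double *result, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        Vec8d v;
        v.load(x + i);              // unaligned load of 8 doubles
        exp(v).store(result + i);   // VCL's vector exp
    }
    for (; i < n; i++)              // scalar tail
        result[i] = std::exp(x[i]);
}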