How to implement vectorized "exp" and "log" base-2 functions using AVX-512

Question

For a game that I'm working on - I need the ability to run millions of "exp" calls on vectors. Basically:

#include <math.h>

void vector_exp(const double *x, double *result, int n)
{
    for (int i = 0; i < n; i++) result[i] = exp(x[i]);
}

For my specific case, inputs are all in -50..+50. I need double precision, matching the current 'exp' to 8 decimal places, to pass test cases.

I have the same challenge with 'log'. The input range is 1e-7 to 1e7.

I would like to utilize the AVX-512 instructions, which should (in theory) be able to process 8 double-precision values at a time. I've retrieved the glibc code (both the "C" version and the ".S" version built for AVX), but I'm not sure how to move forward from here.

https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu
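
For reference, a minimal sketch of how the scalar loop could dispatch to 8-wide AVX-512 vectors, with a masked tail for the last n % 8 elements. The kernel name vexp_pd is a placeholder (not from the original post); the answer below shows one possible implementation of it.

#include <immintrin.h>

// Placeholder for an 8-wide exp kernel, e.g. the one in the answer below.
__m512d vexp_pd(__m512d x);

void vector_exp(const double *x, double *result, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8)              // main loop: 8 doubles per iteration
        _mm512_storeu_pd(result + i, vexp_pd(_mm512_loadu_pd(x + i)));
    if (i < n) {                            // tail: masked load/store of the remaining 1..7 lanes
        __mmask8 k = (__mmask8)((1u << (n - i)) - 1u);
        __m512d t = _mm512_maskz_loadu_pd(k, x + i);
        _mm512_mask_storeu_pd(result + i, k, vexp_pd(t));
    }
}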

Answer 1

Score: 3

I'm sure the other answers are better than mine, but running a very quick and dirty polynomial approximation, I ended up with these.

#include <immintrin.h>

inline __m512d exp2(const __m512d x) {
    const __m512d a = _mm512_set1_pd(0.000217549227054);
    const __m512d b = _mm512_set1_pd(0.00124218531444);
    const __m512d c = _mm512_set1_pd(0.00968102455999);
    const __m512d d = _mm512_set1_pd(0.0554821818101);
    const __m512d e = _mm512_set1_pd(0.240230073528);
    const __m512d f = _mm512_set1_pd(0.693146979806);
    const __m512d g = _mm512_set1_pd(1.0);
    const __m512d fx = _mm512_floor_pd(x);  // integer part
    const __m512d X = _mm512_sub_pd(x, fx); // fractional part
    __m512d y = _mm512_fmadd_pd(a, X, b);
    y = _mm512_fmadd_pd(y, X, c);
    y = _mm512_fmadd_pd(y, X, d);
    y = _mm512_fmadd_pd(y, X, e);
    y = _mm512_fmadd_pd(y, X, f);
    y = _mm512_fmadd_pd(y, X, g);      // polynomial approximation over [0,1)
    return _mm512_scalef_pd(y, fx);    // scale by 2^integer
}

inline __m512d exp(const __m512d x) {
    return exp2(_mm512_mul_pd(x, _mm512_set1_pd(1.442695040888963387)));
}

inline __m512d log2(const __m512d x) {
    const __m512d m = _mm512_getmant_pd(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_zero);
    const __m512d a = _mm512_set1_pd(0.0146498917256);
    const __m512d b = _mm512_set1_pd(-0.178725976271);
    const __m512d c = _mm512_set1_pd(0.953841083567);
    const __m512d d = _mm512_set1_pd(-2.92298892586);
    const __m512d e = _mm512_set1_pd(5.68725545823);
    const __m512d f = _mm512_set1_pd(-7.4092580291);
    const __m512d g = _mm512_set1_pd(7.09194627711);
    const __m512d h = _mm512_set1_pd(-3.23671917705);
    __m512d y = _mm512_fmadd_pd(a, m, b);
    y = _mm512_fmadd_pd(y, m, c);
    y = _mm512_fmadd_pd(y, m, d);
    y = _mm512_fmadd_pd(y, m, e);
    y = _mm512_fmadd_pd(y, m, f);
    y = _mm512_fmadd_pd(y, m, g);
    y = _mm512_fmadd_pd(y, m, h);  // poly approximation over [1,2) mantissa
    return _mm512_add_pd(y, _mm512_getexp_pd(x));
}

inline __m512d log(const __m512d x) {
    return _mm512_mul_pd(log2(x), _mm512_set1_pd(0.693147180559945286));
}
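
A quick way to exercise these overloads and measure how close the quick-and-dirty polynomials get to the "8 decimals" requirement over the -50..+50 range (the harness below is only a sketch; it compares lane 0 of the vector result against std::exp):

#include <cmath>
#include <cstdio>

int main() {
    double worst = 0.0;
    for (double v = -50.0; v <= 50.0; v += 0.001) {
        __m512d y = exp(_mm512_set1_pd(v));          // vector overload defined above
        double approx = _mm512_cvtsd_f64(y);         // extract lane 0
        double rel = std::fabs(approx - std::exp(v)) / std::exp(v);
        if (rel > worst) worst = rel;
    }
    std::printf("worst relative error: %g\n", worst);
    return 0;
}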

Out-of-order execution across independent exp2() or log2() calls can hide the latency of the FMA dependency chains that come from using Horner's rule for the degree-6 and degree-7 polynomials.
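
To make that concrete, here is a sketch of an array loop unrolled by two, so the two Horner chains are independent and the out-of-order core can overlap their FMA latencies (the function names refer to the overloads above; the unroll factor and the assumption that n is a multiple of 16 are just for illustration):

void vector_exp_unrolled(const double *x, double *result, int n)
{
    for (int i = 0; i + 16 <= n; i += 16) {
        __m512d a = exp(_mm512_loadu_pd(x + i));       // chain 1
        __m512d b = exp(_mm512_loadu_pd(x + i + 8));   // chain 2, independent of chain 1
        _mm512_storeu_pd(result + i, a);
        _mm512_storeu_pd(result + i + 8, b);
    }
}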


See also Agner Fog's VCL implementations, which aim for high precision, close to the full precision of double:

  • double-precision exp_d template, supporting a few bases including 2.0, and exp(x)-1 vs. exp(x). (See the exp2 caller for the right template params.)

    Uses a 13th-order Taylor series, which comments in the code indicate is faster than the alternative, a Padé expansion: a ratio of two polynomials. One division per many FMAs isn't a disaster for throughput, especially if the surrounding code also does some non-division work with each result, but doing just this in a tight loop might be too much division per FMA.

  • double-precision log_d template. This does use a ratio of 5th-order polynomials for the mantissa. Template params support log(x) vs. log(x+1) to avoid losing precision. It only does natural log (base e), so you'll need to scale the result by 1/ln(2) to get log2, as in the sketch after this list.
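
For illustration, a rough sketch of how calling the VCL versions could look (assuming VCL 2.x, with vectorclass.h and vectormath_exp.h on the include path; the 1/ln(2) scaling is the point from the last bullet, and the tail handling is omitted):

#include "vectorclass.h"        // Agner Fog's VCL: https://github.com/vectorclass/version2
#include "vectormath_exp.h"     // exp/log for Vec8d

void vector_exp_vcl(const double *x, double *result, int n)
{
    for (int i = 0; i + 8 <= n; i += 8) {
        Vec8d v;
        v.load(x + i);                      // unaligned load of 8 doubles
        exp(v).store(result + i);           // close to full double precision
    }
}

Vec8d log2_vcl(Vec8d v)
{
    return log(v) * 1.4426950408889634;     // natural log scaled by 1/ln(2) = log2(e)
}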
