How to implement vectorized "exp" and "log" base-2 functions using AVX-512
Question
For a game that I'm working on, I need the ability to run millions of "exp" calls on vectors. Basically:
void vector_exp(const double *x, double *result, int n)
{
    for (int i = 0; i < n; i++) result[i] = exp(x[i]);
}
For my specific case, inputs are all in -50..+50. I need double precision, matching the current 'exp' to 8 decimal places, to pass test cases.
I have the same challenge with 'log'; its input range is 1e-7 to 1e7.
I would like to use the AVX-512 instructions, which should (in theory) be able to process 8 doubles at a time. I've retrieved the glibc code (both the "C" version and the ".S" version built for AVX), but I'm not sure how to move forward from here.
https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu
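To show the overall loop structure (this sketch is not from the original post): each iteration handles 8 doubles, with an AVX-512 mask covering the final partial vector; vexp_pd is a placeholder for whichever 8-wide exp implementation you end up with, e.g. one of those in the answer below.

#include <immintrin.h>

__m512d vexp_pd(__m512d x);   // placeholder: any 8-wide exp implementation

void vector_exp(const double *x, double *result, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8)                 // full 8-element vectors
        _mm512_storeu_pd(result + i, vexp_pd(_mm512_loadu_pd(x + i)));
    if (i < n) {                               // 1..7 leftover elements
        __mmask8 k = (__mmask8)((1u << (n - i)) - 1);
        __m512d v = _mm512_maskz_loadu_pd(k, x + i);
        _mm512_mask_storeu_pd(result + i, k, vexp_pd(v));
    }
}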
Answer 1
Score: 3
I'm sure the other answers are better than mine - running a very quick and dirty polynomial approximation, I end up with these.
#include <immintrin.h>

inline __m512d exp2(const __m512d x) {
    // 2^x = 2^floor(x) * 2^frac(x); approximate 2^frac with a degree-6 polynomial over [0,1).
    const __m512d a = _mm512_set1_pd(0.000217549227054);
    const __m512d b = _mm512_set1_pd(0.00124218531444);
    const __m512d c = _mm512_set1_pd(0.00968102455999);
    const __m512d d = _mm512_set1_pd(0.0554821818101);
    const __m512d e = _mm512_set1_pd(0.240230073528);
    const __m512d f = _mm512_set1_pd(0.693146979806);
    const __m512d g = _mm512_set1_pd(1.0);

    const __m512d fx = _mm512_floor_pd(x);   // integer part
    const __m512d X  = _mm512_sub_pd(x, fx); // fractional part, in [0,1)

    __m512d y = _mm512_fmadd_pd(a, X, b);    // Horner evaluation
    y = _mm512_fmadd_pd(y, X, c);
    y = _mm512_fmadd_pd(y, X, d);
    y = _mm512_fmadd_pd(y, X, e);
    y = _mm512_fmadd_pd(y, X, f);
    y = _mm512_fmadd_pd(y, X, g);            // polynomial approximation of 2^X over [0,1)
    return _mm512_scalef_pd(y, fx);          // scale by 2^integer
}

inline __m512d exp(const __m512d x) {
    // e^x = 2^(x * log2(e))
    return exp2(_mm512_mul_pd(x, _mm512_set1_pd(1.442695040888963387)));
}

inline __m512d log2(const __m512d x) {
    // Split x into mantissa in [1,2) and exponent; approximate log2(mantissa) with a degree-7 polynomial.
    const __m512d m = _mm512_getmant_pd(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_zero);
    const __m512d a = _mm512_set1_pd(0.0146498917256);
    const __m512d b = _mm512_set1_pd(-0.178725976271);
    const __m512d c = _mm512_set1_pd(0.953841083567);
    const __m512d d = _mm512_set1_pd(-2.92298892586);
    const __m512d e = _mm512_set1_pd(5.68725545823);
    const __m512d f = _mm512_set1_pd(-7.4092580291);
    const __m512d g = _mm512_set1_pd(7.09194627711);
    const __m512d h = _mm512_set1_pd(-3.23671917705);

    __m512d y = _mm512_fmadd_pd(a, m, b);    // Horner evaluation
    y = _mm512_fmadd_pd(y, m, c);
    y = _mm512_fmadd_pd(y, m, d);
    y = _mm512_fmadd_pd(y, m, e);
    y = _mm512_fmadd_pd(y, m, f);
    y = _mm512_fmadd_pd(y, m, g);
    y = _mm512_fmadd_pd(y, m, h);            // polynomial approximation over the [1,2) mantissa
    return _mm512_add_pd(y, _mm512_getexp_pd(x));  // add the exponent back in
}

inline __m512d log(const __m512d x) {
    // ln(x) = log2(x) * ln(2)
    return _mm512_mul_pd(log2(x), _mm512_set1_pd(0.693147180559945286));
}
Out-of-order execution across independent exp2() or log2() operations can handle the FMA dependency chains from using Horner's rule for these 6th- and 7th-order polynomials.
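Since the question asks for 8-decimal agreement with the current exp, it's worth measuring this approximation before relying on it. A minimal check harness (not part of the original answer; it assumes the vector exp() above is in scope and the file is compiled with AVX-512 enabled, e.g. -mavx512f):

#include <immintrin.h>
#include <cmath>
#include <cstdio>

int main()
{
    double in[8], out[8], max_rel = 0.0;
    for (double v = -50.0; v <= 50.0; v += 1e-3) {           // sweep the stated input range
        for (int i = 0; i < 8; i++) in[i] = v + i * 1.25e-4; // 8 nearby sample points
        _mm512_storeu_pd(out, exp(_mm512_loadu_pd(in)));     // vector approximation above
        for (int i = 0; i < 8; i++) {
            double ref = std::exp(in[i]);                    // libm reference
            double rel = std::fabs(out[i] - ref) / ref;
            if (rel > max_rel) max_rel = rel;
        }
    }
    std::printf("max relative error vs libm exp: %g\n", max_rel);
    return 0;
}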
See also Agner Fog's VCL implementations, which aim for high precision, close to the full precision of double (a minimal usage sketch follows this list):

- The double-precision exp_d template, supporting a few bases including 2.0, and exp(x)-1 vs. exp(x). (See the exp2 caller for the right template params.) It uses a 13th-order Taylor series, which comments in the code indicate is faster than the alternate version using a Pade expansion: a ratio of two polynomials. One division per many FMAs isn't a disaster for throughput, especially if you have lots of surrounding code that also does some non-division work with each result, but doing just this might be too much division per FMA.
- The double-precision log_d template. This does use a ratio of 5th-order polynomials for the mantissa. Template params support log(x) vs. log(x+1) to avoid losing precision. It only does natural log (base e), so you'll need to scale the result by 1/ln(2).
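For reference, here is roughly what the VCL route could look like; this sketch assumes VCL 2.x, where vectorclass.h provides Vec8d and vectormath_exp.h provides exp() and log() for it (the wrapper name vector_exp_vcl is mine, not from VCL):

#include "vectorclass.h"       // Agner Fog's VCL: Vec8d
#include "vectormath_exp.h"    // exp(), log() for VCL vector types
#include <cmath>

void vector_exp_vcl(const double *x, double *result, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        Vec8d v;
        v.load(x + i);              // unaligned load of 8 doubles
        exp(v).store(result + i);   // VCL's vector exp
    }
    for (; i < n; i++)              // scalar tail
        result[i] = std::exp(x[i]);
}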