Speed up nested loops doing products of popcounts of intersections of pairs of elements within each of 3 arrays
Question
I have a deceptively innocent-looking function f, which is called in a tight loop and is causing a speed bottleneck. Any insights on how I can improve it?
#include <stdint.h>

#define N 48
// N = 47 is also relevant

int f(
    const uint16_t * __restrict A,
    const uint16_t * __restrict B,
    const uint16_t * __restrict C) {
    int E = 0;
    for (int r = 0; r < N; ++r) {
        for (int s = r; s < N; ++s) {
            int pa = __builtin_popcount(A[r] & A[s]);
            int pb = __builtin_popcount(B[r] & B[s]);
            int pc = __builtin_popcount(C[r] & C[s]);
            E += pa * pb * pc;
        }
    }
    return E;
}
What I've tried:
(a) preprocessing the A, B, C arrays in linear time and entering the double loop only with those triplets for which pa*pb*pc would be non-zero. However, since the distribution of bits in A, B, C is uniform, almost nothing is filtered out.
(b) blocking to minimize cache misses
(c) repacking A, B, and C into uint64_ts and reworking the popcount so that it processes 4 inputs at once (a rough sketch of this repacking follows below).
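For concreteness, here is a rough sketch of what (c) might look like; this is my reconstruction under stated assumptions (helper names pack4 and popcnt_per_lane16 are hypothetical), not the OP's actual code:

#include <stdint.h>

// Pack 4 consecutive uint16_t values into one uint64_t (element 0 in the low bits).
static inline uint64_t pack4(const uint16_t *p) {
    return (uint64_t)p[0] | ((uint64_t)p[1] << 16) |
           ((uint64_t)p[2] << 32) | ((uint64_t)p[3] << 48);
}

// SWAR popcount per 16-bit lane: afterwards, the low byte of each
// 16-bit lane holds that lane's bit count.
static inline uint64_t popcnt_per_lane16(uint64_t v) {
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (v + (v >> 8)) & 0x00FF00FF00FF00FFULL;
}

// Usage: four values of pa at once, for s..s+3 (lane k holds pa for s+k):
// uint64_t ar4 = 0x0001000100010001ULL * (uint64_t)A[r]; // broadcast A[r]
// uint64_t pa4 = popcnt_per_lane16(ar4 & pack4(&A[s]));
// int pa0 = pa4 & 0xFF, pa1 = (pa4 >> 16) & 0xFF; // etc.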
None of these helped. It seems that the sheer number of iterations (~1000) is the main issue. Is there something I'm missing here?
EDIT. I can assume that AVX2 is available on the target processor. Relevant compiler options currently include -O3, -mpopcnt, -funroll-loops, -mtune=native and -march=native.
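For reference, these options assemble into an invocation like the following (an assumed command line, since the build setup isn't shown; with GCC, -march=native already implies -mtune=native and enables popcnt on hardware that has it):

gcc -O3 -mpopcnt -funroll-loops -mtune=native -march=native -c f.c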
Answer 1
Score: 3
Doing the lower triangular instead of upper triangular matrices, and splitting off the diagonal, resulted in optimizations that sped it up by a factor of 2.7 for me (on Clang 14.0.3 with -O3, Apple M1). It enabled the use of vector (NEON) instructions and some loop unrolling:
int f2(const uint16_t * __restrict a,
       const uint16_t * __restrict b,
       const uint16_t * __restrict c) {
    int e = 0;
    // Strictly lower-triangular part (s < r).
    for (int r = 1; r < N; ++r)
        for (int s = 0; s < r; ++s) {
            int pa = __builtin_popcount(a[r] & a[s]);
            int pb = __builtin_popcount(b[r] & b[s]);
            int pc = __builtin_popcount(c[r] & c[s]);
            e += pa * pb * pc;
        }
    // Diagonal (r == s): x & x == x, so no AND is needed.
    for (int r = 0; r < N; ++r) {
        int pa = __builtin_popcount(a[r]);
        int pb = __builtin_popcount(b[r]);
        int pc = __builtin_popcount(c[r]);
        e += pa * pb * pc;
    }
    return e;
}
I also tried using a table lookup instead of the popcount instruction, but that was slower on the M1 as well as on an Intel i7 (with -march=native). Clang 11 on the i7 was not able to do significantly better with this version than with the original (just a 10% improvement).
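For what it's worth, one plausible shape of that table-lookup variant (my guess, since the answer doesn't show its code) is a 256-entry byte LUT applied to both halves of each uint16_t:

#include <stdint.h>

// Hypothetical table-lookup popcount: 256-entry LUT, one lookup per byte.
static uint8_t popcnt_lut[256];

static void init_popcnt_lut(void) {
    for (int i = 0; i < 256; ++i)
        popcnt_lut[i] = (uint8_t)__builtin_popcount((unsigned)i);
}

static inline int popcount16_lut(uint16_t x) {
    return popcnt_lut[x & 0xFF] + popcnt_lut[x >> 8];
}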
Answer 2
Score: 3
The OP didn't originally specify what hardware support could be assumed, and in this case the hardware makes a ton of difference. To support older hardware, here is a function which does the job requiring only SSE2, without hardware support for popcounts. The SSE2 version uses the popcnt8() function I found at https://stackoverflow.com/questions/17354971/fast-counting-the-number-of-set-bits-in-m128i-register:
#include <stdint.h>
#include <emmintrin.h> // SSE2 intrinsics

static inline __m128i popcnt8(__m128i x) {
    const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
    const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
    __m128i n;
    // Count bits in each 4-bit field:
    // x - (x>>1 & 7) - (x>>2 & 3) - (x>>3 & 1) per nibble.
    n = _mm_srli_epi64(x, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    // Sum the two nibble counts of each byte.
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
    x = _mm_and_si128(popcount_mask2, x);
    return x;
}
The following function is very slow compared to the OP's function except when hardware popcount is not available. On my Nehalem i5 CPU I measured the OP's function at around 8000-9000 rdtsc 'cycles' with hardware popcount enabled and around 40000 without. The SSE2 (gcc -msse2) function measured about 19000 cycles. It does work without requiring changes for different values of N.
int f6(
    const uint16_t * __restrict A,
    const uint16_t * __restrict B,
    const uint16_t * __restrict C) {
    int r, s, E = 0;
    __m128i ABCr, ABCs, ABC[N];
    union {
        __m128i v;
        uint8_t u[16];
    } popcounts;
    // Interleave the inputs: ABC[r] = (A[r], B[r], C[r], 0, ...).
    for (r = 0; r < N; ++r) {
        ABC[r] = _mm_setr_epi16(A[r], B[r], C[r], 0, 0, 0, 0, 0);
    }
    for (r = 0; r < N; ++r) {
        ABCr = ABC[r];
        // Diagonal term (s == r): A[r] & A[r] == A[r], etc.
        ABCs = popcnt8(ABCr);
        // Sum adjacent byte counts so that u[1], u[3], u[5] hold the
        // 16-bit popcounts for the A, B, C elements respectively.
        popcounts.v = _mm_bslli_si128(ABCs, 1);
        popcounts.v = _mm_add_epi8(popcounts.v, ABCs);
        E += (popcounts.u[1])*(popcounts.u[3])*(popcounts.u[5]);
        for (s = r+1; s < N; s++) {
            ABCs = ABC[s];
            ABCs = _mm_and_si128(ABCs, ABCr);
            ABCs = popcnt8(ABCs);
            popcounts.v = _mm_bslli_si128(ABCs, 1);
            popcounts.v = _mm_add_epi8(popcounts.v, ABCs);
            E += (popcounts.u[1])*(popcounts.u[3])*(popcounts.u[5]);
        }
    }
    return E;
}
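For reference, here is a minimal sketch of the kind of rdtsc timing behind the cycle counts above (an assumption; the actual benchmark harness isn't shown, and serialization via cpuid/rdtscp is omitted for brevity):

#include <stdint.h>
#include <x86intrin.h>

// Time one call of a candidate function in rdtsc 'cycles'.
static uint64_t time_call(int (*fn)(const uint16_t *, const uint16_t *, const uint16_t *),
                          const uint16_t *A, const uint16_t *B, const uint16_t *C,
                          int *result) {
    uint64_t t0 = __rdtsc();
    *result = fn(A, B, C);
    return __rdtsc() - t0;
}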
While this function's performance won't impress anyone, I found writing and benchmarking it interesting, and thought some others might find it interesting too. Since SSE2 support is quite commonly assumed as a minimum, and coding for multiple architectures can be very complex, I thought there was some value in sharing what I did. If the OP had asked for widely compatible code assuming no more than SSE2 support, then this could have been a worthwhile improvement, I think.
EDIT:
I made a faster SSE2 version of the function by reordering the calculation. It runs slightly faster than the hardware-popcount versions, at about 5900 cycles. I know the OP wanted AVX2, but I think this approach is of interest if someone wants to create a manually vectorised version in AVX2 or AVX512.
SSE2 + AVX512:
_mm_popcnt_epi16 is an AVX512 intrinsic which, if substituted in place of X_mm_popcnt_epi16, should give a decent performance gain I think, but I don't have any AVX512-supporting hardware to provide benchmarks.
// Emulated per-16-bit-lane popcount using SSE2 only.
static inline __m128i X_mm_popcnt_epi16(__m128i v) {
    // Taken from https://stackoverflow.com/questions/6431692/tweaking-mits-bitcount-algorithm-to-count-words-in-parallel
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x5555)),
                      _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi16(0x5555)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x3333)),
                      _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi16(0x3333)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x0f0f)),
                      _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi16(0x0f0f)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x00ff)), _mm_srli_epi16(v, 8));
    return v;
}
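A small sketch of how that AVX512 substitution could be wired up at compile time (my addition, untested, since I have no AVX512-BITALG/AVX512VL hardware):

#if defined(__AVX512BITALG__) && defined(__AVX512VL__)
#define POP16(v) _mm_popcnt_epi16(v)   // native per-lane popcount
#else
#define POP16(v) X_mm_popcnt_epi16(v)  // SSE2 emulation above
#endif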
static inline __m128i getBrs(
    const uint16_t * __restrict A,
    const uint16_t * __restrict B,
    const uint16_t * __restrict C, uint32_t r, uint32_t s) {
    uint32_t i;
    __m128i Evec, popA, popB, popC, temp, temp2, Ar, Br, Cr, As, Bs, Cs;
    Evec = _mm_setzero_si128();
    /* getBrs does something like:
       uint32_t EB = 0;
       uint32_t j;
       for (i = r; i < r+8; i++) {
           for (j = s; j < s+8; j++) {
               EB += __builtin_popcount(A[i] & A[j])
                   * __builtin_popcount(B[i] & B[j])
                   * __builtin_popcount(C[i] & C[j]);
           }
       }
    */
    Ar = _mm_loadu_si128( (const __m128i*)&A[r]);
    Br = _mm_loadu_si128( (const __m128i*)&B[r]);
    Cr = _mm_loadu_si128( (const __m128i*)&C[r]);
    As = _mm_loadu_si128( (const __m128i*)&A[s]);
    Bs = _mm_loadu_si128( (const __m128i*)&B[s]);
    Cs = _mm_loadu_si128( (const __m128i*)&C[s]);
    for (i=0; i<8; i++) {
        // Rotate the s-block by one 16-bit element per iteration, so
        // that over 8 iterations each lane k pairs A[r+k] with every
        // element of the s-block exactly once.
        As = _mm_bsrli_si128(As, 2);
        As = _mm_insert_epi16(As, A[s], 7);
        Bs = _mm_bsrli_si128(Bs, 2);
        Bs = _mm_insert_epi16(Bs, B[s], 7);
        Cs = _mm_bsrli_si128(Cs, 2);
        Cs = _mm_insert_epi16(Cs, C[s], 7);
        temp = _mm_and_si128(Ar, As);
        popA = X_mm_popcnt_epi16(temp);
        temp = _mm_and_si128(Br, Bs);
        popB = X_mm_popcnt_epi16(temp);
        temp = _mm_and_si128(Cr, Cs);
        popC = X_mm_popcnt_epi16(temp);
        // Per-lane product pa*pb*pc, accumulated across iterations.
        temp = _mm_mullo_epi16(popA, popB);
        temp2 = _mm_mullo_epi16(temp, popC);
        Evec = _mm_add_epi16(Evec, temp2);
        s++;
    }
    // Widen the eight 16-bit partial sums to four 32-bit sums.
    return _mm_madd_epi16(Evec, _mm_set1_epi16(1));
}
int f8(
    const uint16_t * __restrict A,
    const uint16_t * __restrict B,
    const uint16_t * __restrict C) {
    uint32_t r, i, j, Ediag = 0, E = 0;
    __m128i Evec, popA, popB, popC, temp, temp2, Ar, Br, Cr;
    Evec = _mm_setzero_si128();
    union {
        __m128i v;
        uint32_t u32[4];
    } popcounts;
    /* The diagonal, i.e.
       for (i = 0; i < N; i++) {
           Ediag += __builtin_popcount(A[i] & A[i])
                  * __builtin_popcount(B[i] & B[i])
                  * __builtin_popcount(C[i] & C[i]);
       }
    */
    for (r = 0; r < N; r += 8) { // assumes N is a multiple of 8
        Ar = _mm_loadu_si128( (const __m128i*)&A[r]);
        Br = _mm_loadu_si128( (const __m128i*)&B[r]);
        Cr = _mm_loadu_si128( (const __m128i*)&C[r]);
        popA = X_mm_popcnt_epi16(Ar);
        popB = X_mm_popcnt_epi16(Br);
        popC = X_mm_popcnt_epi16(Cr);
        temp = _mm_mullo_epi16(popA, popB);
        temp2 = _mm_mullo_epi16(temp, popC);
        Evec = _mm_add_epi16(Evec, temp2);
    }
    popcounts.v = _mm_madd_epi16(Evec, _mm_set1_epi16(1));
    Ediag = popcounts.u32[0] + popcounts.u32[1] + popcounts.u32[2] + popcounts.u32[3];
    // Sum over 8x8 blocks of index pairs: diagonal blocks once,
    // off-diagonal blocks doubled to cover both (i,j) and (j,i).
    Evec = _mm_setzero_si128();
    for (i = 0; i < N; i += 8) {
        Evec = _mm_add_epi32(Evec, getBrs(A, B, C, i, i));
        for (j = i+8; j < N; j += 8) {
            temp = getBrs(A, B, C, i, j);
            temp = _mm_add_epi32(temp, temp);
            Evec = _mm_add_epi32(Evec, temp);
        }
    }
    popcounts.v = Evec;
    E = popcounts.u32[0] + popcounts.u32[1] + popcounts.u32[2] + popcounts.u32[3];
    return (Ediag + E)/2;
}
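A quick sanity check on the final (Ediag + E)/2 (my reading of the code, not part of the original answer): the block loops accumulate p(r,s) = pa*pb*pc over every ordered pair of indices, so E = 2*sum over r&lt;s of p(r,s) + Ediag, since each off-diagonal pair is counted in both orders and each diagonal element once. Adding Ediag once more and halving yields the sum over r&lt;=s, which is exactly the OP's E.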