加速嵌套循环,执行三个数组中每对元素交集的人口统计乘积。

huangapple go评论85阅读模式
英文:

Speed up nested loops doing products of popcounts of intersections of pairs of elements within each of 3 arrays

问题

我有一个外表看似无害的函数 f,它被紧密循环调用并导致速度瓶颈。对于如何改进它,有什么见解吗?

#define N 48
// N = 47也是相关的
int f(
const uint16_t * __restrict A,
const uint16_t * __restrict B,
const uint16_t * __restrict C) {
	int E = 0;
	for (int r = 0; r < N; ++r) {
		for (int s = r; s < N; ++s) {
			int pa = __builtin_popcount(A[r] & A
展开收缩
);
int pb = __builtin_popcount(B[r] & B
展开收缩
);
int pc = __builtin_popcount(C[r] & C
展开收缩
);
E += pa*pb*pc; } } return E; }

我尝试过:

(a) 在线性时间内预处理 ABC 数组,并仅对 pa*pb*pc 非零的三元组进入双重循环。然而,由于 ABC 中位的分布是均匀的,几乎没有什么被过滤掉。

(b) 阻塞以最小化缓存未命中

(c) 将 ABC 重新打包为 uint64_t,并重新设计 popcount,使其一次处理4个输入

这些都没有帮助。看起来迭代次数(约1000次)是主要问题。我是不是漏掉了什么?

编辑。我可以假设目标处理器上有 AVX2。目前的相关编译器选项包括 -O3-mpopcnt-funroll-loops-mtune=native-march=native

英文:

I have a deceptively innocent-looking function f which is called in a tight loop and is causing a speed bottleneck. Any insights on how can I improve it?

#define N 48
// N = 47 is also relevant
int f(
const uint16_t * __restrict A,
const uint16_t * __restrict B,
const uint16_t * __restrict C) {
	int E = 0;
	for (int r = 0; r &lt; N; ++r) {
		for (int s = r; s &lt; N; ++s) {
			int pa = __builtin_popcount(A[r] &amp; A
展开收缩
); int pb = __builtin_popcount(B[r] &amp; B
展开收缩
); int pc = __builtin_popcount(C[r] &amp; C
展开收缩
); E += pa*pb*pc; } } return E; }

What I've tried:

(a) preprocessing the A, B, C arrays in linear time and entering the double loop only with those triplets for which pa*pb*pc would be non-zero. However, since the distribution of bits in A, B, C is uniform, almost nothing is filtered out.

(b) blocking to minimize cache misses

(c) repacking the A, B and C in uint64_ts and reworking popcount so that it processes 4 inputs at once

None of these helped. It seems that the sheer number of iterations (~1000) is the main issue. Is it something I'm missing here?

EDIT. I can assume that AVX2 is available on the target processor. Relevant compiler options currently include -O3, -mpopcnt, -funroll-loops, -mtune=native and -march=native

答案1

得分: 3

将下三角矩阵替代上三角矩阵,并分离对角线,导致了优化,使其在我的环境下加速了2.7倍(在Clang 14.0.3上使用-O3,Apple M1)。这使得可以使用矢量(NEON)指令和一些循环展开:

int f2(const uint16_t * __restrict a,
       const uint16_t * __restrict b,
       const uint16_t * __restrict c) {
    int e = 0;
    for (int r = 1; r < N; ++r)
        for (int s = 0; s < r; ++s) {
            int pa = __builtin_popcount(a[r] & a
展开收缩
);
int pb = __builtin_popcount(b[r] & b
展开收缩
);
int pc = __builtin_popcount(c[r] & c
展开收缩
);
e += pa * pb * pc; } for (int r = 0; r < N; ++r) { int pa = __builtin_popcount(a[r]); int pb = __builtin_popcount(b[r]); int pc = __builtin_popcount(c[r]); e += pa * pb * pc; } return e; }

我还尝试过使用表查找来代替popcount指令,但在M1以及Intel i7上(使用-march=native)速度更慢。在i7上,Clang 11与这个版本相比并没有表现出显著的改进(只有10%的提升)。

英文:

Doing the lower triangular instead of upper triangular matrices, and splitting off the diagonal, resulted in optimizations that sped it up by a factor of 2.7 for me (on Clang 14.0.3 with -O3, Apple M1). It enabled the use of vector (NEON) instructions and some loop unrolling:

int f2(const uint16_t * __restrict a,
       const uint16_t * __restrict b,
       const uint16_t * __restrict c) {
    int e = 0;
    for (int r = 1; r &lt; N; ++r)
        for (int s = 0; s &lt; r; ++s) {
            int pa = __builtin_popcount(a[r] &amp; a
展开收缩
);
int pb = __builtin_popcount(b[r] &amp; b
展开收缩
);
int pc = __builtin_popcount(c[r] &amp; c
展开收缩
);
e += pa * pb * pc; } for (int r = 0; r &lt; N; ++r) { int pa = __builtin_popcount(a[r]); int pb = __builtin_popcount(b[r]); int pc = __builtin_popcount(c[r]); e += pa * pb * pc; } return e; }

I also tried using a table lookup instead of the popcount instruction, but that was slower on the M1 as well as on an Intel i7 (with -march=native). Clang 11 on the i7 was not able to do significantly better with this version than with the original (just a 10% improvement).

答案2

得分: 3

未指定原始硬件支持,对于旧硬件的支持在这种情况下很重要。为了支持旧硬件,这里有一个函数,它只需要 SSE2 而不需要 popcounts 硬件支持。SSE2 版本使用了我在 https://stackoverflow.com/questions/17354971/fast-counting-the-number-of-set-bits-in-m128i-register 中找到的 popcnt8() 函数...

以下函数与 OP 的函数相比非常慢,除非硬件 popcount 不可用。在我的 Nehalem i5 CPU 上,我测得 OP 的函数在启用硬件 popcount 时大约为 8000-9000 个 rdtsc 'cycles',而在没有启用硬件 popcount 时大约为 40000 个。这个使用 SSE2(gcc -msse2)的函数大约需要 19000 个周期。它可以在不更改 N 的情况下运行...

虽然这个函数的性能不会让任何人印象深刻,但我发现编写和基准测试它很有趣,我想有些人可能也会觉得很有趣。由于普遍假定至少支持 SSE2,并且为多个架构编码可能非常复杂,我认为分享我做的事情是有价值的。如果 OP 要求广泛兼容的代码,并假设不超过 SSE2 支持,那么这可能是一个值得改进的方法。

编辑:

我通过重新排列计算制作了一个更快的 SSE2 版本的函数。它运行速度略快于启用硬件 popcount 的函数,大约需要 5900 个周期。我知道 OP 想要 AVX2,但我认为这种方法对于想要在 AVX2 或 AVX512 中创建手动矢量化版本的人是有趣的。

SSE2 + AVX512:

_mm_popcnt_epi16 是 AVX512 内置函数,如果用它替换 X_mm_popcnt_epi16 应该能获得不错的性能提升,但我没有任何支持 AVX512 的硬件来提供基准。

以下是 X_mm_popcnt_epi16 函数...

以下是 getBrs 函数...

以下是 f8 函数...

英文:

The OP didn't originally specify what hardware support could be assumed and in this case, the hardware makes a ton of difference. To support older hardware, here is a function which does the job requiring SSE2 only without requiring hardware support for popcounts. The SSE2 version uses the popcnt8() function I found at https://stackoverflow.com/questions/17354971/fast-counting-the-number-of-set-bits-in-m128i-register...

static inline __m128i popcnt8(__m128i x) {
    const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
    const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
    __m128i n;
    // Count bits in each 4-bit field.
    n = _mm_srli_epi64(x, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
    x = _mm_and_si128(popcount_mask2, x);
    return x;
}

The following function is very slow compared to the OP's function except when hardware popcount is not available. On my Nehalem i5 CPU I measured the OP's function at around 8000-9000 rdtsc 'cycles' with hardware popcount enabled and around 40000 without. The SSE2 (gcc -msse2) function measured about 19000 cycles. It does work without requiring changes for different values of N.

int f6(
const uint16_t * __restrict A,
const uint16_t * __restrict B,
const uint16_t * __restrict C) {
    int r, s, E = 0;    
    __m128i ABCr, ABCs, ABC[N];
    union {
      __m128i v;
      uint8_t u[16];
    } popcounts;
    for (r = 0; r &lt; N; ++r) {
      ABC[r] = _mm_setr_epi16(A[r], B[r], C[r], 0, 0, 0, 0, 0);
    }
    for (r = 0; r &lt; N; ++r) {
        ABCr = ABC[r];
        ABCs = popcnt8(ABCr);
        popcounts.v = _mm_bslli_si128(ABCs, 1);
        popcounts.v = _mm_add_epi8(popcounts.v, ABCs);
        E += (popcounts.u[1])*(popcounts.u[3])*(popcounts.u[5]);
        for (s = r+1; s &lt; N; s++) {
            ABCs = ABC
展开收缩
; ABCs = _mm_and_si128(ABCs, ABCr); ABCs = popcnt8(ABCs); popcounts.v = _mm_bslli_si128(ABCs, 1); popcounts.v = _mm_add_epi8(popcounts.v, ABCs); E += (popcounts.u[1])*(popcounts.u[3])*(popcounts.u[5]); } } return E; }

While this function's performance won't impress anyone, I found writing and benchmarking it interesting and thought some others might find it interesting too. Since it's quite common to assume SSE2 support as a minimum and coding for multiple architectures can be very complex, I thought there was some value in sharing what I did. If the OP had asked for widely compatible code and assuming not more than SSE2 support, then this could have been a worthwhile improvement I think.

EDIT:

I made a faster SSE2 version of the function by reordering the calculation. It runs slightly faster than the hardware popcount enabled functions at about 5900 cycles. I know the OP wanted AVX2 but I think this approach is of interest if someone wants to create a manually vectorised version in AVX2 or AVX512.

SSE2 + AVX512:

_mm_popcnt_epi16 is an AVX512 intrinsic which if substituted in place of X_mm_popcnt_epi16 should give a decent performance gain I think, but I don't have any AVX512 supporting hardware to provide benchmarks.

static inline __m128i X_mm_popcnt_epi16(__m128i v) {
// Taken from https://stackoverflow.com/questions/6431692/tweaking-mits-bitcount-algorithm-to-count-words-in-parallel
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x5555)), _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi16(0x5555)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x3333)), _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi16(0x3333)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x0f0f)), _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi16(0x0f0f)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x00ff)), _mm_srli_epi16(v, 8));
    return v;
}

static inline __m128i getBrs(
const uint16_t * __restrict A,
const uint16_t * __restrict B,
const uint16_t * __restrict C, uint32_t r, uint32_t s) {
  uint32_t i;  
  __m128i Evec, popA, popB, popC, temp, temp2, Ar, Br, Cr, As, Bs, Cs; 
  Evec = _mm_setzero_si128();
  /* getBrs does sth like...
  uint32_t EB = 0;  
  uint32_t j;  
  for (i = r; i&lt;r+8; i++) {
    for (j = s; j&lt;s+8; j++) {
      EB += __builtin_popcount(A[i] &amp; A[j])*__builtin_popcount(B[i] &amp; B[j])*__builtin_popcount(C[i] &amp; C[j]);
    }
  }
  */  
  Ar = _mm_loadu_si128( (const __m128i*)&amp;A[r]);
  Br = _mm_loadu_si128( (const __m128i*)&amp;B[r]);
  Cr = _mm_loadu_si128( (const __m128i*)&amp;C[r]);
  As = _mm_loadu_si128( (const __m128i*)&amp;A
展开收缩
); Bs = _mm_loadu_si128( (const __m128i*)&amp;B
展开收缩
); Cs = _mm_loadu_si128( (const __m128i*)&amp;C
展开收缩
); for (i=0; i&lt;8; i++) { As = _mm_bsrli_si128(As, 2); As = _mm_insert_epi16(As, A
展开收缩
, 7); Bs = _mm_bsrli_si128(Bs, 2); Bs = _mm_insert_epi16(Bs, B
展开收缩
, 7); Cs = _mm_bsrli_si128(Cs, 2); Cs = _mm_insert_epi16(Cs, C
展开收缩
, 7); temp = _mm_and_si128(Ar, As); popA = X_mm_popcnt_epi16(temp); temp = _mm_and_si128(Br, Bs); popB = X_mm_popcnt_epi16(temp); temp = _mm_and_si128(Cr, Cs); popC = X_mm_popcnt_epi16(temp); temp = _mm_mullo_epi16(popA, popB); temp2 = _mm_mullo_epi16(temp, popC); Evec = _mm_add_epi16(Evec, temp2); s++; } return _mm_madd_epi16(Evec, _mm_set1_epi16(1)); } int f8( const uint16_t * __restrict A, const uint16_t * __restrict B, const uint16_t * __restrict C) { uint32_t r, i,j, Ediag = 0, E = 0; __m128i Evec, popA, popB, popC, temp, temp2, Ar, Br, Cr; Evec = _mm_setzero_si128(); union { __m128i v; uint32_t u32[4]; } popcounts; /* for (i = 0; i&lt;N; i++) { Ediag += __builtin_popcount(A[i] &amp; A[i])*__builtin_popcount(B[i] &amp; B[i])*__builtin_popcount(C[i] &amp; C[i]); } */ for (r = 0; r &lt; 48; r+=8) { Ar = _mm_loadu_si128( (const __m128i*)&amp;A[r]); Br = _mm_loadu_si128( (const __m128i*)&amp;B[r]); Cr = _mm_loadu_si128( (const __m128i*)&amp;C[r]); popA = X_mm_popcnt_epi16(Ar); popB = X_mm_popcnt_epi16(Br); popC = X_mm_popcnt_epi16(Cr); temp = _mm_mullo_epi16(popA, popB); temp2 = _mm_mullo_epi16(temp, popC); Evec = _mm_add_epi16(Evec, temp2); } popcounts.v = _mm_madd_epi16(Evec, _mm_set1_epi16(1));; Ediag = popcounts.u32[0] + popcounts.u32[1] + popcounts.u32[2] + popcounts.u32[3]; Evec = _mm_setzero_si128(); for (i = 0; i&lt;N; i+=8) { Evec = _mm_add_epi32(Evec, getBrs(A,B,C,i,i)); for (j = i+8; j&lt;N; j+=8) { temp = getBrs(A,B,C,i,j); temp = _mm_add_epi32(temp, temp); Evec = _mm_add_epi32(Evec, temp); } } popcounts.v = Evec; E = popcounts.u32[0] + popcounts.u32[1] + popcounts.u32[2] + popcounts.u32[3]; return (Ediag + E)/2; }

huangapple
  • 本文由 发表于 2023年6月19日 22:01:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/76507381.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定