Speed up nested loops doing products of popcounts of intersections of pairs of elements within each of 3 arrays
Question
I have a deceptively innocent-looking function f, which is called in a tight loop and is causing a speed bottleneck. Any insights on how I can improve it?
#include <stdint.h>

#define N 48
// N = 47 is also relevant

int f(
    const uint16_t * __restrict A,
    const uint16_t * __restrict B,
    const uint16_t * __restrict C) {
    int E = 0;
    for (int r = 0; r < N; ++r) {
        for (int s = r; s < N; ++s) {
            int pa = __builtin_popcount(A[r] & A[s]);
            int pb = __builtin_popcount(B[r] & B[s]);
            int pc = __builtin_popcount(C[r] & C[s]);
            E += pa * pb * pc;
        }
    }
    return E;
}
What I've tried:
(a) preprocessing the A, B, C arrays in linear time and entering the double loop only with those triplets for which pa*pb*pc would be non-zero. However, since the distribution of bits in A, B, C is uniform, almost nothing is filtered out.
(b) blocking to minimize cache misses
(c) repacking A, B, and C into uint64_ts and reworking the popcount so that it processes 4 inputs at once (a rough sketch of this repacking follows below).
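For concreteness, here is a rough sketch of what (c) might look like; this is my reconstruction under stated assumptions (helper names pack4 and popcnt_per_lane16 are hypothetical), not the OP's actual code:

#include <stdint.h>

// Pack 4 consecutive uint16_t values into one uint64_t (element 0 in the low bits).
static inline uint64_t pack4(const uint16_t *p) {
    return (uint64_t)p[0] | ((uint64_t)p[1] << 16) |
           ((uint64_t)p[2] << 32) | ((uint64_t)p[3] << 48);
}

// SWAR popcount per 16-bit lane: afterwards, the low byte of each
// 16-bit lane holds that lane's bit count.
static inline uint64_t popcnt_per_lane16(uint64_t v) {
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (v + (v >> 8)) & 0x00FF00FF00FF00FFULL;
}

// Usage: four values of pa at once, for s..s+3 (lane k holds pa for s+k):
// uint64_t ar4 = 0x0001000100010001ULL * (uint64_t)A[r]; // broadcast A[r]
// uint64_t pa4 = popcnt_per_lane16(ar4 & pack4(&A[s]));
// int pa0 = pa4 & 0xFF, pa1 = (pa4 >> 16) & 0xFF; // etc.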
None of these helped. It seems that the sheer number of iterations (~1000) is the main issue. Is there something I'm missing here?
EDIT. I can assume that AVX2 is available on the target processor. Relevant compiler options currently include -O3, -mpopcnt, -funroll-loops, -mtune=native and -march=native.
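For reference, these options assemble into an invocation like the following (an assumed command line, since the build setup isn't shown; with GCC, -march=native already implies -mtune=native and enables popcnt on hardware that has it):

gcc -O3 -mpopcnt -funroll-loops -mtune=native -march=native -c f.c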
Answer 1
Score: 3
Doing the lower triangular instead of upper triangular matrices, and splitting off the diagonal, resulted in optimizations that sped it up by a factor of 2.7 for me (on Clang 14.0.3 with -O3, Apple M1). It enabled the use of vector (NEON) instructions and some loop unrolling:
int f2(const uint16_t * __restrict a,
       const uint16_t * __restrict b,
       const uint16_t * __restrict c) {
    int e = 0;
    // Strictly lower-triangular part (s < r).
    for (int r = 1; r < N; ++r)
        for (int s = 0; s < r; ++s) {
            int pa = __builtin_popcount(a[r] & a[s]);
            int pb = __builtin_popcount(b[r] & b[s]);
            int pc = __builtin_popcount(c[r] & c[s]);
            e += pa * pb * pc;
        }
    // Diagonal (r == s): x & x == x, so no AND is needed.
    for (int r = 0; r < N; ++r) {
        int pa = __builtin_popcount(a[r]);
        int pb = __builtin_popcount(b[r]);
        int pc = __builtin_popcount(c[r]);
        e += pa * pb * pc;
    }
    return e;
}
I also tried using a table lookup instead of the popcount instruction, but that was slower on the M1 as well as on an Intel i7 (with -march=native). Clang 11 on the i7 was not able to do significantly better with this version than with the original (just a 10% improvement).
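For what it's worth, one plausible shape of that table-lookup variant (my guess, since the answer doesn't show its code) is a 256-entry byte LUT applied to both halves of each uint16_t:

#include <stdint.h>

// Hypothetical table-lookup popcount: 256-entry LUT, one lookup per byte.
static uint8_t popcnt_lut[256];

static void init_popcnt_lut(void) {
    for (int i = 0; i < 256; ++i)
        popcnt_lut[i] = (uint8_t)__builtin_popcount((unsigned)i);
}

static inline int popcount16_lut(uint16_t x) {
    return popcnt_lut[x & 0xFF] + popcnt_lut[x >> 8];
}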
Answer 2
Score: 3
The OP didn't originally specify what hardware support could be assumed, and in this case the hardware makes a ton of difference. To support older hardware, here is a function which does the job requiring only SSE2, without hardware support for popcounts. The SSE2 version uses the popcnt8() function I found at https://stackoverflow.com/questions/17354971/fast-counting-the-number-of-set-bits-in-m128i-register:
#include <stdint.h>
#include <emmintrin.h> // SSE2 intrinsics

static inline __m128i popcnt8(__m128i x) {
    const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
    const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
    __m128i n;
    // Count bits in each 4-bit field:
    // x - (x>>1 & 7) - (x>>2 & 3) - (x>>3 & 1) per nibble.
    n = _mm_srli_epi64(x, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    // Sum the two nibble counts of each byte.
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
    x = _mm_and_si128(popcount_mask2, x);
    return x;
}
The following function is very slow compared to the OP's function except when hardware popcount is not available. On my Nehalem i5 CPU I measured the OP's function at around 8000-9000 rdtsc 'cycles' with hardware popcount enabled and around 40000 without. The SSE2 (gcc -msse2) function measured about 19000 cycles. It does work without requiring changes for different values of N.
int f6(
    const uint16_t * __restrict A,
    const uint16_t * __restrict B,
    const uint16_t * __restrict C) {
    int r, s, E = 0;
    __m128i ABCr, ABCs, ABC[N];
    union {
        __m128i v;
        uint8_t u[16];
    } popcounts;
    // Interleave the inputs: ABC[r] = (A[r], B[r], C[r], 0, ...).
    for (r = 0; r < N; ++r) {
        ABC[r] = _mm_setr_epi16(A[r], B[r], C[r], 0, 0, 0, 0, 0);
    }
    for (r = 0; r < N; ++r) {
        ABCr = ABC[r];
        // Diagonal term (s == r): A[r] & A[r] == A[r], etc.
        ABCs = popcnt8(ABCr);
        // Sum adjacent byte counts so that u[1], u[3], u[5] hold the
        // 16-bit popcounts for the A, B, C elements respectively.
        popcounts.v = _mm_bslli_si128(ABCs, 1);
        popcounts.v = _mm_add_epi8(popcounts.v, ABCs);
        E += (popcounts.u[1])*(popcounts.u[3])*(popcounts.u[5]);
        for (s = r+1; s < N; s++) {
            ABCs = ABC[s];
            ABCs = _mm_and_si128(ABCs, ABCr);
            ABCs = popcnt8(ABCs);
            popcounts.v = _mm_bslli_si128(ABCs, 1);
            popcounts.v = _mm_add_epi8(popcounts.v, ABCs);
            E += (popcounts.u[1])*(popcounts.u[3])*(popcounts.u[5]);
        }
    }
    return E;
}
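For reference, here is a minimal sketch of the kind of rdtsc timing behind the cycle counts above (an assumption; the actual benchmark harness isn't shown, and serialization via cpuid/rdtscp is omitted for brevity):

#include <stdint.h>
#include <x86intrin.h>

// Time one call of a candidate function in rdtsc 'cycles'.
static uint64_t time_call(int (*fn)(const uint16_t *, const uint16_t *, const uint16_t *),
                          const uint16_t *A, const uint16_t *B, const uint16_t *C,
                          int *result) {
    uint64_t t0 = __rdtsc();
    *result = fn(A, B, C);
    return __rdtsc() - t0;
}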
While this function's performance won't impress anyone, I found writing and benchmarking it interesting, and thought some others might find it interesting too. Since SSE2 support is quite commonly assumed as a minimum, and coding for multiple architectures can be very complex, I thought there was some value in sharing what I did. If the OP had asked for widely compatible code assuming no more than SSE2 support, then this could have been a worthwhile improvement, I think.
EDIT:
I made a faster SSE2 version of the function by reordering the calculation. It runs slightly faster than the hardware-popcount versions, at about 5900 cycles. I know the OP wanted AVX2, but I think this approach is of interest if someone wants to create a manually vectorised version in AVX2 or AVX512.
SSE2 + AVX512:
_mm_popcnt_epi16 is an AVX512 intrinsic which, if substituted in place of X_mm_popcnt_epi16, should give a decent performance gain I think, but I don't have any AVX512-supporting hardware to provide benchmarks.
// Emulated per-16-bit-lane popcount using SSE2 only.
static inline __m128i X_mm_popcnt_epi16(__m128i v) {
    // Taken from https://stackoverflow.com/questions/6431692/tweaking-mits-bitcount-algorithm-to-count-words-in-parallel
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x5555)),
                      _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi16(0x5555)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x3333)),
                      _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi16(0x3333)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x0f0f)),
                      _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi16(0x0f0f)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x00ff)), _mm_srli_epi16(v, 8));
    return v;
}
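A small sketch of how that AVX512 substitution could be wired up at compile time (my addition, untested, since I have no AVX512-BITALG/AVX512VL hardware):

#if defined(__AVX512BITALG__) && defined(__AVX512VL__)
#define POP16(v) _mm_popcnt_epi16(v)   // native per-lane popcount
#else
#define POP16(v) X_mm_popcnt_epi16(v)  // SSE2 emulation above
#endif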
static inline __m128i getBrs(
    const uint16_t * __restrict A,
    const uint16_t * __restrict B,
    const uint16_t * __restrict C, uint32_t r, uint32_t s) {
    uint32_t i;
    __m128i Evec, popA, popB, popC, temp, temp2, Ar, Br, Cr, As, Bs, Cs;
    Evec = _mm_setzero_si128();
    /* getBrs does something like:
       uint32_t EB = 0;
       uint32_t j;
       for (i = r; i < r+8; i++) {
           for (j = s; j < s+8; j++) {
               EB += __builtin_popcount(A[i] & A[j])
                   * __builtin_popcount(B[i] & B[j])
                   * __builtin_popcount(C[i] & C[j]);
           }
       }
    */
    Ar = _mm_loadu_si128( (const __m128i*)&A[r]);
    Br = _mm_loadu_si128( (const __m128i*)&B[r]);
    Cr = _mm_loadu_si128( (const __m128i*)&C[r]);
    As = _mm_loadu_si128( (const __m128i*)&A[s]);
    Bs = _mm_loadu_si128( (const __m128i*)&B[s]);
    Cs = _mm_loadu_si128( (const __m128i*)&C[s]);
    for (i=0; i<8; i++) {
        // Rotate the s-block by one 16-bit element per iteration, so
        // that over 8 iterations each lane k pairs A[r+k] with every
        // element of the s-block exactly once.
        As = _mm_bsrli_si128(As, 2);
        As = _mm_insert_epi16(As, A[s], 7);
        Bs = _mm_bsrli_si128(Bs, 2);
        Bs = _mm_insert_epi16(Bs, B[s], 7);
        Cs = _mm_bsrli_si128(Cs, 2);
        Cs = _mm_insert_epi16(Cs, C[s], 7);
        temp = _mm_and_si128(Ar, As);
        popA = X_mm_popcnt_epi16(temp);
        temp = _mm_and_si128(Br, Bs);
        popB = X_mm_popcnt_epi16(temp);
        temp = _mm_and_si128(Cr, Cs);
        popC = X_mm_popcnt_epi16(temp);
        // Per-lane product pa*pb*pc, accumulated across iterations.
        temp = _mm_mullo_epi16(popA, popB);
        temp2 = _mm_mullo_epi16(temp, popC);
        Evec = _mm_add_epi16(Evec, temp2);
        s++;
    }
    // Widen the eight 16-bit partial sums to four 32-bit sums.
    return _mm_madd_epi16(Evec, _mm_set1_epi16(1));
}
int f8(
    const uint16_t * __restrict A,
    const uint16_t * __restrict B,
    const uint16_t * __restrict C) {
    uint32_t r, i, j, Ediag = 0, E = 0;
    __m128i Evec, popA, popB, popC, temp, temp2, Ar, Br, Cr;
    Evec = _mm_setzero_si128();
    union {
        __m128i v;
        uint32_t u32[4];
    } popcounts;
    /* The diagonal, i.e.
       for (i = 0; i < N; i++) {
           Ediag += __builtin_popcount(A[i] & A[i])
                  * __builtin_popcount(B[i] & B[i])
                  * __builtin_popcount(C[i] & C[i]);
       }
    */
    for (r = 0; r < N; r += 8) { // assumes N is a multiple of 8
        Ar = _mm_loadu_si128( (const __m128i*)&A[r]);
        Br = _mm_loadu_si128( (const __m128i*)&B[r]);
        Cr = _mm_loadu_si128( (const __m128i*)&C[r]);
        popA = X_mm_popcnt_epi16(Ar);
        popB = X_mm_popcnt_epi16(Br);
        popC = X_mm_popcnt_epi16(Cr);
        temp = _mm_mullo_epi16(popA, popB);
        temp2 = _mm_mullo_epi16(temp, popC);
        Evec = _mm_add_epi16(Evec, temp2);
    }
    popcounts.v = _mm_madd_epi16(Evec, _mm_set1_epi16(1));
    Ediag = popcounts.u32[0] + popcounts.u32[1] + popcounts.u32[2] + popcounts.u32[3];
    // Sum over 8x8 blocks of index pairs: diagonal blocks once,
    // off-diagonal blocks doubled to cover both (i,j) and (j,i).
    Evec = _mm_setzero_si128();
    for (i = 0; i < N; i += 8) {
        Evec = _mm_add_epi32(Evec, getBrs(A, B, C, i, i));
        for (j = i+8; j < N; j += 8) {
            temp = getBrs(A, B, C, i, j);
            temp = _mm_add_epi32(temp, temp);
            Evec = _mm_add_epi32(Evec, temp);
        }
    }
    popcounts.v = Evec;
    E = popcounts.u32[0] + popcounts.u32[1] + popcounts.u32[2] + popcounts.u32[3];
    return (Ediag + E)/2;
}
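A quick sanity check on the final (Ediag + E)/2 (my reading of the code, not part of the original answer): the block loops accumulate p(r,s) = pa*pb*pc over every ordered pair of indices, so E = 2*sum over r&lt;s of p(r,s) + Ediag, since each off-diagonal pair is counted in both orders and each diagonal element once. Adding Ediag once more and halving yields the sum over r&lt;=s, which is exactly the OP's E.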