What is the fastest way to sum the dot product of vectors of unsigned 64-bit integers using a 192/256-bit integer?

English:

Fastest way to sum dot product of vector of unsigned 64 bit integers using 192/256 bit integer?

Question

I need to compute the dot product of two vectors uint64_t a[N], b[N]; (N <= 60) containing 64-bit unsigned integers. Precisely, this loop:

unsigned __int128 ans = 0;
for(int i=0; i<N; ++i)
    ans += (unsigned __int128) a[i] * (unsigned __int128) b[i];

ans will overflow, so the result must be kept in a wide integer such as a 256-bit one. But since N <= 60, we can even keep the result in a 160-bit (64*2 + 32) integer.

Slow solutions

  1. Manually handling overflows:
unsigned __int128 ans = 0;
uint64_t overflow = 0;
for(int i=0; i<N; ++i){
    auto temp = ans;
    ans += (unsigned __int128) a[i] * (unsigned __int128) b[i];
    if(ans < temp) overflow++;
}

This is slow because the added if statement makes the program about 2.2 times slower.

  2. Using a library like boost::multiprecision::uint256_t or GMP.

Probably fast solution
If we resort to assembly programming on a 64-bit machine, the addition can be done with three 64-bit registers, using an add followed by adc and adc from the low bits to the high bits.

But I don't want to resort to assembly, because it would be hard to maintain and not portable.

My aim is to make it both fast and maintainable.

EDIT: Peter points out in his comment that clang supports my idea of using adc, while gcc still uses a branch (manual handling of the overflow).

English:

I need to calculate the dot product of two vectors: uint64_t a[N], b[N]; (N<=60) containing 64-bit unsigned integers. Precisely this loop:

unsigned __int128 ans = 0;
for(int i=0;i<N;++i)
    ans += (unsigned __int128) a[i] * (unsigned __int128) b[i];

ans will overflow and thus the result must be kept in a wide integer like 256 bits. But since N<=60 we can keep the result even in a 160-bit (64*2 + 32) integer.
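To make the bound explicit: each product is at most (2^64 - 1)^2 < 2^128, so it fits in 128 bits, and a sum of N <= 60 such products is below 60 * 2^128 < 2^6 * 2^128 = 2^134. So the carry count above bit 128 needs only 6 bits, and 128 + 32 = 160 is just a convenient round-up.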

Slow Solutions:

  1. Manually handling overflows:
unsigned __int128 ans = 0;
uint64_t overflow = 0;
for(int i=0;i<N;++i){
    auto temp = ans;
    ans += (unsigned __int128) a[i] * (unsigned __int128) b[i];
    if(ans<temp) overflow++;
}

This is slow because the added if slows down the program ~2.2 times.

  2. Using a library like boost::multiprecision::uint256_t or GMP.

Probably Fast Solution:
If we resort to assembly programming on a 64-bit machine then the addition can be performed using 3 64-bit registers by using add followed by adc and adc from lower to higher bits.
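For illustration, that add/adc/adc chain would look roughly like the sketch below as GNU extended inline asm (x86-64 only; the variable names are mine, reusing a, b, N from above):

uint64_t lo = 0, hi = 0, carry = 0;              // 192-bit accumulator in three registers
for(int i=0; i<N; ++i){
    unsigned __int128 prod = (unsigned __int128) a[i] * b[i];
    uint64_t plo = (uint64_t) prod, phi = (uint64_t) (prod >> 64);
    asm("add %[plo], %[lo]\n\t"                  // low 64 bits, sets CF
        "adc %[phi], %[hi]\n\t"                  // high 64 bits plus carry
        "adc $0, %[carry]"                       // fold the final carry into the third register
        : [lo] "+r"(lo), [hi] "+r"(hi), [carry] "+r"(carry)
        : [plo] "r"(plo), [phi] "r"(phi)
        : "cc");
}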

But I don't want to resort to ASM because it will be hard to maintain and it will not be portable.

My aim is to make it fast and maintainable.

EDIT: Peter points out in his comment that clang supports my idea for using adc while gcc still uses a branch (manual handling of overflow).

Answer 1

Score: 2

Translated:

You absolutely do not need a 256-bit accumulator. The sum of N integers can produce at most N carries, so a 64-bit overflow carry counter is enough even for the dot product of vectors that are 2^64 elements long. In other words, the total width of your sum of 128-bit products only needs to be 192 = 64 * 3 or 160 = 128 + 32 bits, i.e. just one extra register.

Yes, the best x86-64 assembly for this problem is a mov load from one array, then mul or mulx with a memory source operand from the other array, then add + adc to accumulate into ans, and finally adc reg, 0 to accumulate the carries.

(Perhaps with some loop unrolling, possibly with 2 or more accumulators (groups of 3 registers). If unrolling for Intel CPUs, you may want to avoid an indexed addressing mode for the mul memory source so the load can micro-fuse. Unfortunately, GCC / clang will not generate a loop that indexes one array relative to the other to minimize loop overhead (1 pointer increment) while also avoiding an indexed addressing mode for one of the arrays, so you are unlikely to get optimal assembly from a compiler. But clang does quite well.)

clang 9.0 generates that assembly for your code with -O3 -march=broadwell -fno-unroll-loops. Clang can recognize the unsigned a < b idiom for getting the carry of unsigned a += b, i.e. it uses add reg, reg / adc reg, reg / adc reg, 0. (Unfortunately, clang's loop unrolling uses mov / setc / movzx / add instead of adc reg, 0, destroying the benefit of unrolling! That is a missed-optimization bug that should be reported.)

GCC actually branches on your if(), which you can fix by writing it branchlessly, e.g. overflow += sum < prod;. But that does not fix GCC's other major missed optimization: it actually does a 128-bit comparison of sum < prod with cmp/sbb instead of just using the CF output of the last adc. GCC knows how to optimize this for uint64_t (https://stackoverflow.com/questions/6659414/efficient-128-bit-addition-using-carry-flag), but apparently not for the carry-out of a __uint128_t.

It is probably not possible to hand-hold GCC into avoiding this missed optimization by changing the source; it is a bug the GCC developers will have to fix in GCC. (So you should report it as a missed-optimization bug on their bugzilla: https://gcc.gnu.org/bugzilla/; include a link to this Q&A.) Trying to handle uint64_t chunks entirely by hand would likely be even worse: the middle chunk has a carry-in and a carry-out. Writing that correctly in C is hard, and GCC will not optimize it to adc.

Even using overflow += __builtin_add_overflow(sum, prod, &sum); does not fully help; we still get the same cmp/sbb/adc code generation from GCC. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html says: "The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc." But apparently it just does not know how for 128-bit integers.

In short, there is no need for SIMD, because the widest element size is 64 bits and even AVX512 only has a 64x64 => 128-bit multiply. Unless you can convert to double, in which case AVX512 could run this very much faster.

English:

You definitely do not need a 256-bit accumulator. The sum of N integers can only produce at most N carry-outs, so a 64-bit overflow carry counter is sufficient for a dot product of vectors that are 2^64 elements long. i.e. the total width of your sum of 128-bit products only needs to be 192 = 64*3 or 160 = 128 + 32 bits. i.e. one extra register.
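As a concrete (hypothetical) helper showing what that means: the full 192-bit result is just overflow * 2^128 + sum, i.e. three 64-bit limbs.

static inline void split_192(unsigned __int128 sum, uint64_t overflow, uint64_t limbs[3])
{
    limbs[0] = (uint64_t) sum;            // bits 0..63
    limbs[1] = (uint64_t) (sum >> 64);    // bits 64..127
    limbs[2] = overflow;                  // bits 128..191: the carry count, at most N
}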


Yes, the optimal x86-64 asm for this is mov-load from one array, mul or mulx with a memory source from the other array, then add + adc into ans, and adc reg, 0 to accumulate the carry-out.

(Perhaps with some loop unrolling, possibly with 2 or more accumulators (sets of 3 registers). If unrolling for Intel CPUs, probably avoiding an indexed addressing mode for the mul memory source, so it can micro-fuse the load. GCC / clang unfortunately don't make loops that index one array relative to the other to minimize loop overhead (1 pointer increment) while also avoiding an indexed addressing mode for one of the arrays, so you aren't going to get optimal asm from a compiler. But clang is pretty good.)

clang 9.0 generates that for your code with -O3 -march=broadwell -fno-unroll-loops. Clang recognizes the usual a<b idiom for carry-out of unsigned a+=b even with 128-bit integers, leading to add reg,reg / adc reg,reg / adc reg,0. (Unfortunately clang's loop-unrolling de-optimizes to mov / setc / movzx / add instead of adc reg,0, defeating the benefit of loop unrolling! That's a missed-optimization bug that should get reported.)
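In source terms that idiom is just the following two lines (the same shape as the commented-out lines in the code block further down):

sum += prod;                 // 128-bit add: add + adc
overflow += (sum < prod);    // carry-out of that add: adc reg,0 with clang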

GCC actually branches on your if(), which you can fix by writing it branchlessly as overflow += sum<prod;. But that doesn't help with GCC's other major missed optimization: actually doing a 128-bit compare with cmp/sbb for sum < prod instead of just using the CF output of the last adc. GCC knows how to optimize that for uint64_t (https://stackoverflow.com/questions/6659414/efficient-128-bit-addition-using-carry-flag) but apparently not for carry-out from __uint128_t.

It's probably not possible to hand-hold gcc into avoiding that missed-optimization by changing your source; that's a bug that GCC developers will have to fix in GCC. (So you should report it as a missed-optimization bug on their bugzilla: https://gcc.gnu.org/bugzilla/; include a link to this Q&A.) Trying to go full-manual with uint64_t chunks yourself would be even worse: the middle chunk has carry-in and carry-out. That's hard to write correctly in C, and GCC won't optimize it to adc.
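For reference, a full-manual three-limb version would look something like this sketch (my code, and per the above GCC still won't turn it into clean adc):

uint64_t lo = 0, mid = 0, hi = 0;
for(int i=0; i<N; ++i){
    unsigned __int128 prod = (unsigned __int128) a[i] * b[i];
    uint64_t plo = (uint64_t) prod, phi = (uint64_t) (prod >> 64);
    unsigned c0 = __builtin_add_overflow(lo,  plo, &lo);             // carry out of the low limb
    unsigned c1 = __builtin_add_overflow(mid, phi, &mid);            // high half of the product
    unsigned c2 = __builtin_add_overflow(mid, (uint64_t) c0, &mid);  // carry-in to the middle limb
    hi += c1 + c2;             // at most one of c1/c2 can be set in a given iteration
}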

Even using overflow += __builtin_add_overflow(sum, prod, &sum); doesn't fully help, giving us the same cmp/sbb/adc GCC code-gen! https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html says "The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc." but apparently it just doesn't know how for 128-bit integers.

#include <stdint.h>

static const int N = 2048;
using u128 = unsigned __int128;
u128 dp(uint64_t a[], uint64_t b[], uint64_t *overflow_out)
{
    u128 sum = 0;
    uint64_t overflow = 0;
    for(int i=0;i<N;++i){
        u128 prod = a[i] * (unsigned __int128) b[i];
        //sum += prod;
        //overflow += sum<prod;       // gcc uses adc after a 3x mov / cmp / sbb; clang is still good
        overflow += __builtin_add_overflow(sum, prod, &sum);  // gcc less bad, clang still good
    }

    *overflow_out = overflow;
    return sum;
}

Clang compiles this nicely (Godbolt):

# clang9.0 -O3 -march=broadwell -fno-unroll-loops  -Wall -Wextra
     ... zero some regs, copy RDX to R8 (because MUL uses that implicitly)

.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     rax, qword ptr [rsi + 8*rcx]
        mul     qword ptr [rdi + 8*rcx]
        add     r10, rax
        adc     r9, rdx                 # sum += prod
        adc     r11, 0                  # overflow += carry_out
        inc     rcx
        cmp     rcx, 2048
        jne     .LBB0_1               # }while(i < N);

        mov     qword ptr [r8], r11   # store the carry count
        mov     rax, r10
        mov     rdx, r9               # return sum in RDX:RAX
        ret

Note that you don't need ADOX / ADCX or MULX to make this efficient. Out-of-order exec can interleave multiple short FLAGS dependency chains.

You could use another 3 registers for another 192-bit accumulator (sum and carryout), but that's probably unnecessary.
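If you did want to try that from C, a hypothetical two-accumulator shape (assuming N is even; whether the compiler really keeps two independent add/adc/adc chains is not guaranteed) could be:

u128 sum0 = 0, sum1 = 0;
uint64_t ovf0 = 0, ovf1 = 0;
for(int i=0; i<N; i+=2){
    ovf0 += __builtin_add_overflow(sum0, a[i]   * (u128) b[i],   &sum0);
    ovf1 += __builtin_add_overflow(sum1, a[i+1] * (u128) b[i+1], &sum1);
}
uint64_t overflow = ovf0 + ovf1 + __builtin_add_overflow(sum0, sum1, &sum0);  // merging may carry once more
u128 sum = sum0;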

The clang loop above looks like 9 uops for the front-end (assuming mul can't keep an indexed addressing mode micro-fused (unlamination), but that cmp/jcc will macro-fuse) so it will run at best 1 multiply per 2.25 clock cycles. Haswell and earlier run adc reg,reg as 2 uops, but Broadwell improved that 3-input instruction (FLAGS + 2 regs) to 1 uop. adc reg,0 is special-cased as 1 uop on SnB-family.

The loop-carried dependency chains are only 1 cycle long each: add into r10, adc into r9, and adc into r11. The FLAGS input for those ADC instructions are part of a short non-loop-carried dependency chain that out-of-order execution will eat for breakfast.

Ice Lake with its 5-wide pipeline will run this somewhat faster, like maybe 1.8 cycles per iteration, assuming it still unlaminates mul's memory operand.

Zen has a pipeline that's 5 instructions wide, or 6 uops wide if it includes any multi-uop instructions. So it might run this at 2 cycles per iteration, bottlenecked on its 2c throughput for full multiply. (Normal imul r64, r64 non-widening multiply is 1/clock, 3c latency on Zen, same as Intel.)

Zen 2 speeds up mul m64 and mulx to 1/clock (https://www.uops.info/html-tp/ZEN2/MUL_M64-Measurements.html), and might be the fastest CPU for this loop on a clock-for-clock basis.


With some unrolling and indexing one array relative to the other, a hand-optimized version of this could approach a cost per multiply of 6 uops = mov-load (1 uop) + mul mem (2 uops) + add + 2x adc (3 uops), so on Broadwell about 1.5 cycles / multiply.

It's still going to bottleneck on the front-end, and minimum loop overhead of 2 uops (pointer increment, and cmp/jcc) means an unroll by 4 could give us 4 multiplies per 6.5 cycles instead of per 6 cycles, or 92% of the fully-unrolled theoretical max. (Assuming no stalls for memory or sub-optimal scheduling, or at least that out-of-order exec is sufficient to absorb that and not stall the front-end. The back-end should be able to keep up with the front-end here, on Haswell and later, so the ROB should have room to absorb some minor bubbles, but probably not L3 or TLB misses.)
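Spelling out the arithmetic behind those numbers (using the 6-uop per-multiply cost above and a 4-wide front-end):

unrolled by 4:   4 * 6 uops + 2 uops loop overhead = 26 uops  ->  26 / 4 = 6.5 cycles for 4 multiplies
fully unrolled:  4 * 6 uops                        = 24 uops  ->  24 / 4 = 6   cycles for 4 multiplies
ratio: 6 / 6.5 = ~92% of the fully-unrolled throughput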


I don't think there's any benefit to using SIMD here; the widest element size for adds is 64-bit, and even AVX512 only has 64x64 => 128-bit multiply. Unless you can convert to double in which case AVX512 could run this very much faster.
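If exact integer results aren't required, a lossy floating-point version is the kind of thing that last sentence points at. This sketch is mine: conversions from uint64_t to double round above 2^53, and auto-vectorizing the sum typically needs -ffast-math or -fassociative-math.

double dp_approx(const uint64_t *a, const uint64_t *b, int n)
{
    double s = 0;
    for(int i=0; i<n; ++i)
        s += (double) a[i] * (double) b[i];   // each conversion and product rounds
    return s;
}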
