Does Zen 4 core have 48 flops per cycle for 32-bit precision fp?
Question
AMD Zen 4 executes vector operations only 256 bits wide, and the diagram from chipsandcheese's Zen 4 article shows 6 FP pipelines (4 ALU and 2 memory).
Each FMA does 1 multiplication and 1 add, while FADD does only 1 add. So does this mean that theoretically it can do a total of 2 multiplications and 4 adds = 6 operations of 256 bits each?
Assuming all 4 adds and 2 muls can be issued in the same cycle, does this mean 256 bits (8 floats of 32-bit precision) x 6 = 48 elements are computed per cycle (or 48 GFLOP/s per GHz)?
Assuming all operands are in registers, there should be enough bandwidth to get the data to the FPU (the L1 read bandwidth of 2x256 bits per cycle is only enough for 8 flops per cycle, but registers must be much faster), but the FPU throughput isn't clearly shown.
How does this compare to Intel 11th/12th/13th gen? For example, some workstation Xeons had 2 FPUs of 512 bits each but no dedicated "add" units? Is it fair to compare CPUs with different ratios of muls and adds flops-to-flops? It looks like AMD is better on:
d += a * b + c;
// or
d += a * b;
e += c;
while Intel is better on:
d = a * b + c;
// or
d += a * b;
per GFLOPS. Intel's FLOPS value looks better for matrix multiplication and blending. AMD's FLOPS value looks better for chained matrix add and multiply, and for some loops that combine a float accumulator with matrix multiplication.
So when doing matrix multiplication, is Zen 4 effectively 32 flops per cycle?
Answer 1
Score: 1
Yes, 48 FLOP / cycle theoretical max throughput on Zen 4 if you have a use for adds and FMAs in the same loop.
I'd guess that usually this is most useful when you have many short-vector dot products that aren't matmuls, so each cleanup loop needs to do some shuffling and adding. Out-of-order exec can overlap that work with FMAs.
And in code not using FMAs, you still have 2 mul + 2 add per clock, which is potentially quite useful for less well optimized code. (A lot of real-life code is not well optimized. How many times have you seen people give advice to not worry about performance?)
Also, a mix of shuffles and other non-FP-math vector work can run on a good mix of ports and still leave some room for FP adds and multiplies.
AFAIK, Zen 4 can keep both FMA and both FP-ADD units busy at the same time, so yes, 2 vector FMAs and 2 vector vaddps every cycle. So that's 6x vector-width FLOPs. It doesn't make sense to call it "4 adds and 2 muls" being issued (and dispatched to execution units) in the same cycle, though, since the CPU sees them as 2 FMA and 2 ADD operations, not 6 separate uops.
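To make the arithmetic concrete, here is a minimal C sketch of that peak-throughput count. The function name `peak_flops_per_cycle` and its parameters are made up for illustration, and the model simply assumes every pipe retires one full-width vector operation per cycle, which real code rarely sustains.

```c
/* Peak FLOPs per cycle under a simple model (an assumption, not a measurement):
   each FMA pipe retires one vector FMA per cycle (2 FLOPs per lane),
   each dedicated ADD pipe retires one vector add per cycle (1 FLOP per lane). */
unsigned peak_flops_per_cycle(unsigned vector_bits, unsigned element_bits,
                              unsigned fma_pipes, unsigned add_pipes)
{
    unsigned lanes = vector_bits / element_bits;   /* e.g. 256 / 32 = 8 floats */
    return lanes * (2u * fma_pipes + add_pipes);
}
```

With these assumptions, `peak_flops_per_cycle(256, 32, 2, 2)` gives the 48 figure for Zen 4, and setting `add_pipes` to 0 gives 32 for FMA-only code.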
> So when doing matrix multiplication, is zen 4 effectively 32 flops per cycle?
Yes, standard matmul is all FMAs, with little to no use for extra FP-add throughput.
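As an illustration (a plain scalar sketch, not a tuned kernel), the inner loop of a textbook matmul has exactly one multiply and one add per element update, which is the pattern that maps onto FMAs. The function name `matmul_count_fma` and the counter are hypothetical, added only to show the operation mix.

```c
#include <stddef.h>

/* Naive n x n single-precision matmul, C += A * B, all row-major.
   Returns how many multiply-add pairs were executed (n * n * n). */
unsigned long matmul_count_fma(size_t n, const float *a, const float *b, float *c)
{
    unsigned long fma_count = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            float aik = a[i * n + k];
            for (size_t j = 0; j < n; j++) {
                /* One multiply feeding one add: an FMA candidate.
                   There is no standalone add left over for an FP-add pipe. */
                c[i * n + j] += aik * b[k * n + j];
                fma_count++;
            }
        }
    }
    return fma_count;
}
```

Every floating-point operation in the loop belongs to a multiply-add pair, so dedicated add pipes have nothing extra to do here.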
Maybe some large-matrix multiplies using Strassen's algorithm would result in a workload with more than 1 addition per multiply, if you can arrange it such that the adding work overlaps with multiplying.
Or possibly run another thread on the same physical core doing the adding work, if you can arrange that without making things worse by competing for L1d cache footprint and bandwidth. HPC workloads sometimes scale negatively with SMT / hyperthreading for that reason, but partly that's because a well tuned single thread can use all the FP throughput from a single core. But if that's not the case on Zen 4, there's some theoretical room for gains.
However, that would require your FMA code to need less than 1 load per FMA, otherwise load/store uops will be the bottleneck if a submatrix-add thread is trying to load+load+add+store at the same time as a submatrix-multiply thread is doing 2 loads + 2 FMAs per clock.
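A rough way to see the load bottleneck is to count load slots. The sketch below is a simplified model, not a description of the actual scheduler: the function names are hypothetical, and the figure of 2 vector loads per clock is taken from the L1 read bandwidth quoted in the question.

```c
/* Upper bound on FMAs per clock when each FMA needs loads_per_fma vector loads
   and the core can do loads_per_clock vector loads (assumed, see lead-in). */
double max_fma_per_clock(double loads_per_fma, double loads_per_clock, double fma_pipes)
{
    if (loads_per_fma <= 0.0)
        return fma_pipes;                       /* register-only FMAs */
    double load_bound = loads_per_clock / loads_per_fma;
    return load_bound < fma_pipes ? load_bound : fma_pipes;
}

/* Load slots per clock left for a second thread (e.g. the submatrix-add thread). */
double spare_loads_per_clock(double loads_per_fma, double loads_per_clock, double fma_pipes)
{
    double fmas = max_fma_per_clock(loads_per_fma, loads_per_clock, fma_pipes);
    return loads_per_clock - fmas * loads_per_fma;
}
```

With 1 load per FMA and 2 loads per clock, the FMA thread alone uses every load slot (`spare_loads_per_clock(1.0, 2.0, 2.0)` is 0), so an add thread that needs two loads per add can only run by slowing the FMA thread down. At 0.5 loads per FMA, one load slot per clock is left over.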
> For example, some workstation xeons had 2x fpu of 512bits each but no dedicated "add"s?
And yes, Intel CPUs with a second 512-bit FMA unit (like some Xeon Scalable processors) can sustain 2x 512-bit FMAs per clock if you optimize your code well enough (e.g. not bottlenecking on loads+stores or FMA latency), so that gives you 2x 16 single-precision FMAs = 64 FLOP/cycle.
Alder Lake / Sapphire Rapids re-added separate execution units for FP-add, but they're on the same ports as the FMA units, so the benefit is lower latency for things that bottleneck on the latency of separate vaddps / vaddpd, like in Haswell. (But unlike Haswell, there are two of them, so the throughput is still 2/clock.)
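To compare the loop bodies from the question under these port layouts, here is a hedged back-of-the-envelope model in C. The struct `core_model` and the function `kernel_flops_per_cycle` are invented for this sketch. It assumes adds run only on dedicated add pipes when those sit on separate ports (as described above for Zen 4), and compete with FMAs when they share ports (as described above for Intel), and it ignores front-end limits, latency chains, loads, and clock-speed differences.

```c
struct core_model {
    unsigned lanes;      /* 32-bit float lanes per vector */
    unsigned fma_pipes;  /* pipes that can execute an FMA */
    unsigned add_pipes;  /* FP-add pipes on separate ports; 0 if adds share the FMA ports */
};

/* Iterations per cycle allowed by a group of pipes; effectively unlimited if unused. */
static double pipe_rate(unsigned pipes, unsigned uops)
{
    return uops ? (double)pipes / uops : 1e30;
}

/* FLOPs per cycle for a loop body containing fma_uops FMAs and add_uops adds. */
double kernel_flops_per_cycle(struct core_model m, unsigned fma_uops, unsigned add_uops)
{
    double flops_per_iter = (double)m.lanes * (2.0 * fma_uops + add_uops);
    double iters;
    if (m.add_pipes == 0) {
        /* Shared ports: every uop competes for the same pipes. */
        iters = pipe_rate(m.fma_pipes, fma_uops + add_uops);
    } else {
        /* Separate ports: the slower of the two groups sets the pace. */
        double f = pipe_rate(m.fma_pipes, fma_uops);
        double a = pipe_rate(m.add_pipes, add_uops);
        iters = f < a ? f : a;
    }
    return flops_per_iter * iters;
}
```

Under this model, `d += a * b + c` (1 FMA + 1 add) comes out at 48 FLOP/cycle both for a Zen 4-like core `{8, 2, 2}` and for a 2x 512-bit core `{16, 2, 0}`, while the pure-FMA `d = a * b + c` gives 32 versus 64. That is one way to see why a single peak-FLOPS number is not a fair comparison across different mul/add mixes.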