Do all CPUs that support AVX2 also support BMI2 or popcnt?

Question

From the question linked below, I learned that AVX support doesn't imply BMI1 support. So how about AVX2: do all CPUs that support AVX2 also support BMI2? Further, does AVX2 support imply popcnt support?

I searched all over Google and could not find a definite answer. The closest thing I found is https://stackoverflow.com/questions/44827477/does-avx-support-imply-bmi1-support.

Answer 1

Score: 3

All real hardware with AVX2 has also had BMI2.

AMD Zen 2 and earlier have unusably slow pdep/pext, so you'll want to check for those CPU models instead of availability of BMI2 if you're doing CPU detection to set up function pointers, for functions that use either instruction inside loops. Other BMI2 instructions are fine if supported.

Almost all AVX2 hardware has FMA as well, but not quite.

BMI1/2 and FMA3 are part of the -march=x86-64-v3 feature level (essentially Haswell, but without TSX, AES-NI, rdrand and some other stuff; see https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels).


It's fairly likely that all future CPUs will have either both AVX2 and BMI2 or neither, at least among commercially relevant mainstream CPUs. That said, pdep and pext need a significant number of transistors for a dedicated execution unit that no other instruction shares (in effect a bitwise version of AVX-512 vpcompressb/vpexpandb), or else slow microcode.
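
To make concrete what these two instructions compute, here is a minimal portable sketch in C. The names pext_soft and pdep_soft are hypothetical helpers, not part of any library, and the loops are reference code for illustration only; they are far slower than the single-uop hardware versions.

```c
#include <stdint.h>

/* pext: gather the bits of src selected by mask and pack them,
   in order, into the low bits of the result. */
uint64_t pext_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t out_bit = 1; mask != 0; out_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set mask bit */
        if (src & lowest)
            result |= out_bit;
        mask &= mask - 1;                      /* clear that bit (the blsr operation) */
    }
    return result;
}

/* pdep: the inverse scatter. Take the low bits of src, in order,
   and deposit them at the positions of the set bits in mask. */
uint64_t pdep_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t in_bit = 1; mask != 0; in_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & in_bit)
            result |= lowest;
        mask &= mask - 1;
    }
    return result;
}
```

For example, with mask 0xF0 the call pext_soft(0xB2, 0xF0) extracts the high nibble and yields 0xB, and pdep_soft(0xB, 0xF0) puts it back as 0xB0.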

AVX2 and BMI2 have separate feature bits, so an emulator or VM could disable BMI2 while leaving AVX2 enabled; it's a good idea to check both. (Also check that the OS has enabled AVX: run xgetbv, after using CPUID to confirm that xgetbv is supported.) An emulator might even fault if you try to run BMI2 instructions. A VM is different: unlike SSE/AVX/AVX-512, there's no control-register bit that will make the CPU hardware fault on BMI2 instructions it normally supports.
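
A sketch of such a check in C, assuming GCC or Clang (for the <cpuid.h> helpers and the inline-asm syntax). The function names are hypothetical. The decoding is split from the querying so the bit logic can be exercised with made-up register values; on non-x86 targets the query simply reports false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang: __get_cpuid, __get_cpuid_count */
#endif

/* Pure decoder over raw register values.
   leaf1_ecx: ECX from CPUID leaf 1
   leaf7_ebx: EBX from CPUID leaf 7, subleaf 0
   xcr0:      value read by xgetbv with ECX = 0 */
bool avx2_bmi2_usable(uint32_t leaf1_ecx, uint32_t leaf7_ebx, uint64_t xcr0)
{
    bool osxsave = (leaf1_ecx >> 27) & 1;  /* OS uses XSAVE, so xgetbv is usable */
    bool avx     = (leaf1_ecx >> 28) & 1;
    bool avx2    = (leaf7_ebx >> 5) & 1;
    bool bmi2    = (leaf7_ebx >> 8) & 1;
    bool os_ymm  = (xcr0 & 0x6) == 0x6;    /* OS saves XMM and YMM state */
    return osxsave && avx && os_ymm && avx2 && bmi2;
}

/* Query the running CPU. Returns false on non-x86 builds. */
bool cpu_has_usable_avx2_bmi2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    unsigned leaf1_ecx = ecx;
    if (!(leaf1_ecx & (1u << 27)))         /* without OSXSAVE, xgetbv would fault */
        return false;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    unsigned lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return avx2_bmi2_usable(leaf1_ecx, ebx, ((uint64_t)hi << 32) | lo);
#else
    return false;
#endif
}
```

A program would call cpu_has_usable_avx2_bmi2() once at startup and cache the result, for example when setting up function pointers.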

You don't need a separate AVX2 without BMI2 version of your functions, unless you wanted to use pdep/pext inside a loop. If someone sets up a weird emulator or VM that stops your code from using its AVX2 functions because it lacks BMI2, that's their problem, and is unlikely to happen by accident.

CPUs so far

  • Intel Haswell: introduced AVX2 and BMI2. (Also Intel's first BMI1 CPU).
  • Intel Gracemont (Alder Lake E-cores): AVX2 and BMI2. The first low-power Silvermont-family core with AVX1 or BMI1.
  • AMD Excavator: AMD's first AVX2 CPU was also their first BMI2 CPU. (With horribly slow microcoded pdep / pext)
  • AMD Zen 3: the first AMD with usable pdep / pext (same as Intel, 1 uop with 3c latency, 1c throughput).
  • VIA Nano C QuadCore C4650 (Isaiah) from 2015: AVX2 + BMI2. (Notably without FMA3; see Footnote 1.) I think this was VIA's first AVX2 CPU.
  • ZHAOXIN KaiXian ZX-C+ C4580: AVX2 + BMI2 (slow pdep / pext, but maybe not as bad as AMD? InstLatx64 doesn't say what inputs they tested with, and this might just be a very special case like 0). Based on VIA Nano C.
  • Centaur CNS: AVX-512, AVX2, BMI2 (fast pdep / pext)

Unusably slow pdep / pext on AMD Zen 2 and earlier

AMD CPUs before Zen 3 (so Excavator, Zen 1, and Zen 2) have disastrously slow pdep and pext, where the number of uops depends on the data. For example, https://uops.info/ measured 64-bit pext at 133 uops on Zen 1 and Zen 2, with a throughput of one per 52 cycles.
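
One way to check for those CPU models rather than for BMI2 alone is sketched below in C. This is a heuristic under stated assumptions, not a definitive rule: it assumes Zen 3 and later AMD cores report CPUID family 19h or higher (Zen 1 and Zen 2 report 17h, Excavator 15h), and it treats every non-AMD vendor as fast, which the Zhaoxin entry in the list above suggests may be too optimistic. The function names are hypothetical; the vendor string and leaf 1 EAX value are expected to come from CPUID leaves 0 and 1.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Display family from CPUID leaf 1 EAX: the base family (bits 11:8),
   plus the extended family (bits 27:20) when the base family is 0xF. */
unsigned cpu_display_family(uint32_t leaf1_eax)
{
    unsigned base = (leaf1_eax >> 8) & 0xF;
    unsigned ext  = (leaf1_eax >> 20) & 0xFF;
    return base == 0xF ? base + ext : base;
}

/* Heuristic: is it worth using pdep/pext inside a hot loop?
   vendor is the 12-character CPUID vendor string, NUL-terminated. */
bool pdep_pext_is_fast(const char *vendor, uint32_t leaf1_eax, bool has_bmi2)
{
    if (!has_bmi2)
        return false;
    if (strcmp(vendor, "AuthenticAMD") == 0)
        return cpu_display_family(leaf1_eax) >= 0x19;  /* assumption: Zen 3+ */
    return true;  /* assumption: other vendors are treated as fast */
}
```

The result would then pick between a pdep/pext version of a function and a fallback when the function pointers are set up.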

All other BMI/BMI2 instructions are fast on CPUs that support them, at most 2 uops for stuff like blsr on AMD before Zen 4, or single-uop on Intel.


AVX1 implies popcnt

AVX1 implies SSE4.2, and SSE4.2 at least de-facto implies popcnt.

popcnt does have its own feature bit, so CPUs can have popcnt without SSE4.2 support, but in practice the opposite hasn't happened. Enough software assumes that SSE4.2 implies popcnt that if a CPU violated that assumption, it would be the CPU's fault, not the software's. It's not really a plausible situation; popcnt is cheap to implement compared to the SSE4.2 string instructions.

  • The link lists Excavator. They don't list any VIA or Zhaoxin CPUs.


Footnote 1: Mysticial commented

> The VIA Isaiah C4650 has AVX2 but not FMA3. Breaks a lot of programs that assume FMA3 in the presence of AVX2
>
> Btw, I spoke to one of the VIA architects at Hot Chips about it. And he was pissed that they allowed that to happen. IIRC, he hinted that they should've either turned off the CPUID for AVX2 or microcoded the FMA.

huangapple
  • Published on June 8, 2023, 09:33:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76428057.html