Why is a pipelined processor identified as SISD?

Question

I'm studying pipelined processors, and I noticed that a pipelined (single-core) processor is classified as Single Instruction, Single Data (SISD) in Flynn's taxonomy. I was wondering why.

Why "single instruction", if there is more than one instruction in the pipeline at a time? Shouldn't it be called Multiple Instruction, Single Data?

Also, what does "single data" mean? Every stage of the pipeline works on different data, so I would say a pipelined processor is "multiple data", unless "single data" refers to the program. I'm quite confused.

I hope someone can enlighten me.
Thanks a lot.

Answer 1

Score: 1

  • Single Data in SISD refers to the amount of data operated on by each instruction.

  • Single Instruction in SISD refers to there being only one instruction stream, a program that executes as if each instruction ran one at a time, in order.

The MIMD term is normally reserved for multi-threading, although you could maybe use it to describe an unrolled loop running on a modern CPU (which can execute 4 to 6 instructions in parallel every clock cycle, with hundreds in flight in the pipeline).

Flynn's taxonomy is concerned with execution units

A scalar pipeline still only has one ALU, so it is firmly in the SISD category. Keeping it fed with work more of the time is obviously good, but that doesn't change the fact that your CPU reads one instruction stream and executes the instructions as if each fully completed before the next one started (e.g. stalling for hazards that bypass forwarding can't avoid).

So in terms of the programming model, you don't have to explicitly write a parallel program that's aware of how the CPU is executing (footnote 1), unlike with SIMD, where you have to explicitly process data in chunks, or MIMD multi-threading, where you have to coordinate multiple threads accessing parts of your data (and worse, coordinate places where they access the same data, if you need any reductions like the sum of an array).

Being aware of latencies when scheduling instructions can improve performance for a scalar pipelined CPU, but that's a detail the taxonomy isn't concerned with. Scalar pipelines often don't need much instruction-level parallelism (ILP) to avoid stalls: integer ALU ops normally have 1-cycle latency, so they can run back to back. The earliest pipelined CPUs (like MIPS R2000) had load results ready for the ALU to use after a gap of only 1 extra cycle, so a bit of instruction scheduling, and maybe unrolling + software pipelining, can help, especially if you have chains of floating-point instructions with higher latencies.

It's assumed you or the compiler does whatever simple local optimizations are necessary for the specific CPU to feed its execution unit(s) with work most cycles, like loop unrolling and even software pipelining to avoid hazards. A serial dependency like sum += arr[i] might just make your code execute slower (depending on latency of addition vs. how tight the loop is); it doesn't force you to restructure your code to still run using the same instructions.
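
As a minimal sketch in plain C (the function name is mine; the arr / sum names come from this answer's own example), here is that serial dependency chain. Each iteration's add consumes the previous iteration's result, so the loop runs at the FP-add latency rather than the FP-add throughput:

```c
#include <stddef.h>

/* Latency-bound reduction: each addition depends on the previous one
 * (a loop-carried dependency through sum), so a pipelined CPU can't
 * start the next add until the previous one finishes. */
float sum_serial(const float *arr, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}
```

Note that compilers won't reassociate FP math by default (it changes rounding), so without options like -ffast-math this loop really does execute as one serial chain.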

Where the taxonomy really loses touch with later developments in computer architecture (it was proposed in 1966 and 1972 papers) is superscalar pipelines, which can look at a serial instruction stream and find instructions that don't depend on each other (instruction-level parallelism aka ILP), and execute them in parallel, hopefully achieving greater than 1 instruction per clock on average, and much higher peak throughput when not stalled. Such a pipeline will have multiple execution units; for example AMD Zen 2 has four integer ALUs and four floating-point execution units, and a front-end that can issue/rename up to 6 micro-ops per clock cycle.

(This is a simplification; it's actually four execution ports, where different functional units share the same port: for example, the integer divider is separate hardware from the integer popcount / lzcnt/tzcnt unit or the simple add/sub/and/or/xor unit.)

See also Modern Microprocessors: A 90-Minute Guide! for more about superscalar out-of-order exec pipelines, and https://www.realworldtech.com/sandy-bridge/. Out-of-order exec is very good at hiding latencies from cache misses, and also at finding ILP across loop iterations without needing huge amounts of software unrolling and software pipelining.

Optimizing for wider pipelines makes it more important to unroll a loop like sum += arr[i] with multiple accumulators like sum0 += arr[i+0] / sum1 += arr[i+1] etc. to hide latency, as in https://stackoverflow.com/questions/63095394/latency-bounds-and-throughput-bounds-for-processors-for-operations-that-must-occ and https://stackoverflow.com/q/45113527
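
A minimal sketch of that multiple-accumulator unrolling (hypothetical function name; assumes n is a multiple of 4 to keep it short). The four sums form four independent dependency chains that the pipeline can overlap:

```c
#include <stddef.h>

/* Four independent dependency chains instead of one: the pipeline can
 * have four adds in flight at once, hiding the FP-add latency. */
float sum_unrolled(const float *arr, size_t n) {
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        sum0 += arr[i + 0];
        sum1 += arr[i + 1];
        sum2 += arr[i + 2];
        sum3 += arr[i + 3];
    }
    return (sum0 + sum1) + (sum2 + sum3); /* combine outside the loop */
}
```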

If you bottleneck on one serial dependency chain, you're running 1 addition per 4 clock cycles (for example on Intel Skylake for floating point), vs. 2 per clock cycle, so that's a factor of 8 in throughput vs. latency from exposing ILP to the CPU pipeline. And that's not even considering SIMD, where each instruction (the unit of work that the CPU pipeline has to track) can do 4 or 8 FP adds, so that's another factor of 8 in array-summing throughput from using SIMD vector instructions. I.e. it costs the same to do 1 float at a time as it does to do 8 floats at a time. (This is modern-style short-vector fixed-length SIMD, not like older Cray-style vector processors where you tell the CPU about a whole array and it internally loops over it, without the loop overhead that would suck so much on a scalar pipeline.)
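
To make the SIMD factor concrete, here's a sketch using SSE intrinsics (the function name is mine; it assumes n is a multiple of 4 and arr is 16-byte aligned). Each _mm_add_ps is a single instruction, like addps, doing 4 FP adds:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps, ... */

/* One instruction per 4 elements: the pipeline tracks 4x fewer
 * instructions for the same number of additions. */
float sum_sse(const float *arr, size_t n) {
    __m128 vsum = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        vsum = _mm_add_ps(vsum, _mm_load_ps(arr + i)); /* addps */
    }
    /* Horizontal sum of the 4 lanes (simple, not the fastest way). */
    float lane[4];
    _mm_storeu_ps(lane, vsum);
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

This still has a single vector dependency chain, so getting both factors of 8 described above means combining this with the multiple-accumulator trick from the previous sketch (several __m128 sums).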

See also Duncan's taxonomy, a 1990 update to Flynn's that still predates widespread superscalar pipelined CPUs. An early superscalar pipeline was Intel Pentium in 1993, a dual-issue in-order design: it decoded 2 instructions per clock cycle and could send them both through its pipeline in lockstep if they were compatible.

Footnote 1: Except when the architecture makes some pipeline details architecturally visible, for example MIPS I having load delay slots: instead of stalling, you'd get unpredictable values if the instruction after a load reads the result. And branch-delay slots, where the instruction after a jump/branch executes even when it's taken.


Further thoughts:

According to Wikipedia's article on Flynn's taxonomy, MISD (only?) describes things like redundant / fault-tolerant systems, like in aviation flight control computers, where the same data is processed multiple times. (e.g. 3 or 4 times, with each flight computer "voting" on the right thing to do, so a fault in one can be detected when it's different from the others.)
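
As a toy illustration of that MISD-style redundancy (entirely hypothetical names, not any real flight-control API): the same input goes through three independent computations, and a voter picks the majority result:

```c
/* Majority vote over three redundant results computed from the
 * same data; a single faulty unit is outvoted by the other two. */
int vote3(int a, int b, int c) {
    if (a == b || a == c) return a;
    if (b == c) return b;
    return a; /* no majority: a real system would signal a fault here */
}

int redundant_compute(int input,
                      int (*unit1)(int), int (*unit2)(int), int (*unit3)(int)) {
    return vote3(unit1(input), unit2(input), unit3(input));
}
```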

Wikipedia also mentions a "pipelined" processor in the context of Flynn's taxonomy, but they're talking only in terms of an implementation strategy for SIMD (like separate pipelines or "lanes" within an ALU for an instruction like addps xmm0, xmm1 that does four FP additions.) That sense of "pipelined" is unrelated to what you're talking about, pipelining the fetch / decode / etc. of other instructions to happen in parallel with the execution unit.


Flynn's taxonomy does get a bit fuzzy when applied to modern CPUs with multiple execution units, where each execution unit might internally have multiple SIMD lanes.

An instruction like addps xmm0, xmm1 provides a control input to one of the SIMD FPUs, which internally activates four FP adders, one for each lane (element) of the vector. So there's your SIMD. The execution unit is pipelined, able to start new independent work every cycle but only producing a result after 4 cycles, for example.

But this might be happening in the same clock cycle as a mulps xmm2, xmm3 in another execution unit, which is also SIMD.
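
A sketch of that situation (again SSE intrinsics, hypothetical function): within one iteration, the _mm_add_ps and _mm_mul_ps results are independent of each other, so a superscalar core can issue the addps and mulps to different SIMD execution units in the same cycle:

```c
#include <stddef.h>
#include <xmmintrin.h>

/* The add and the multiply don't depend on each other, so they can
 * execute in the same clock cycle on different SIMD units.
 * Assumes n is a multiple of 4. */
void add_and_mul(float *sums, float *prods,
                 const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(sums  + i, _mm_add_ps(va, vb)); /* addps */
        _mm_storeu_ps(prods + i, _mm_mul_ps(va, vb)); /* mulps */
    }
}
```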

Traditionally, we don't call this MIMD when both SIMD FPUs are part of the same CPU core, and those addps and mulps instructions were part of the same single-threaded program, or thread of execution in a multi-threaded program.

At least I think that's the case. Really, the term MIMD basically stops being useful at this level of CPU complexity, so it's better to just say what you mean than to figure out which abstract label applies.
