Parallel Processing & Distributed Systems - Chapter 5: Pipeline

Thoai Nam
Khoa Công Nghệ Thông Tin – Đại Học Bách Khoa Tp.HCM
 Pipelining concepts
 The DLX architecture
 A simple DLX pipeline
 Pipeline Hazards and Solution to overcome
Reference: 
Computer Architecture: A Quantitative Approach, 
John L. Hennessy & David A. Patterson, Chapter 6
 A technique to make fast CPUs by overlapping execution 
of multiple instructions
Cycle            1    2    3    4    5    6    7    8
Instruction i    S1   S2   S3   S4
Instruction i+1       S1   S2   S3   S4
Instruction i+2            S1   S2   S3   S4
Instruction i+3                 S1   S2   S3   S4
Instruction i+4                      S1   S2   S3   S4
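The overlap pattern above can be generated mechanically. The sketch below is illustrative only; the stage names S1..S4 and the 5-instruction window are just the values used in the table, and one new instruction is assumed to be issued every cycle.

# A minimal sketch: print the overlap table for a generic 4-stage pipeline
# that issues one new instruction per clock cycle.
STAGES = ["S1", "S2", "S3", "S4"]          # placeholder stage names
N = 5                                      # number of instructions shown above

total_cycles = N + len(STAGES) - 1         # cycle in which the last instruction finishes
for i in range(N):
    cells = ["  ."] * total_cycles
    for k, stage in enumerate(STAGES):
        cells[i + k] = f"{stage:>3}"       # instruction i is in stage k during cycle i+k+1
    print(f"Instruction i+{i}:" + "".join(cells))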
 Pipeline throughput
– Determined by how often an instruction exits the pipeline
– Depends on the overhead of clock skew and setup
– Depends on the time required for the slowest pipe stage
 Pipeline stall
– Delays the execution of an instruction and of all 
succeeding instructions
– “Slows down” the pipeline
 Pipeline Designer’s goal
– Balance the lengths of the pipeline stages (see the numeric sketch below)
– Reduce / Avoid pipeline stalls
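A small numeric illustration of why balancing matters: the clock period is set by the slowest stage plus the clocking overhead. The stage latencies and the 1 ns overhead below are made-up numbers, not DLX figures.

# Sketch with assumed numbers: clock period = slowest stage + skew/setup overhead.
stage_times_ns = [10, 8, 10, 7, 7]     # hypothetical latencies of 5 stages
overhead_ns = 1                        # assumed clock skew + register setup

clock_ns = max(stage_times_ns) + overhead_ns
print("clock cycle:", clock_ns, "ns")                          # 11 ns
print("peak throughput:", round(1e3 / clock_ns, 1), "MIPS")    # about 90.9 MIPS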
Pipeline speedup

Pipeline speedup = Average instruction time without pipelining / Average instruction time with pipelining

                 = (CPI without pipelining * Clock cycle without pipelining) / (CPI with pipelining * Clock cycle with pipelining)

CPI without pipelining = Ideal CPI * Pipeline depth
CPI with pipelining    = Ideal CPI + Pipeline stall clock cycles per instruction

Assuming the clock cycle is unchanged by pipelining:

Pipeline speedup = (Ideal CPI * Pipeline depth) / (Ideal CPI + Pipeline stall clock cycles per instruction)

( CPI = number of Cycles Per Instruction )
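For example, a 5-stage pipeline with an ideal CPI of 1; the 0.5 stall cycles per instruction is an assumed figure used only for this illustration.

# Worked example of the speedup formula above (numbers are assumptions).
ideal_cpi = 1.0
pipeline_depth = 5
stalls_per_instruction = 0.5           # assumed average stall cycles

speedup = (ideal_cpi * pipeline_depth) / (ideal_cpi + stalls_per_instruction)
print(f"speedup = {speedup:.2f}x")     # 5 / 1.5 = 3.33x, instead of the ideal 5x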
 A hypothetical computer whose architecture is based on the 
most frequently used primitives in programs
 Used to demonstrate and study computer 
architecture organizations and techniques
 A DLX instruction consists of 5 execution stages
– IF – instruction fetch
– ID – instruction decode and register fetch
– EX – execution and effective address calculation
– MEM – memory access
– WB – write back
Cycle            1    2    3    4    5    6    7    8
Instruction i    IF   ID   EX   MEM  WB
Instruction i+1       IF   ID   EX   MEM  WB
Instruction i+2            IF   ID   EX   MEM  WB
Instruction i+3                 IF   ID   EX   MEM  WB
 Fetch a new instruction on each clock cycle
 An instruction step = a pipe stage
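A direct consequence of the table: with 5 stages and one instruction issued per cycle, n instructions need 5 + (n - 1) cycles instead of 5n. A tiny sketch of that count, assuming no stalls:

# Sketch: cycle counts for an ideal 5-stage DLX-style pipeline (no stalls assumed).
def cycles_pipelined(n, depth=5):
    return depth + n - 1               # fill the pipe once, then finish 1 instruction/cycle

def cycles_unpipelined(n, depth=5):
    return depth * n                   # every instruction takes all 5 stages alone

for n in (4, 10, 100):
    print(n, "instructions:", cycles_pipelined(n), "vs", cycles_unpipelined(n), "cycles")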
 Are situations that prevent the next 
instruction in the instruction stream from 
executing during its designated clock cycle
 Lead to pipeline stalls
 Reduce pipeline performance
 Are classified into 3 types
– Structural hazards
– Data hazards
– Control hazards
 Due to resource conflicts
 Instances of structural hazards
– Some functional unit is not fully pipelined
» a sequence of instructions that all use that unit cannot
be sequentially initiated 
– Some resource has not been duplicated enough, e.g.:
» Only one register-file write port while two writes are 
needed in a cycle
» Using a single memory pipeline for both data and instructions 
(a toy stall count for this case is sketched below)
 Why do we allow this type of hazard?
– To reduce cost
– To reduce the latency of the unit
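The following toy model is not actual DLX hardware; the instruction mix and the rule that a data access uses the memory 3 cycles after its own fetch are simplifying assumptions. It only illustrates how a single shared instruction/data memory forces fetches to wait.

# Toy model: one memory port shared by instruction fetch (IF) and data access (MEM).
# Assumption: a load/store occupies the memory 3 cycles after its own fetch cycle,
# and an IF that collides with such a data access must wait one cycle.
program = ["LW", "ADD", "SUB", "AND", "OR"]   # hypothetical instruction sequence

stalls = 0
fetch_cycle = 1
data_access_cycles = set()             # cycles in which the memory serves a MEM stage
for op in program:
    while fetch_cycle in data_access_cycles:
        stalls += 1                    # structural stall: IF waits for the memory port
        fetch_cycle += 1
    if op in ("LW", "SW"):
        data_access_cycles.add(fetch_cycle + 3)
    fetch_cycle += 1
print("structural stalls:", stalls)    # 1: the AND fetch waits for the LW's MEM cycle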
 Occur when the order of access to operands is 
changed by the pipeline, making data unavailable for 
the next instruction
 Example: consider these 2 instructions
ADD R1, R2, R3   ( R2 + R3 → R1 )
SUB R4, R1, R5   ( R1 – R5 → R4 )
Cycle            1    2    3    4    5    6
ADD instruction  IF   ID   EX   MEM  WB
SUB instruction       IF   ID   EX   MEM  WB

Data written here: in the ADD's WB stage (cycle 5)
Data read here: in the SUB's ID stage (cycle 3) → the SUB instruction is stalled 2 cycles
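The 2-cycle figure can be checked with a small sketch. It uses the common DLX assumption that a register written in WB can already be read by ID in that same cycle (write in the first half of the cycle, read in the second half); the helper name and the distance encoding are just for this illustration.

# Sketch: stalls for a RAW hazard without forwarding.
# The producer is issued in cycle 1: IF=1, ID=2, EX=3, MEM=4, WB=5.
def raw_stalls(distance):
    """distance = 1 means the reader immediately follows the writer."""
    value_ready = 5                    # WB of the producer (readable that same cycle)
    read_cycle = 2 + distance          # ID of the consumer if nothing stalled
    return max(0, value_ready - read_cycle)

print(raw_stalls(1))   # ADD then SUB           -> 2 stall cycles
print(raw_stalls(2))   # one instruction apart  -> 1 stall cycle
print(raw_stalls(3))   # two instructions apart -> 0 stall cycles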
 Forwarding (bypassing / short-circuiting) technique
– Reduces the delay between two dependent instructions
– The ALU result is fed back to the ALU input latches
– The forwarding hardware checks for dependences and forwards the 
necessary result to the ALU inputs of the next 2 instructions
ADD R1, R2, R3     IF   ID   EX   MEM  WB
SUB R4, R1, R5          IF   ID   EX   MEM  WB     (no stall)
AND R6, R1, R7               IF   ID   EX   MEM  WB     (no stall)
OR  R8, R1, R9                    IF   ID   EX   MEM  WB     (no stall)
XOR R1, R10, R11                       IF   ID   EX   MEM  WB
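The same kind of count as before shows what forwarding buys; the cycle numbers rely on the assumptions already stated (producer issued in cycle 1, same-cycle WB write / ID read) plus the EX-to-EX forwarding path described above.

# Sketch: RAW stalls between two ALU instructions, with and without forwarding.
def alu_raw_stalls(distance, forwarding):
    # Producer issued in cycle 1: IF=1, ID=2, EX=3, MEM=4, WB=5.
    if forwarding:
        earliest_use = 4               # result leaves EX in cycle 3, usable by an EX in cycle 4
        need = 3 + distance            # consumer's EX stage if nothing stalled
    else:
        earliest_use = 5               # must wait for WB (same-cycle write/read assumed)
        need = 2 + distance            # consumer's ID stage if nothing stalled
    return max(0, earliest_use - need)

for d in (1, 2, 3):
    print(d, alu_raw_stalls(d, False), alu_raw_stalls(d, True))
# distance 1: 2 stalls without forwarding, 0 with forwarding (the SUB/AND/OR case above)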
 RAW (Read After Write)
– Instruction j tries to read a source before instruction i writes it 
– The most common type
 WAR (Write After Read)
– Instruction j tries to write a destination before instruction i reads it
– Cannot happen in the DLX pipeline. Why?
 WAW (Write After Write)
– Instruction j tries to write an operand before instruction i writes it
– The writes end up in the wrong order
 Is RAR (Read After Read) a hazard?
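A small sketch of the classification; the register names and the tuple encoding are just an example, and only dependences that involve a write are checked.

# Sketch: classify the hazard type between an earlier instruction i and a later j.
def classify(i_dst, i_srcs, j_dst, j_srcs):
    hazards = []
    if i_dst and i_dst in j_srcs:
        hazards.append("RAW")          # j reads what i writes
    if j_dst and j_dst in i_srcs:
        hazards.append("WAR")          # j writes what i still needs to read
    if i_dst and i_dst == j_dst:
        hazards.append("WAW")          # both write the same register
    return hazards or ["none"]

# ADD R1,R2,R3 followed by SUB R4,R1,R5 -> RAW on R1
print(classify("R1", {"R2", "R3"}, "R4", {"R1", "R5"}))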
 Pipeline scheduling (instruction scheduling)
– Use the compiler to rearrange the generated code to eliminate hazards. 
Example (a stall count for both versions is sketched after the listings):
Source code:
    c = a + b
    d = e - f

Generated code (data hazards):
    LW  Ra, a
    LW  Rb, b
    ADD Rc, Ra, Rb
    SW  c, Rc
    LW  Re, e
    LW  Rf, f
    SUB Rd, Re, Rf
    SW  d, Rd

Generated and rearranged code (no hazard):
    LW  Ra, a
    LW  Rb, b
    LW  Re, e
    ADD Rc, Ra, Rb
    LW  Rf, f
    SW  c, Rc
    SUB Rd, Re, Rf
    SW  d, Rd
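A quick way to compare the two versions is to count load-use cases, i.e. places where an LW result is needed by the very next instruction; with forwarding, each such case costs one stall cycle, and it is the only kind of hazard left in the generated code above. The tuple encoding of the instructions is just for this sketch.

# Sketch: count load-use stalls (value loaded by LW used by the very next instruction).
def load_use_stalls(program):
    stalls = 0
    for k in range(1, len(program)):
        op, dst, _srcs = program[k - 1]
        _, _, next_srcs = program[k]
        if op == "LW" and dst in next_srcs:
            stalls += 1
    return stalls

# (opcode, destination, sources)
generated = [("LW", "Ra", set()), ("LW", "Rb", set()),
             ("ADD", "Rc", {"Ra", "Rb"}), ("SW", None, {"Rc"}),
             ("LW", "Re", set()), ("LW", "Rf", set()),
             ("SUB", "Rd", {"Re", "Rf"}), ("SW", None, {"Rd"})]
rearranged = [generated[0], generated[1], generated[4], generated[2],
              generated[5], generated[3], generated[6], generated[7]]
print(load_use_stalls(generated), load_use_stalls(rearranged))   # 2 vs 0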
 Occur when a branch/jump instruction is taken
 Cause a large performance loss
 Example: 

Cycle               1     2     3     4     5     6     7     8     9     10
Branch instruction  IF    ID    EX    MEM   WB
Instruction i+1           IF    stall stall IF    ID    EX    MEM   WB
Instruction i+2                 stall stall stall IF    ID    EX    MEM   WB
Instruction i+3                       stall stall stall IF    ID    EX    MEM ...
Instruction i+4                             stall stall stall IF    ID    EX ...
Instruction i+5                                   stall stall stall IF    ID ...
Instruction i+6                                         stall stall stall IF ...

The PC register is changed here: in the branch's MEM stage (cycle 4)
Unnecessary instruction loaded: the IF of instruction i+1 in cycle 2 fetches the wrong instruction
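The cost adds up quickly. A rough estimate: the 20% branch frequency below is an assumed number, while the 3-cycle penalty is the one shown in the example.

# Rough estimate of the CPI impact of the 3-cycle branch penalty above.
branch_fraction = 0.2       # assumed: 20% of executed instructions are taken branches
branch_penalty = 3          # stall cycles until the PC is changed in MEM

cpi = 1.0 + branch_fraction * branch_penalty
print("CPI =", cpi)         # 1.6 instead of the ideal 1.0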
 Predict whether the branch is taken or not
 Compute the branch target address earlier
 Several schemes are used
– Pipeline freezing 
– Predict-not-taken scheme
– Predict-taken scheme (not applicable in DLX)
– Delayed branch
 Pipeline freezing: hold any instruction after the branch 
until the branch destination is known
 Simple but not efficient 
 Predict-not-taken: treat the branch as not taken and 
allow execution to continue
– The machine state must not be changed until the 
branch outcome is known
 If the branch is not taken: no penalty
 If the branch is taken:
– Restart the fetch at the branch target 
– Stall one cycle
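With made-up frequencies (20% branches, 60% of them taken), the expected cost of this scheme is small; the 1-cycle penalty for a taken branch is the one stated above.

# Sketch: expected CPI under predict-not-taken; only taken branches pay a stall cycle.
branch_fraction = 0.2       # assumed branch frequency
taken_fraction = 0.6        # assumed fraction of branches that are taken
penalty_taken = 1           # from the slide: restart the fetch, stall one cycle

cpi = 1.0 + branch_fraction * taken_fraction * penalty_taken
print("CPI =", cpi)         # 1.12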
 Example

Cycle                     1     2     3     4     5     6     7     8     9
Taken branch instruction  IF    ID    EX    MEM   WB
Instruction i+1                 IF    IF    ID    EX    MEM   WB
Instruction i+2                       stall IF    ID    EX    MEM   WB
Instruction i+3                             stall IF    ID    EX    MEM   WB
Instruction i+4                                   stall IF    ID    EX    MEM

Instruction fetch restarted in cycle 3: the right instruction (at the branch target) is fetched
 Delayed branch: change the order of execution so that the 
instruction in the delay slot is always valid and useful
 “From before” approach: an instruction from before the branch fills the delay slot

    ADD R1, R2, R3
    If R2=0 then
    (delay slot)

becomes

    If R2=0 then
    ADD R1, R2, R3    (in the delay slot)
 “From target” approach: the instruction at the branch target fills the delay slot

    SUB R4, R5, R6    (branch target)
    ADD R1, R2, R3
    If R1=0 then
    (delay slot)

becomes

    ADD R1, R2, R3
    If R1=0 then
    SUB R4, R5, R6    (in the delay slot)
 “From fall through” approach: an instruction from the not-taken (fall-through) path fills the delay slot

    ADD R1, R2, R3
    If R1=0 then
    (delay slot)
    SUB R4, R5, R6

becomes

    ADD R1, R2, R3
    If R1=0 then
    SUB R4, R5, R6    (in the delay slot)