Asst. Prof. Dr.Sumek Wisayataksin
Computer Architectures
Chapter 5 : Pipelining Processor
Increasing Process Speed
▪ Cooking Process
Process 1 : Material Preparing (Meat/Veggie Chop), 5 mins
Process 2 : Cooking, 10 mins
Process 3 : Garnishing Food, 5 mins
Process 4 : Serving, 3 mins
It takes 23 mins to finish 1 dish!!!
Parallel Processing
3 chefs working in parallel (Chef 1, Chef 2, Chef 3): 3 times speed up.
Average: 23/3 = 7.67 min/dish
Parallel Processing Limitation
▪ The processors (chefs) must be experts in every process from 1 to 4.
▪ You need many expert chefs to speed up the process.
▪ Every processor (chef) must have the same skills.
Pipelining Process
[Timeline diagram: Dish1 to Dish4 flow through the 4 processes; a new dish starts every 10 mins, the time of the longest stage.]
4 Processes
Average: 10 mins/dish
Balancing Pipeline stages
▪ Since the bottleneck is Process 2 (10 mins), we can break it into two subprocesses.
[Timeline diagram: Dish1 to Dish5 flow through the 5 balanced processes; a new dish starts every 5 mins.]
5 Processes
Average: 5 mins/dish
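The averages on these slides can be checked with a short script. This is a minimal Python sketch assuming every stage takes the same time (the balanced 5-stage case above); `finish_time` is a made-up helper for illustration, not part of any real tool:

```python
# Throughput of the cooking pipeline: with balanced stages, dish k
# finishes at (number_of_stages + k - 1) * stage_time, so the average
# time per dish approaches the stage time as k grows.
def finish_time(dish_index, n_stages, stage_time):
    """1-based dish index; all stages take stage_time."""
    return (n_stages + dish_index - 1) * stage_time

n_stages, stage_time = 5, 5  # balanced pipeline from the slide
times = [finish_time(k, n_stages, stage_time) for k in range(1, 101)]
avg_per_dish = times[-1] / 100
print(avg_per_dish)  # 5.2: already close to the 5 min/dish stage time
```

The first dish still takes the full 25 mins to pass through all five stages; only the steady-state average approaches 5 mins/dish.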
What is pipelining?
▪ Pipelining is a set of data processing elements connected in series, where
the output of one element is the input of the next one.
▪ The elements of a pipeline are often executed in parallel or in time-sliced
fashion.
▪ Some amount of buffer storage is often inserted between elements.
Pipelining and Maintaining Correctness
• We can break the execution of instructions into stages and overlap the
execution of different instructions.
• We need to ensure that the results produced by our new pipelined processor
are no different from those of the unpipelined one.
• What sort of situations could cause problems once we pipeline our
processor?
Basic Processor Architecture
[Diagram: Program Counter (PC), register file REG 0 ... REG N, ALU with data registers, Instruction Register (IR), FLAG register (Z, C, N, V), and control logic (CTRL), connected by the data bus, address bus, and control bus.]
5-stage Pipelined Processor
[Datapath diagram. Stage 1 Fetch: the PC and a +4 adder drive the Program Memory (Addr in, Read Inst out). Stage 2 Decode: the CTRL Box decodes the instruction and drives the Regfile (R0-R14) read ports Rn, Rm, Rs, producing the values Vn, Vm, Vs, plus the immediate #imm. Stage 3 Execute: the Shift unit and ALU compute the result and update the Flag register; a +offset adder forms branch targets. Stage 4 Memory: the ALU result addresses the Data Memory (Addr, Read Data, Write Data ports). Stage 5 Writeback: the result Vd, with write index Vidx, returns to the Regfile. Control signals for every stage come from the CTRL Box.]
Stage 1 : Fetch (Read Instructions)
• The next instruction is fetched from the memory address
that is currently stored in the program counter (PC) and
stored into the instruction register.
• At the end of the fetch operation, the PC points to the next
instruction that will be read at the next cycle.
• The next instruction might come from a branch address.
Stage 1 : Fetch
1. The CPU sends the contents of the PC to the address bus
and issues a read command on the control bus.
2. In response to the read command (with the address equal to
the PC), the program memory returns the data stored at that
memory location on the data bus.
3. The CPU copies the data from the data bus into its
Instruction Register (IR).
4. The PC is incremented so that it points to the next
instruction. This step prepares the CPU for the next cycle.
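The four fetch steps can be sketched in a few lines of Python. This is a toy model: `program_memory` and `fetch` are hypothetical names, instructions are plain strings, and the buses are just local variables:

```python
# Minimal sketch of the fetch cycle: 1) PC -> address bus, 2) the
# program memory returns the word, 3) data bus -> IR, 4) PC += 4
# (fixed-size 32-bit instructions assumed).
program_memory = {0x00: "ADD R2, R0, R1", 0x04: "ORR R3, R0, #3"}

def fetch(pc):
    address_bus = pc                        # step 1: PC on the address bus
    data_bus = program_memory[address_bus]  # step 2: memory read
    ir = data_bus                           # step 3: copy into the IR
    pc = pc + 4                             # step 4: PC points to the next inst
    return ir, pc

ir, pc = fetch(0x00)
print(ir, hex(pc))  # ADD R2, R0, R1 0x4
```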
Stage 2 : Decode (Inst Interpretation)
▪ During this stage, the encoded instruction held in the instruction
register is interpreted by the decoder.
▪ Register values and immediate constants are also read in this
stage.
Stage 3 : Execute (Calculation)
▪ The CPU sends the decoded instruction
as a set of control signals to the
corresponding computer components.
▪ If the instruction involves arithmetic or
logic, the ALU is used to compute the data.
▪ This is the only stage of the instruction
cycle that computes data. Flags are
updated in this stage.
▪ The branch address (for a jump) is also
computed here.
Stage 4 : Memory (Access RAM)
▪ This stage accesses the data memory
(LDR/STR).
▪ It must come after the Execute stage
because the address can be pre-indexed.
▪ The address from the ALU calculation is
sent to the address bus.
▪ At the end of the stage, the data memory
returns/stores 32-bit data to/from the data
bus.
Stage 5 : Writeback (Write Output)
▪ In this stage, the result is written back
to the data registers (R0-R14).
▪ Results from the Execute and
Memory stages are stored in the
register selected by the
instruction (the destination register).
[Diagram: the Regfile (R0-R14), with read ports Rn, Rm, Rs (values Vn, Vm, Vs) used in Decode and the write port Vd/Vidx driven from Writeback; results flow through Decode, Execute, Memory, and Writeback.]
Data Processing Execution Process
0x00 : ADD R2, R0, R1 // R2 = R0+R1
0x04 : ORR R3, R0, #3 // R3 = R0 | 0x03
0x08 : SUB R4, R1, R7 LSR #4 // R4 = R1 - (R7>>4)
0x0C : EOR R5, R0, R6 LSL R7 // R5 = R0 ^ (R6<<R7)
ADD R2, R0, R1
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x00; Read Inst returns ADD R2, R0, R1 into the Instruction Register; PC becomes 0x04.
Stage 2 Decode: ADD R2, R0, R1 gives Rn=0, Rm=1; the Regfile returns Vn=R0, Vm=R1.
Stage 3 Execute: the ALU performs ADD on R0 and R1, producing R0+R1.
Stage 4 Memory: R0+R1 passes through unchanged.
Stage 5 Writeback: R2 = R0+R1.
ORR R3, R0, #3
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x04; Read Inst returns ORR R3, R0, #3; PC becomes 0x08.
Stage 2 Decode: Rn=0, imm=3; the Regfile returns Vn=R0.
Stage 3 Execute: the ALU performs ORR on R0 and imm=3, producing R0|3.
Stage 4 Memory: R0|3 passes through unchanged.
Stage 5 Writeback: R3 = R0|3.
SUB R4, R1, R7 LSR #4
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x08; Read Inst returns SUB R4, R1, R7 LSR #4; PC becomes 0x0C.
Stage 2 Decode: Rn=1, Rm=7, imm=4; the Regfile returns Vn=R1, Vm=R7.
Stage 3 Execute: the Shift unit computes R7>>4 (shift amount imm=4); the ALU performs SUB, producing R1-(R7>>4).
Stage 4 Memory: R1-(R7>>4) passes through unchanged.
Stage 5 Writeback: R4 = R1-(R7>>4).
EOR R5, R0, R6 LSL R7
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x0C; Read Inst returns EOR R5, R0, R6 LSL R7; PC becomes 0x10.
Stage 2 Decode: Rn=0, Rm=6, Rs=7; the Regfile returns Vn=R0, Vm=R6, Vs=R7.
Stage 3 Execute: the Shift unit computes R6<<R7 (shift amount from R7); the ALU performs EOR, producing R0^(R6<<R7).
Stage 4 Memory: R0^(R6<<R7) passes through unchanged.
Stage 5 Writeback: R5 = R0^(R6<<R7).
Memory Access Execution Process
0x10 : LDR R2, [R0,#4] // R2 = mem[R0+4]
0x14 : LDR R4, [R0], #4 // R4 = mem[R0], then R0 = R0+4
0x18 : STR R3, [R0,R2]! // mem[R0+R2] = R3, then R0 = R0+R2
0x1C : STR R5, [R1], R9 // mem[R1] = R5, then R1 = R1+R9
LDR R2, [R0,#4]
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x10; Read Inst returns LDR R2, [R0,#4]; PC becomes 0x14.
Stage 2 Decode: Rn=0, imm=4; the Regfile returns Vn=R0.
Stage 3 Execute: the ALU performs ADD on R0 and imm=4, producing the address R0+4.
Stage 4 Memory: the data memory is read at Addr = R0+4, returning Mem[R0+4].
Stage 5 Writeback: R2 = Mem[R0+4].
LDR R4, [R0], #4
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x14; Read Inst returns LDR R4, [R0], #4; PC becomes 0x18.
Stage 2 Decode: Rn=0, imm=4; the Regfile returns Vn=R0.
Stage 3 Execute: the ALU is bypassed; the address is R0 itself (post-indexing).
Stage 4 Memory: the data memory is read at Addr = R0, returning Mem[R0]; the +offset adder computes the updated base R0+4.
Stage 5 Writeback: R4 = Mem[R0] and R0 = R0+4.
STR R3, [R0,R2]!
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x18; Read Inst returns STR R3, [R0,R2]!; PC becomes 0x1C.
Stage 2 Decode: Rn=0, Rm=2, Rs=3; the Regfile returns Vn=R0, Vm=R2, and the store data Vs=R3.
Stage 3 Execute: the ALU performs ADD on R0 and R2, producing the address R0+R2.
Stage 4 Memory: R3 is written to the data memory at Addr = R0+R2 (offset 0, since the base was pre-indexed).
Stage 5 Writeback: R0 = R0+R2 (base register writeback).
STR R5, [R1], R9
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x1C; Read Inst returns STR R5, [R1], R9; PC becomes 0x20.
Stage 2 Decode: Rn=1, Rm=9, Rs=5; the Regfile returns Vn=R1, Vm=R9, and the store data Vs=R5.
Stage 3 Execute: the ALU is bypassed; the address is R1 itself (post-indexing).
Stage 4 Memory: R5 is written to the data memory at Addr = R1; the adder computes the updated base R1+R9.
Stage 5 Writeback: R1 = R1+R9.
The pipeline flow
Stage:  FE         DE         EX         ME         WB
CLK 1:  INST 0x00
CLK 2:  INST 0x04  INST 0x00
CLK 3:  INST 0x08  INST 0x04  INST 0x00
CLK 4:  INST 0x0C  INST 0x08  INST 0x04  INST 0x00
CLK 5:  INST 0x10  INST 0x0C  INST 0x08  INST 0x04  INST 0x00
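The table above follows one rule: the instruction issued in slot i (0-based) occupies stage s on clock cycle i + s + 1. A small Python sketch (illustrative names only) reproduces the CLK 5 row:

```python
# Which stage an instruction occupies on a given clock cycle, for an
# ideal 5-stage pipeline with no stalls.
STAGES = ["FE", "DE", "EX", "ME", "WB"]
insts = ["0x00", "0x04", "0x08", "0x0C", "0x10"]

def stage_of(inst_index, clk):
    s = clk - 1 - inst_index          # 0-based stage index on this cycle
    return STAGES[s] if 0 <= s < len(STAGES) else None

# On CLK 5 every instruction is in a different stage:
print([(i, stage_of(k, 5)) for k, i in enumerate(insts)])
# 0x00 is in WB, 0x04 in ME, 0x08 in EX, 0x0C in DE, 0x10 in FE
```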
The pipeline in Action
[Pipeline diagram, clock cycles 1-9: five instructions flow through FETCH, DEC, EXE, MEM, WB, each starting one cycle after the previous.]
On clock cycle 5, the pipeline contains all five of the instructions listed below, each in a different pipeline stage:
1. LDR R1, [R2]
2. ADD R2, R2, R3
3. STR R4, [R5], #4
4. SUB R0, R0, #1
5. LDR R5, [R3]
Branch Execution Process
0x20 : CMP R3, #0 // R3-0 and Update Flag
0x24 : BEQ JUMP1
0x28 : ADD R7, R7, #1
0x2C : . . . .
0x30 : . . . .
0x34 : . . . .
0x38 : JUMP1: SUB R7, R7, #1
BEQ JUMP1 (Condition Met)
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x24; Read Inst returns BEQ JUMP1; PC becomes 0x28.
Stage 2 Decode: BEQ JUMP1 is decoded; the branch target is formed from the PC and the instruction offset ((Offset*4)+4).
Stage 3 Execute: the target address 0x38 is ready.
Stage 4 Memory: the condition is evaluated (MET), so the PC is loaded with 0x38 instead of 0x28.
Next cycle: Addr = PC = 0x38; Read Inst returns SUB R7, R7, #1; PC becomes 0x3C.
BEQ JUMP1 (Condition Not Met)
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x24; Read Inst returns BEQ JUMP1; PC becomes 0x28.
Stage 2 Decode: BEQ JUMP1 is decoded; the candidate target is formed from the PC and the instruction offset ((Offset*4)+4).
Stage 3 Execute: the target address 0x38 is ready.
Stage 4 Memory: the condition is evaluated (NOT MET), so the PC keeps 0x28.
Next cycle: Addr = PC = 0x28; Read Inst returns ADD R7, R7, #1; PC becomes 0x2C.
Data Hazard
[Pipeline diagram, clock cycles 1-7:]
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC EXE MEM WB
R1 is updated by the LDR only in its WB stage, but R1 = R1+R3 is executed before the LDR's result is written back.
The result of ADD R1, R1, R3 is wrong. This is called a "Data Hazard".
Data Dependencies - True Data Dependencies
[Pipeline diagram, clock cycles 1-8, with stalls inserted:]
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated)
3. ADD R2, R2, R3               FETCH DEC STALL STALL EXE MEM WB
Instruction 2 must stall until the LDR writes R1 back; instruction 3 is delayed behind it.
Data Dependencies
True Data Dependencies
A true data dependence is also known as a Read-After-Write (RAW)
dependence.
If we imagine two instructions, i followed by j: if j uses the result
of i, we say that j is data dependent on i. This is an example of
a true data dependence (a read after a write).
This type of data dependency causes pipeline hazards.
Example (ADD R1, R1, R3 reads the R1 written by the SUB):
SUB R1, R1, #2
ADD R1, R1, R3
ADD R2, R2, R3
Data dependencies
Name dependencies may also exist when two instructions refer to
the same register. Unlike true data dependencies, they do not by
themselves cause pipeline stalls.
▪ Output dependence (write after write)
▪ Anti-dependence (write after read)
Example (R1 is written twice; R3 is read by the ADD and later written by the LDR):
SUB R1, R1, #2
ADD R1, R1, R3
LDR R3, [R2]
SUB R2, R3, #1
Data dependencies
▪ Output dependence: also known as a Write-After-Write (WAW) dependence.
▪ We need to ensure we don't reorder writes to the same register;
otherwise subsequent instructions could receive the wrong data value.
▪ This type doesn't cause a data hazard if the execution order
doesn't change.
Example (both instructions write R1):
SUB R1, R1, #2
ADD R1, R1, R3
Data dependencies
▪ Anti-dependence: also known as a Write-After-Read (WAR) dependence.
▪ We need to be careful not to overwrite a register whose current
value is still required by an earlier instruction.
▪ This type doesn't cause a data hazard if the execution order
doesn't change.
Example (R3 is read by the ADD, then written by the LDR; R2 is read by the LDR, then written by the SUB):
ADD R1, R1, R3
LDR R3, [R2]
SUB R2, R3, #1
Data Hazards
▪ Data hazards occur when an instruction depends on the result of a previous
instruction and that result has not yet been computed. Whenever two different
instructions use the same storage location, execution must appear to proceed
in sequential order.
[Pipeline diagram, with stalls inserted:]
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated)
3. ADD R2, R2, R3               FETCH DEC STALL STALL EXE MEM WB
An ideal Pipeline
In the ideal case, when our pipeline never stalls, our CPI will equal 1 (and IPC = 1).
If we need to stall the pipeline, our CPI will increase, e.g.:
If we must stall for 1 cycle for 20% of instructions, and 3 cycles for 5% of instructions, our new
CPI would be:
Pipeline CPI = ideal pipeline CPI + pipeline stalls per instruction
= 1 + 1*0.20 + 3*0.05 = 1.35
Note: to make the best use of pipelining, we should avoid stalling as much as possible. (without
increasing our clock period!) Remember:
Time = instructions executed × Clocks Per Instruction (CPI) × clock period
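The CPI arithmetic above can be written out as a tiny Python helper (`pipeline_cpi` is a made-up name for illustration):

```python
# CPI with stalls: ideal CPI plus the expected stall cycles per
# instruction, summed over each (stall_cycles, fraction) pair.
def pipeline_cpi(ideal_cpi, stalls):
    """stalls: list of (stall_cycles, fraction_of_instructions)."""
    return ideal_cpi + sum(c * f for c, f in stalls)

cpi = pipeline_cpi(1.0, [(1, 0.20), (3, 0.05)])
print(cpi)  # approximately 1.35, as on the slide

# Time = instructions executed * CPI * clock period, e.g. for
# 1M instructions at a (hypothetical) 1 ns clock:
time_s = 1_000_000 * cpi * 1e-9
```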
Hazard Elimination Methods
1. Pipeline bubbling / stall
As instructions are fetched, control logic determines whether a hazard
could or will occur. If so, the control logic inserts a bubble ("STALL")
into the pipeline. Thus, before the next instruction (which would
cause the hazard) executes, the prior one has had sufficient
time to finish, preventing the hazard.
[Pipeline diagram, with stalls inserted:]
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated)
3. ADD R2, R2, R3               FETCH DEC STALL STALL EXE MEM WB
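As a rough model of pipeline bubbling (handy for checking exercises like 5.1), the sketch below schedules decode cycles in program order, stalling until every source register has been written back. It assumes the 5-stage pipeline above with no forwarding, and that a register is readable in the same cycle its writeback happens; all names are hypothetical:

```python
# A rough scheduler for pipeline bubbling (no forwarding): every source
# register must have reached WB before the consumer's DEC may proceed,
# and instructions decode in program order.
def issue_cycles(insts):
    """insts: list of (dest_register, set_of_source_registers).
    Returns the clock cycle in which each instruction enters DEC."""
    dec_cycles = []
    avail = {}        # register -> first cycle its new value is readable
    dec = 2           # the first instruction decodes in cycle 2
    for dest, srcs in insts:
        need = max([avail.get(r, 0) for r in srcs], default=0)
        dec = max(dec, need)          # stall until all sources are ready
        dec_cycles.append(dec)
        avail[dest] = dec + 3         # EX, ME, then WB writes the result
        dec += 1                      # the next instruction decodes later
    return dec_cycles

# LDR R1,[R2] / ADD R1,R1,R3 / ADD R2,R2,R3 -> two stalls for the ADD:
print(issue_cycles([("R1", {"R2"}), ("R1", {"R1", "R3"}), ("R2", {"R2", "R3"})]))
# [2, 5, 6]
```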
Hazard Elimination Methods
2. Out-of-Order Execution
We can avoid data hazards by reordering instructions. However, the reordered
program must not break the data dependencies:
▪ True data dependencies – Read after Write (RAW)
▪ Output Dependencies – Write after Write (WAW)
▪ Anti-Dependencies – Write After Read (WAR)
Note: Read-after-Read (RAR) reordering does not affect data
dependencies.
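The three dependence types can be checked mechanically before reordering. The sketch below uses a simplified instruction model (one destination register and a set of source registers; `classify` is a hypothetical helper):

```python
# Classify the dependencies between two instructions i and j, where
# i runs before j. Each instruction is (dest_register, {source_registers}).
def classify(i, j):
    deps = set()
    if i[0] in j[1]:
        deps.add("RAW")   # true dependence: j reads what i wrote
    if i[0] == j[0]:
        deps.add("WAW")   # output dependence: both write the same register
    if j[0] in i[1]:
        deps.add("WAR")   # anti-dependence: j overwrites a register i reads
    return deps

sub = ("R1", {"R1"})          # SUB R1, R1, #2
add = ("R1", {"R1", "R3"})    # ADD R1, R1, R3
print(classify(sub, add))     # RAW, WAW and WAR are all present
```

Only a RAW dependence forces a stall; WAW and WAR merely restrict which reorderings are legal.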
Hazard Elimination Methods
3. Data forwarding
A technique to resolve data hazards by passing a value directly
from one stage back to an earlier stage, avoiding a stall.
[Forwarding-path diagram, FE DE EX ME WB: data can be forwarded from ME back to DE.]
No data forwarding (Memory), clock cycles 1-8:
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated in WB)
With data forwarding from ME to DE (Memory):
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL EXE MEM WB   (one stall: waits only for the MEM read)
No data forwarding (Execute), clock cycles 1-8:
1. MVN R1, R1        FETCH DEC EXE MEM WB
2. ADD R1, R1, #1          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated in WB)
With data forwarding (Execute):
1. MVN R1, R1        FETCH DEC EXE MEM WB
2. ADD R1, R1, #1          FETCH DEC EXE MEM WB   (no stall: the ALU result is forwarded from EX back to DE)
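The stall counts in these diagrams follow one pattern: the consumer's DEC must wait until the producer's result is available, and forwarding moves that availability to an earlier stage. A hedged Python sketch of that pattern (hypothetical names; same-cycle write-then-read assumed, as in the diagrams):

```python
# Cycle at the end of which a producer in issue slot 1 has its result,
# depending on the earliest stage the result can be forwarded from
# ('WB' means no forwarding at all).
READY = {"EX": 3, "ME": 4, "WB": 5}

def stall_cycles(gap, ready_stage):
    """gap = issue-slot distance between producer and consumer
    (1 = back to back). Returns the stall cycles the consumer needs."""
    producer_ready = READY[ready_stage]
    consumer_dec = (1 + gap) + 1      # consumer would enter DEC here
    return max(0, producer_ready - consumer_dec)

print(stall_cycles(1, "WB"))  # 2: no forwarding, wait for writeback
print(stall_cycles(1, "ME"))  # 1: load data forwarded from ME
print(stall_cycles(1, "EX"))  # 0: ALU result forwarded from EX
```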
Data forwarding Disadvantage
▪ Data forwarding reduces pipeline stalls. However, the forwarding path increases the logic
delay significantly.
▪ A badly chosen forwarding path reduces the overall clock speed because of the longer
critical path.
▪ The processor designer must compromise between data forwarding and pipeline hazards to
optimize processor performance.
[Diagram: the forwarding path between the DE and EX stages adds delay to both stages.]
Exercise 5.1
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls to avoid data hazards.
MOV R0, #0x30
MOV R1, #0x54
MOV R2, #0xAA
ADD R3, R1, R2
SUB R4, R0, #1
Exercise 5.1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
MOV R0, #0x30
MOV R1, #0x54
MOV R2, #0xAA
ADD R3, R1, R2
SUB R4, R0, #1
Exercise 5.2
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls to avoid data hazards.
MOV R0, #0x41000000
MOV R1, #0x12
MOV R2, #0x45
STRB R1, [R0], #1
STRB R2, [R0], #1
ADD R3, R3, R1
SUB R4, R4, R2
Exercise 5.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
MOV R0, #0x41000000
MOV R1, #0x12
MOV R2, #0x45
STRB R1, [R0], #1
STRB R2, [R0], #1
ADD R3, R3, R1
SUB R4, R4, R2
Exercise 5.3
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls, after reordering the instructions to avoid data hazards.
MOV R0, #0x41000000
MOV R1, #0x12
MOV R2, #0x45
STRB R1, [R0], #1
STRB R2, [R0], #1
ADD R3, R3, R1
SUB R4, R4, R2
Exercise 5.3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Control Hazard (Branching Hazard)
▪ A control hazard occurs whenever the pipeline must decide whether to
jump or not.
CMP R0, #0
BEQ ZERO
ONE: ADD R1, R1, #1
ZERO: ADD R2, R2, #1
[Flowchart: if R0==0, jump to ZERO; otherwise fall through to ONE.]
Control Hazard (Branching Hazard)
[Pipeline diagram, clock cycles 1-8:]
CASE R0==0 (branch taken):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1             FETCH STALL STALL DEC EXE MEM WB
CASE R0!=0 (branch not taken):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1              FETCH STALL STALL DEC EXE MEM WB
The condition and the jump address are only resolved in the MEM stage,
so we have to wait until then.
Control Hazard
What happens immediately after we fetch a conditional branch
instruction?
1. We must determine if the branch is taken or not.
2. If the branch is taken, we must compute and branch to the target
address.
[Flowchart: if R0==0, jump to ZERO; otherwise fall through to ONE.]
How Would We Like to Handle Branches?
- In a pipelined processor, we would like to calculate the next
value of the program counter (PC) in parallel with reading our
instruction memory.
- In practice, this is non-trivial to support as branches may not be
decoded and evaluated until later in the pipeline.
[Datapath diagram, FE DE EX ME WB: the next PC is needed already in the Fetch stage, but the branch is only decoded in Decode, and its target (+offset adder) and condition (Flag) are only resolved in the Execute/Memory stages.]
How Would We Like to Handle Branches?
How to speed up the branch instruction
• Option 1: Assume a branch is not taken.
• Option 2: Evaluate the branch earlier in the pipeline.
• Option 3: Branch prediction.
Assume the Branch Is Not Taken
[Pipeline diagram, clock cycles 1-8:]
Branch not taken (assumption correct):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1              FETCH DEC EXE MEM WB   (the next instruction is fetched before the branch decision)
Branch taken (assumption wrong):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1             FETCH FLUSH FLUSH DEC EXE MEM WB   (the wrongly fetched instructions are flushed from the pipeline)
Assume a Branch Is Not Taken
If we evaluate the branch in the execute stage, we would lose two
cycles every time we encountered a taken branch.
If we assume 20% of instructions are branches, and 60% are taken,
and an otherwise perfect CPI of 1:
CPI contribution from taken branches = 0.2*2 * 0.6 = 0.24
Our new CPI = 1.0 + branch stalls = 1.24
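The same arithmetic as a small Python helper (`cpi_with_branches` is an illustrative name):

```python
# CPI contribution of taken branches: fraction of branches, times the
# fraction of those that are taken, times the per-branch penalty.
def cpi_with_branches(base_cpi, branch_frac, taken_frac, penalty):
    return base_cpi + branch_frac * taken_frac * penalty

print(cpi_with_branches(1.0, 0.20, 0.60, 2))  # approximately 1.24
```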
Why don't we assume the branch is taken?
MOV R3, #0
LOOP  ADD R3, R3, #1
      ADD R1, R1, #2
      SUB R2, R2, #4
      CMP R3, #10
      BNE LOOP
- From the code above, the branch is taken far more often than not taken (9:1).
- So why don't we assume the branch is taken instead?
[Pipeline diagram:]
CMP R3, #10          FETCH DEC EXE MEM WB
BNE LOOP                  FETCH DEC EXE MEM WB
LOOP: SUB R1, R1, R0           FETCH STALL STALL DEC EXE MEM WB
- Because we don't know where to jump to until the offset address is calculated.
Evaluate the branch earlier in the pipeline
▪ Using the same idea as data forwarding, the branch is evaluated early,
in the DEC stage, with the Flag bypassed to it.
[Diagram, FE DE EX ME WB: the Flag is forwarded back to DE, where the branch address is calculated.]
Evaluate the branch earlier in the pipeline
[Diagram, FE DE EX ME WB: the Flag is forwarded to DE and the branch address is calculated there.]
CMP R3, #10          FETCH DEC EXE MEM WB
BNE LOOP                  FETCH DEC EXE MEM WB
LOOP: ADD R3, R3, R1           FETCH STALL DEC EXE MEM WB   (only one stall)
Evaluate the branch earlier in the pipeline
- This technique puts a longer delay into the circuit. It affects the clock speed
of the whole system and is often considered not worth it.
[Diagram, FE DE EX: Flag evaluation and branch-address calculation lengthen the Decode stage's critical path.]
Dynamic Branch Prediction
▪ Predict the branch direction before the branch is executed.
▪ Base the prediction on branch history.
▪ Start fetching instructions in the predicted direction.
Branch Prediction Buffer (BPB)
▪ A simple dynamic branch prediction algorithm is to predict a branch
taken only if it was taken the last time we executed it. We can use a
simple table of 1-bit entries to store our predictions
(0 = Not Taken, 1 = Taken), alongside the jump address.
Example: for 0x00000064 : BNE LOOP, the low-order bits of the address
0x64 = 0b01100100 index the BPB; that entry holds the prediction 1 (Taken)
and the jump address 0x0000004C.
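A 1-bit BPB can be sketched in a few lines of Python. This is a toy model: the class name, the table size, and indexing by low-order word-address bits are all assumptions for illustration:

```python
# A 1-bit branch prediction buffer: each entry remembers only the last
# outcome of the branch that maps to it (1 = taken, 0 = not taken).
class OneBitBPB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [0] * entries    # initial prediction: not taken

    def index(self, pc):
        return (pc >> 2) % self.entries   # drop the byte-offset bits

    def predict(self, pc):
        return self.table[self.index(pc)] == 1

    def update(self, pc, taken):
        self.table[self.index(pc)] = 1 if taken else 0

# A loop branch taken 9 times, then falling through once (like BNE LOOP):
bpb = OneBitBPB()
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    hits += bpb.predict(0x64) == taken
    bpb.update(0x64, taken)
print(hits, "/", len(outcomes))  # 8 / 10
```

The 1-bit scheme mispredicts exactly twice per loop run: on the first iteration (the table still says not taken) and on the loop exit.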
[Datapath diagram: the Fetch stage consults the BPB in parallel with the program memory read; the BPB supplies the predicted Jump Address and a taken/not-taken Decision that selects the next PC (predicted target or PC+4).]
1-Bit Branch Prediction (Not Taken)
0x00000040 : CMP R0, #0
0x00000044 : BEQ ZERO
0x00000048 : ONE  ADD R1, R1, #1
0x0000004C : ZERO ADD R2, R2, #1
The branch at 0x44 = 0b01000100 indexes BPB entry 0001, which holds 0 (Not Taken) and the jump address 0x0000004C.
1-Bit Branch Prediction
[Pipeline diagram, clock cycles 1-8. The next instruction is fetched from PC+4 before the prediction is resolved:]
Predicted not taken, actually not taken:
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1              FETCH DEC EXE MEM WB   (fetched from PC+4)
Predicted not taken, actually taken (prediction wrong):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1        (the wrongly fetched PC+4 instructions are flushed and fetch restarts at B_addr)
1-Bit Branch Prediction (Taken)
0x00000040 : CMP R0, #0
0x00000044 : BEQ ZERO
0x00000048 : ONE  ADD R1, R1, #1
0x0000004C : ZERO ADD R2, R2, #1
The branch at 0x44 = 0b01000100 indexes BPB entry 0001, which now holds 1 (Taken) and the jump address 0x0000004C.
1-Bit Branch Prediction
[Pipeline diagram, clock cycles 1-8. The predicted target 0x4C is fetched right after the prediction:]
Predicted taken, actually taken:
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1             FETCH DEC EXE MEM WB   (fetched from 0x4C)
Predicted taken, actually not taken (prediction wrong):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1         (the wrongly fetched 0x4C instructions are flushed and fetch restarts at PC+4)
Exercise 5.4
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls, using the branch-not-taken method to avoid the branch hazard.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x2
BNE B1
NOP
Exercise 5.5
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls, using branch prediction to avoid the branch hazard.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x2
BNE B1
NOP
Exercise 5.6
▪ Calculate the stall cycles caused by the B1 branch in Exercise 5.4.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x10
BNE B1
NOP
Exercise 5.7
▪ Calculate the prediction success rate (hit rate) of the B1 branch in this
program, assuming the initial prediction is Not-Taken, and calculate the
stall cycles caused by the B1 branch in Exercise 5.5.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x10
BNE B1
NOP
More Related Content

Similar to Maximizing CPU Efficiency: A Comprehensive Exploration of Pipelining in Computer Architecture.pdf (20)

PPTX
PROCESSOR AND CONTROL UNIT
Amirthavalli Senthil
 
PPT
basic-processing-unit computer organ.ppt
ssuser702532
 
PPT
Computer Organization for third semester Vtu SyllabusModule 4.ppt
ShilpaKc3
 
PPT
chapter3 - Basic Processing Unit.ppt
PoliceNiranjanReddy
 
PDF
Pipeline and data hazard
Waed Shagareen
 
PPTX
Multicycle Datapath and Control: Enhancing CPU Efficiency through Sequential ...
khadeeja786khadeeja1
 
PPT
Control unit-implementation
WBUTTUTORIALS
 
PDF
Basic_Processing_Unit.pdf
UmamaheswariV4
 
PDF
BCS302-DDCO-basic processing unit-Module 5- VTU 2022 scheme-DDCO-pdf
ciyamala kushbu
 
PPT
Unit2 control unit
Ashim Saha
 
PPT
UNIT-3(Processor Organization)_computer organization.ppt
anushachalla14
 
PPT
chapter3 - Basic Processing base Unit.ppt
meghaasha6700
 
PPTX
heuring_jordan_----------ch02(1) (1).pptx
christinamary2620
 
PPT
Basic processing unit by aniket bhute
Aniket Bhute
 
PPTX
Pipelining And Vector Processing
TheInnocentTuber
 
PDF
Computer Organization
JasmineShaik14
 
PDF
Pipelining
PriyanshiPanwar14
 
PDF
COMPUTER ORGANIZATION NOTES Unit 7
Dr.MAYA NAYAK
 
PDF
310471266 chapter-7-notes-computer-organization
srinoni
 
PPTX
BASICS OF MIPS ARCHITECTURE AND THEIR INSTRUCTION SET
Saritha Sri
 
PROCESSOR AND CONTROL UNIT
Amirthavalli Senthil
 
basic-processing-unit computer organ.ppt
ssuser702532
 
Computer Organization for third semester Vtu SyllabusModule 4.ppt
ShilpaKc3
 
chapter3 - Basic Processing Unit.ppt
PoliceNiranjanReddy
 
Pipeline and data hazard
Waed Shagareen
 
Multicycle Datapath and Control: Enhancing CPU Efficiency through Sequential ...
khadeeja786khadeeja1
 
Control unit-implementation
WBUTTUTORIALS
 
Basic_Processing_Unit.pdf
UmamaheswariV4
 
BCS302-DDCO-basic processing unit-Module 5- VTU 2022 scheme-DDCO-pdf
ciyamala kushbu
 
Unit2 control unit
Ashim Saha
 
UNIT-3(Processor Organization)_computer organization.ppt
anushachalla14
 
chapter3 - Basic Processing base Unit.ppt
meghaasha6700
 
heuring_jordan_----------ch02(1) (1).pptx
christinamary2620
 
Basic processing unit by aniket bhute
Aniket Bhute
 
Pipelining And Vector Processing
TheInnocentTuber
 
Computer Organization
JasmineShaik14
 
Pipelining
PriyanshiPanwar14
 
COMPUTER ORGANIZATION NOTES Unit 7
Dr.MAYA NAYAK
 
310471266 chapter-7-notes-computer-organization
srinoni
 
BASICS OF MIPS ARCHITECTURE AND THEIR INSTRUCTION SET
Saritha Sri
 

Recently uploaded (20)

PPTX
原版一样(Acadia毕业证书)加拿大阿卡迪亚大学毕业证办理方法
Taqyea
 
PPTX
UNIT DAA PPT cover all topics 2021 regulation
archu26
 
PPTX
Break Statement in Programming with 6 Real Examples
manojpoojary2004
 
PDF
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
PDF
International Journal of Information Technology Convergence and services (IJI...
ijitcsjournal4
 
PPTX
ISO/IEC JTC 1/WG 9 (MAR) Convenor Report
Kurata Takeshi
 
PPTX
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
PDF
Unified_Cloud_Comm_Presentation anil singh ppt
anilsingh298751
 
PPTX
MPMC_Module-2 xxxxxxxxxxxxxxxxxxxxx.pptx
ShivanshVaidya5
 
PPTX
NEUROMOROPHIC nu iajwojeieheueueueu.pptx
knkoodalingam39
 
PDF
MOBILE AND WEB BASED REMOTE BUSINESS MONITORING SYSTEM
ijait
 
PDF
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
PPTX
site survey architecture student B.arch.
sri02032006
 
PPTX
EC3551-Transmission lines Demo class .pptx
Mahalakshmiprasannag
 
PPTX
Electron Beam Machining for Production Process
Rajshahi University of Engineering & Technology(RUET), Bangladesh
 
PPTX
Thermal runway and thermal stability.pptx
godow93766
 
PDF
BioSensors glucose monitoring, cholestrol
nabeehasahar1
 
PPTX
Benefits_^0_Challigi😙🏡💐8fenges[1].pptx
akghostmaker
 
PPTX
REINFORCEMENT AS CONSTRUCTION MATERIALS.pptx
mohaiminulhaquesami
 
PDF
ARC--BUILDING-UTILITIES-2-PART-2 (1).pdf
IzzyBaniquedBusto
 
原版一样(Acadia毕业证书)加拿大阿卡迪亚大学毕业证办理方法
Taqyea
 
UNIT DAA PPT cover all topics 2021 regulation
archu26
 
Break Statement in Programming with 6 Real Examples
manojpoojary2004
 
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
International Journal of Information Technology Convergence and services (IJI...
ijitcsjournal4
 
ISO/IEC JTC 1/WG 9 (MAR) Convenor Report
Kurata Takeshi
 
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
Unified_Cloud_Comm_Presentation anil singh ppt
anilsingh298751
 
MPMC_Module-2 xxxxxxxxxxxxxxxxxxxxx.pptx
ShivanshVaidya5
 
NEUROMOROPHIC nu iajwojeieheueueueu.pptx
knkoodalingam39
 
MOBILE AND WEB BASED REMOTE BUSINESS MONITORING SYSTEM
ijait
 
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
site survey architecture student B.arch.
sri02032006
 
EC3551-Transmission lines Demo class .pptx
Mahalakshmiprasannag
 
Electron Beam Machining for Production Process
Rajshahi University of Engineering & Technology(RUET), Bangladesh
 
Thermal runway and thermal stability.pptx
godow93766
 
BioSensors glucose monitoring, cholestrol
nabeehasahar1
 
Benefits_^0_Challigi😙🏡💐8fenges[1].pptx
akghostmaker
 
REINFORCEMENT AS CONSTRUCTION MATERIALS.pptx
mohaiminulhaquesami
 
ARC--BUILDING-UTILITIES-2-PART-2 (1).pdf
IzzyBaniquedBusto
 
Ad

Maximizing CPU Efficiency: A Comprehensive Exploration of Pipelining in Computer Architecture.pdf

  • 1. Asst. Prof. Dr.Sumek Wisayataksin Computer Architectures Chapter 5 : Pipelining Processor
  • 2. Increasing Process Speed ▪ Cooking Process Process 1 : Material Preparing Meat/Veggie Chop 5 mins Process 3 : Garnishing Food 5 mins Process 4 : Serving 3 mins Process 2 : Cooking 10 mins It takes 23 mins to finish 1 dish !!!
  • 3. Parallel Processing 3 times speed up Average 23/3 = 7.67 min/dish Chef 1 Chef 2 Chef 3
  • 4. Parallel Processing Limitation ▪ The processors (Chefs) must be an expert of all process from 1-4. ▪ You need to have many expert chefs to speed up the process. ▪ Every processor (Chef) must have the same skills
  • 5. Pipelining Process Time Dish2 Dish1 Dish3 Dish4 10 mins 10 mins 10 mins 10 mins 10 mins 10 mins 4 Processes Average 10 mins/dish
  • 6. Balancing Pipeline stages ▪ Since the bottleneck is Process 2 (10 mins), we can break it into 2 subprocess Time Dish2 Dish1 Dish3 Dish4 Dish5 5 mins 5 Processes Average 5 mins/dish
  • 7. What is pipelining ▪ Pipelining is a set of data processing elements connected in series, where the output of one element is the input of the next one. ▪ The elements of a pipeline are often executed in parallel or in time-sliced fashion. ▪ Some amount of buffer storage is often inserted between elements.
  • 8. Pipelining and Maintaining Correctness • We can break the execution of instructions into stages and overlap the execution of different instructions. • We need to ensure that the results produced by our new pipelined processor are no different to the unpipelined one. • What sort of situations could cause problems once we pipeline our processor?
  • 9. Basic Processor Architecture Program Counter (PC) R E G 0 R E G N …… C T R L Data bus Address bus ALU Data Registers I R Z C N V FLAG Control Bus
  • 10. 5-stage Pipelined Processor PC + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data Stage 1 Fetch Stage 2 Decode Stage 3 Execute Stage 4 Memory Stage 5 Writeback #imm Signals from CTRL Box
  • 11. Stage 1 : Fetch (Read Instructions) • The next instruction is fetched from the memory address that is currently stored in the program counter (PC) and stored into the instruction register. • At the end of the fetch operation, the PC points to the next instruction that will be read at the next cycle. • The next instruction might be from branch address.
  • 12. Stage 1 : Fetch 1.The CPU sends the contents of the PC to address bus and sends a read command on the control bus 2.In response to the read command (with address equal to PC), the program memory returns the data stored at the memory location indicated by the PC on the data bus 3.The CPU copies the data from the data bus into its Instruction Register (IR) 4.The PC is incremented, so that it points to the next instruction. This step prepares the CPU for the next cycle.
  • 13. Stage 2 : Decode (Inst Interpretation) ▪ During this stage, the encoded instruction presented in the instruction register is interpreted by the decoder. ▪ Register values and immediate constants are also read in this stage
  • 14. Stage 3 : Execute (Calculation) ▪ The CPU sends the decoded instruction as a set of control signals to the corresponding computer components. ▪ If the instruction involves arithmetic or logic, the ALU is utilized and data to be computed. ▪ This is the only stage of the instruction cycle that is compute data. Flags are updated in this stage. ▪ The branch address (for jump) is also processed here.
  • 15. Stage 4 : Memory (Access RAM) ▪ This stage access the data memory (LDR/STR) ▪ This process must be after Execute stage because address can be pre-indexed. ▪ The address from ALU calculation will be sent to the address bus. ▪ At the end of stage, the data memory return/store 32-bit data to/from the data bus
  • 16. Stage 5 : Writeback (Write Output) ▪ This stage, it writes the data register (R0-R12). ▪ Results from Execute and Memory stage are stored in the register selected by the instruction (Destination register) Regfile (R0-R14) Rn Rm Vn Vm Rs Vs Vd Vidx Decode Execute Memory Writeback
  • 17. Data Processing Execution Process 0x00 : ADD R2, R0, R1 // R2 = R0+R1 0x04 : ORR R3, R0, #3 // R3 = R0 | 0x03 0x08 : SUB R4, R1, R7 LSR #4 // R4 = R1 – (R7>>4) 0x0C : EOR R5, R0, R6 LSL R7 // R5 = R0 ^ (R6<<R7) PC Instruction
  • 18. ADD R2, R0, R1 + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x00 0x00 ADD R2, R0, R1 0x04 0x00 Stage 3 Execute R0 R1 ADD R0+R1 Stage 4 Memory R0+R1 Stage 2 Decode Rn=0 Rm=1 ADD R2, R0, R1 Vn=R0 Vm=R1 Stage 5 Writeback R0+R1 R2=R0+R1 PC
  • 19. ORR R3, R0, #3 + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x04 0x04 ORR R3, R0, #3 0x08 0x04 Stage 3 Execute R0 imm=3 ORR R0|3 Stage 4 Memory R0|3 Stage 2 Decode Rn=0 imm=3 ORR R3, R0, #3 Vn=R0 Stage 5 Writeback R0|3 R3=R0|3 PC
  • 20. SUB R4, R1, R7 LSR #4 + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x08 0x08 SUB R4, R1, R7 LSR #4 0x0C 0x08 Stage 4 Memory R1-(R7>>4) Stage 5 Writeback R1-(R7>>4) R4=R1-(R7>>4) Stage 2 Decode Rn=1 Rm=7 SUB R4, R1, R7 LSR #4 Vn=R1 imm=4 Vm=R7 Stage 3 Execute R1 imm=4 SUB R1-(R7>>4) R7 R7>>4 PC
  • 21. EOR R5, R0, R6 LSL R7 + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x0C 0x0C EOR R5, R0, R6 LSL R7 0x10 0x0C Stage 4 Memory R0^(R6<<R7) Stage 5 Writeback R0^(R6<<R7) R5=R0^(R6<<R7) Stage 3 Execute R0 R7 EOR R0^(R6<<R7) R6 R6<<R7 Stage 2 Decode Rn=0 Rm=6 EOR R5, R0, R6 LSL R7 Vn=R0 Rs=7 Vm=R6 Vs=R7 PC
  • 22. Memory Access Execution Process 0x10 : LDR R2, [R0,#4] // R2 = mem[R0+4] 0x14 : LDR R4, [R0], #4 // R4 = mem[R0] R0 = R0+4 0x18 : STR R3, [R0,R2]! // mem[R0+R2] = R3 R0 = R0+R2 0x1C : STR R5, [R1], R9 // mem[R1] = R5 R1 = R1+R9 PC Instruction
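The three index modes listed above differ only in which address is used and when the base register is updated. The sketch below is illustrative (the mode names and function signature are my own, not part of any real ISA model):

```python
# Hedged sketch of the LDR/STR index modes from the slide above.
def access(mode, regs, rn, offset):
    """Return the memory address used, updating the base register as needed."""
    if mode == "pre":            # LDR R2, [R0, #4]  -> address = R0+4, R0 unchanged
        return regs[rn] + offset
    if mode == "pre_wb":         # STR R3, [R0, R2]! -> address = R0+R2, R0 = R0+R2
        regs[rn] += offset
        return regs[rn]
    if mode == "post":           # LDR R4, [R0], #4  -> address = R0, then R0 = R0+4
        addr = regs[rn]
        regs[rn] += offset
        return addr
    raise ValueError(mode)

regs = {"R0": 0x100}
print(hex(access("pre", regs, "R0", 4)), hex(regs["R0"]))    # 0x104 0x100
print(hex(access("post", regs, "R0", 4)), hex(regs["R0"]))   # 0x100 0x104
```

Note that pre-indexing is why the Memory stage must follow Execute: the effective address has to be computed by the ALU first.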
  • 23. LDR R2, [R0,#4] + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x10 0x10 LDR R2, [R0,#4] 0x14 0x10 Stage 3 Execute R0 imm=4 ADD R0+4 Stage 2 Decode Rn=0 imm=4 LDR R2, [R0,#4] Vn=R0 Stage 5 Writeback Mem[R0+4] R2=Mem[R0+4] Stage 4 Memory Mem[R0+4] R0+4 PC
  • 24. LDR R4, [R0], #4 + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x14 0x14 LDR R4, [R0],#4 0x18 0x14 Stage 3 Execute R0 imm=4 BYPASS R0 Stage 2 Decode Rn=0 imm=4 LDR R4, [R0],#4 Vn=R0 Stage 4 Memory Mem[R0] R0 R0+4 Stage 5 Writeback Mem[R0] R4=Mem[R0] R0+4 R0=R0+4 PC
  • 25. STR R3, [R0,R2]! + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x18 0x18 STR R3, [R0,R2]! 0x1C 0x18 Stage 5 Writeback R0+R2 R0=R0+R2 Stage 2 Decode Rn=0 Rm=2 STR R3, [R0,R2]! Vn=R0 Rs=3 Vm=R2 Vs=R3 Stage 3 Execute R0 R3 ADD R0+R2 R2 Stage 4 Memory R3 R0+R2 R0+R2 0 PC
  • 26. STR R5, [R1], R9 + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x1C 0x1C STR R5, [R1], R9 0x20 0x1C Stage 5 Writeback R1+R9 R1=R1+R9 Stage 2 Decode Rn=1 Rm=9 STR R5, [R1], R9 Vn=R1 Rs=5 Vm=R9 Vs=R5 Stage 3 Execute R1 R5 BYPASS R1 R9 Stage 4 Memory R5 R1 R1+R9 R9 PC
  • 27. The pipeline flow FE DE EX ME WB CLK 1 INST 0x00 CLK 2 INST 0x04 INST 0x00 CLK 3 INST 0x08 INST 0x04 INST 0x00 CLK 4 INST 0x0C INST 0x08 INST 0x04 INST 0x00 CLK 5 INST 0x10 INST 0x0C INST 0x08 INST 0x04 INST 0x00
  • 28. The pipeline in Action Clock Cycle 1 2 3 4 5 6 7 8 9 FETCH FETCH FETCH FETCH EXE DEC DEC DEC DEC EXE EXE EXE MEM MEM MEM MEM WB WB WB WB FETCH DEC EXE MEM WB On clock cycle 5, the pipeline contains all the instructions listed in different pipeline stages 1. LDR R1, [R2] 2. ADD R2, R2, R3 3. STR R4, [R5], #4 4. SUB R0, R0, #1 5. LDR R5, [R3]
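The staircase pattern in the two slides above can be generated mechanically: in an ideal 5-stage pipeline, instruction i occupies stage (cycle − 1 − i). A small sketch (stage names and layout are my own shorthand):

```python
# Print the stage occupied by each instruction in each clock cycle of an
# ideal (no-stall) 5-stage pipeline, matching the tables on the slides.
STAGES = ["FE", "DE", "EX", "ME", "WB"]

def pipeline_diagram(n_inst, n_cycles):
    rows = []
    for i in range(n_inst):
        row = []
        for cycle in range(1, n_cycles + 1):
            stage = cycle - 1 - i          # stage index for instruction i
            row.append(STAGES[stage] if 0 <= stage < 5 else "..")
        rows.append(" ".join(row))
    return rows

for line in pipeline_diagram(5, 9):
    print(line)
```

On clock cycle 5 every stage is busy: instruction 1 is in WB while instruction 5 is in FE, exactly as the slide shows.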
  • 29. Branch Execution Process 0x20 : CMP R3, #0 // R3-0 and Update Flag 0x24 : BEQ JUMP1 0x28 : ADD R7, R7, #1 0x2c : . . . . 0x30 : . . . . 0x34 : . . . . 0x38 : JUMP1: SUB R7, R7, #1 PC Instruction
  • 30. BEQ JUMP1 (Condition Meet) + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x24 0x24 BEQ JUMP1 0x28 0x24 Stage 2 Decode BEQ JUMP1 0x28 (Offset*4)+4 Stage 3 Execute 0x38 PC Stage 4 Memory MEET 0x38 0x38 MEET 0x38 0x28 0x38 SUB R7, R7, #1 0x3C
  • 31. BEQ JUMP1 (Condition Not Meet) + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x24 0x24 BEQ JUMP1 0x28 0x24 Stage 2 Decode BEQ JUMP1 0x28 (Offset*4)+4 Stage 3 Execute 0x38 PC Stage 4 Memory NOT MEET 0x38 0x38 NOT MEET 0x28 0x28 0x28 ADD R7, R7, #1 0x2C
  • 32. Data Hazard Clock Cycle 1 2 3 4 5 6 7 FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 is updated here R1 = R1+R3 is executed before the LDR result is written back The result of ADD R1, R1, R3 is wrong. This is called a “Data Hazard”
  • 33. Data Dependencies - True Data Dependencies Clock Cycle 1 2 3 4 5 6 7 8 FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 updated here STALL STALL Wait R1 Updated FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 3. ADD R1, R1, R3 R1 updated here STALL Wait R1 Updated 2. ADD R2, R2, R3 FETCH DEC EXE MEM WB 3. ADD R2, R2, R3 FETCH DEC STALL EXE MEM STALL
  • 34. Data Dependencies True Data Dependencies True data dependence (green arrow), also known as a Read-After-Write (RAW) dependence. If we imagine two instructions, i followed by j: if j uses the result of i, we say that j is data dependent on i. This is an example of a true data dependence (a read after a write). This type of data dependency causes pipeline hazards SUB R1, R1, #2 ADD R1, R1, R3 ADD R2, R2, R3 Write Read
  • 35. Data dependencies Name dependencies may also exist when two instructions refer to the same register. Unlike true data dependencies, they do not cause pipeline stalls. ▪ Output dependence (red arrow) ▪ Anti-dependence (green arrow) SUB R1, R1, #2 ADD R1, R1, R3 LDR R3, [R2] SUB R2, R3, #1 Write Write
  • 36. Data dependencies ▪ Output dependence (red arrow), also known as a Write-After-Write (WAW) dependence ▪ We need to ensure we don't reorder writes to the same register; otherwise subsequent instructions could receive the wrong data value. ▪ This type doesn't cause a data hazard if the execution order doesn't change SUB R1, R1, #2 ADD R1, R1, R3 Write Write
  • 37. Data dependencies ▪ Anti-dependence (green arrows), also known as a Write-After-Read (WAR) dependence ▪ We need to be careful not to overwrite a register whose current value is still required by an earlier instruction. ▪ This type doesn't cause a data hazard if the execution order doesn't change ADD R1, R1, R3 LDR R3, [R2] SUB R2, R3, #1 Write Read Write Read
  • 38. Data Hazards ▪ Data hazards occur when an instruction depends on the result of a previous instruction and that result has not yet been computed. Whenever two different instructions use the same storage location, it must appear as if they executed in sequential order. FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 updated here STALL STALL Wait R1 Updated 3. ADD R2, R2, R3 FETCH DEC STALL EXE MEM STALL
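The RAW condition described above is simple to state as a check. In this illustrative sketch each instruction is reduced to a (destination register, source registers) pair; the representation is my own, not part of any decoder:

```python
# Check for a RAW hazard between two adjacent instructions: the second
# instruction reads a register that the first one writes.
def raw_hazard(first, second):
    """Each argument is (dest_register, [source_registers])."""
    dest, _ = first
    _, srcs = second
    return dest in srcs

ldr = ("R1", ["R2"])          # LDR R1, [R2]
add = ("R1", ["R1", "R3"])    # ADD R1, R1, R3
print(raw_hazard(ldr, add))   # True
```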
  • 39. An ideal Pipeline In the ideal case, when our pipeline never stalls, our CPI will equal 1 (and IPC = 1). If we need to stall the pipeline, our CPI will increase, e.g.: If we must stall for 1 cycle for 20% of instructions, and 3 cycles for 5% of instructions, our new CPI would be: Pipeline CPI = ideal pipeline CPI + pipeline stalls per instruction = 1 + 1*0.20 + 3*0.05 = 1.35 Note: to make the best use of pipelining, we should avoid stalling as much as possible. (without increasing our clock period!) Remember: Time = instructions executed × Clocks Per Instruction (CPI) × clock period
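The CPI arithmetic on the slide above can be computed directly:

```python
# Pipeline CPI = ideal pipeline CPI + pipeline stalls per instruction.
ideal_cpi = 1.0
stalls_per_inst = 1 * 0.20 + 3 * 0.05   # 1-cycle stall for 20% of instructions,
                                        # 3-cycle stall for 5% of instructions
cpi = ideal_cpi + stalls_per_inst
print(round(cpi, 2))                    # 1.35
```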
  • 40. Hazard Elimination Methods 1. Pipeline bubbling / Stall As instructions are fetched, control logic determines whether a hazard could/will occur. If so, the control logic inserts a “STALL” into the pipeline. Thus, before the next instruction (which would cause the hazard) executes, the prior one will have had sufficient time to finish and prevent the hazard. FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 updated here STALL STALL Wait R1 Updated 3. ADD R2, R2, R3 FETCH DEC STALL EXE MEM STALL
  • 41. Hazard Elimination Methods 2. Out-of-Order Execution We can avoid data hazards by reordering instructions. However, the reordered program must not break the data dependencies: ▪ True data dependencies – Read After Write (RAW) ▪ Output dependencies – Write After Write (WAW) ▪ Anti-dependencies – Write After Read (WAR) Note – Read-After-Read (RAR) reordering does not affect data dependencies.
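A reorder is legal only if swapping the two instructions breaks none of the three dependence types just listed. A hedged sketch, using the same (destination, sources) shorthand as before (the representation is illustrative, not a real scheduler):

```python
# Legality check for swapping instruction i (earlier) with j (later):
# the swap must not violate RAW, WAW, or WAR dependences between them.
def can_reorder(i, j):
    """Each instruction is (dest_register, [source_registers])."""
    di, si = i
    dj, sj = j
    raw = di in sj          # j reads what i writes  (true dependence)
    waw = di == dj          # both write the same register (output dependence)
    war = dj in si          # j writes what i reads  (anti-dependence)
    return not (raw or waw or war)

print(can_reorder(("R1", ["R2"]), ("R4", ["R5", "R6"])))   # True (independent)
print(can_reorder(("R1", ["R2"]), ("R3", ["R1"])))         # False (RAW)
```

RAR is deliberately absent from the check: two reads of the same register can always be swapped.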
  • 42. Hazard Elimination Methods 3. Data forwarding A technique to resolve data hazards by passing a value directly from one stage to an earlier stage in order to avoid stalling. FE DE EX ME WB Data forwarding from ME to DE
  • 43. No Data forwarding (Memory) FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 updated here STALL STALL Wait R1 Updated FE DE EX ME WB Clock Cycle 1 2 3 4 5 6 7 8
  • 44. Data forwarding (Memory) FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 STALL Wait MEM Read Clock Cycle 1 2 3 4 5 6 7 8 FE DE EX ME WB Data forwarding from ME to DE
  • 45. No Data forwarding (Execute) FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. MVN R1, R1 2. ADD R1, R1, #1 R1 updated here STALL STALL Wait R1 Updated FE DE EX ME WB Clock Cycle 1 2 3 4 5 6 7 8
  • 46. Data forwarding (Execute) FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. MVN R1, R1 2. ADD R1, R1, #1 FE DE EX ME WB Data forwarding from EX to DE
  • 47. Data forwarding Disadvantage ▪ Data forwarding reduces pipeline stalls. However, the forwarding path increases the logic delay significantly. ▪ A poorly chosen forwarding path will reduce the overall clock speed due to the longer path. ▪ Processor designers must compromise between data forwarding and pipeline hazards to optimize processor performance. DE EX DE EX Delay Delay Delay
  • 48. Exercise 5.1 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall to avoid data hazard. MOV R0, #0x30 MOV R1, #0x54 MOV R2, #0xAA ADD R3, R1, R2 SUB R4, R0, #1
  • 49. Exercise 5.1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 MOV R0, #0x30 MOV R1, #0x54 MOV R2, #0xAA ADD R3, R1, R2 SUB R4, R0, #1
  • 50. Exercise 5.2 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall to avoid data hazard. MOV R0, #0x41000000 MOV R1, #0x12 MOV R2, #0x45 STRB R1, [R0], #1 STRB R2, [R0], #1 ADD R3, R3, R1 SUB R4, R4, R2
  • 51. Exercise 5.2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 MOV R0, #0x41000000 MOV R1, #0x12 MOV R2, #0x45 STRB R1, [R0], #1 STRB R2, [R0], #1 ADD R3, R3, R1 SUB R4, R4, R2
  • 52. Exercise 5.3 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall after reordering instructions to avoid data hazard. MOV R0, #0x41000000 MOV R1, #0x12 MOV R2, #0x45 STRB R1, [R0], #1 STRB R2, [R0], #1 ADD R3, R3, R1 SUB R4, R4, R2
  • 54. Exercise 5.3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
  • 55. Control Hazard (Branching Hazard) ▪ A control hazard occurs whenever the pipeline must decide whether to jump or not. CMP R0, #0 BEQ ZERO ONE: ADD R1, R1, #1 ZERO: ADD R2, R2, #1 (R0==0) Jump to ZERO Jump to ONE Yes No
  • 56. Control Hazard (Branching Hazard) FETCH FETCH EXE DEC EXE MEM MEM WB WB 1. CMP R0, #0 2. BEQ ZERO 3. ZERO ADD R2, R2, #1 FETCH Clock Cycle 1 2 3 4 5 6 7 8 CASE R0==0 1. CMP R0, #0 2. BEQ ZERO 3. ONE ADD R1, R1, #1 CASE R0!=0 STALL STALL DEC EXE MEM DEC FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH STALL STALL DEC EXE MEM DEC The condition as well as the jump address is resolved in the MEM stage, so we have to wait until that happens.
  • 57. Control Hazard What happens immediately after we fetch a conditional branch instruction? 1. We must determine if the branch is taken or not. 2. If the branch is taken, we must compute and branch to the target address. (R0==0) Jump to ZERO Jump to ONE Yes No
  • 58. How Would We Like to Handle Branches? - In a pipelined processor, we would like to calculate the next value of the program counter (PC) in parallel with reading our instruction memory. - In practice, this is non-trivial to support as branches may not be decoded and evaluated until later in the pipeline. FE DE EX ME WB
  • 59. PC + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU + +offset Flag Vd Vidx Addr Read Inst Stage 1 Fetch Stage 2 Decode Stage 3 Execute Stage 4 Memory #imm How Would We Like to Handle Branches?
  • 60. How to speed up the branch instruction • Option 1: Assume a branch is not taken. • Option 2: Evaluate the branch earlier in the pipeline. • Option 3: Branch prediction.
  • 61. Assume the Branch Is Not Taken FETCH FETCH EXE DEC EXE MEM MEM WB WB 1. CMP R0, #0 2. BEQ ZERO 3. ONE ADD R1, R1, #1 FETCH Clock Cycle 1 2 3 4 5 6 7 8 Fetch the next Instruction before branch decision 1. CMP R0, #0 2. BEQ ZERO 3. ZERO ADD R2, R2, #1 If Branch is taken, Flush the pipeline DEC EXE MEM DEC FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH FLUSH FLUSH DEC EXE MEM DEC
  • 62. Assume a Branch Is Not Taken If we evaluate the branch in the execute stage, we lose two cycles every time we encounter a taken branch. If we assume 20% of instructions are branches, 60% of branches are taken, and an otherwise perfect CPI of 1: CPI contribution from taken branches = 0.2 × 2 × 0.6 = 0.24 Our new CPI = 1.0 + branch stalls = 1.24
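The branch-penalty arithmetic above, computed directly:

```python
# CPI contribution of taken branches under "assume not taken":
# 2 lost cycles per taken branch, 20% branches, 60% of them taken.
branch_frac, taken_frac, penalty = 0.20, 0.60, 2
cpi = 1.0 + penalty * branch_frac * taken_frac
print(round(cpi, 2))   # 1.24
```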
  • 63. Why don’t we assume the branch is taken? MOV R3, #0 LOOP ADD R3, R3, #1 ADD R1, R1, #2 SUB R2, R2, #4 CMP R3, #10 BNE LOOP CMP R3, #10 BNE LOOP LOOP SUB R1, R1, R0 FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH STALL STALL DEC EXE MEM DEC We don’t know where to jump to until the offset address is calculated. - From the code on the left, the branch is taken far more often than not (9:1). - So why don’t we assume the branch is taken instead?
  • 64. Evaluate the branch earlier in the pipeline ▪ The same idea as “data forwarding”: the branch is evaluated early, in the DEC stage, and the flag is bypassed FE DE EX ME WB Flag Calculate Branch Address
  • 65. Evaluate the branch earlier in the pipeline FE DE EX ME WB Flag Calculate CMP R3, #10 BNE LOOP LOOP ADD R3, R3, R1 FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH STALL DEC EXE MEM DEC Branch Address
  • 66. Evaluate the branch earlier in the pipeline - This technique causes a longer delay in the circuit. It affects the clock speed of the whole system and is often considered not worthwhile. FE DE EX Flag Calculate Branch Address
  • 67. Dynamic Branch Prediction ▪ Predict the branch direction before the branch is executed. ▪ Base the prediction on the branch history. ▪ Start fetching instructions in the predicted direction.
  • 68. Branch Prediction Buffer (BPB) ▪ A simple dynamic branch prediction algorithm is to predict a branch taken only if it was taken the last time we executed it. We can use a simple table of 1-bit entries to store our predictions 0x00000064 : BNE LOOP 0x64 = 0b01100100 0001 1 * * BPB 0x0000004C * * Jump Address 0 = Not Taken 1 = Taken
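The 1-bit BPB above can be sketched as a small table indexed by low-order PC bits. This is an illustrative model: the table size and exact index bits are assumptions (the slide selects a 4-bit field from the branch address; real designs vary).

```python
# Hedged sketch of a 1-bit Branch Prediction Buffer: each entry stores
# whether the branch at that (hashed) PC was taken last time.
class OneBitBPB:
    def __init__(self, bits=4):
        self.table = [0] * (1 << bits)   # 0 = predict not taken
        self.mask = (1 << bits) - 1

    def index(self, pc):
        return (pc >> 2) & self.mask     # drop word-alignment bits (assumed scheme)

    def predict(self, pc):
        return bool(self.table[self.index(pc)])

    def update(self, pc, taken):
        self.table[self.index(pc)] = int(taken)

bpb = OneBitBPB()
pc = 0x64                                # BNE LOOP at 0x00000064
print(bpb.predict(pc))                   # False (initial prediction: not taken)
bpb.update(pc, True)
print(bpb.predict(pc))                   # True (it was taken last time)
```

Note the weakness a 1-bit scheme has on loops: it mispredicts twice per loop, once on loop exit and once on the next loop entry.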
  • 69. PC + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU + +offset Flag Vd Vidx Addr Read Inst Stage1 Fetch Stage2 Decode Stage3 Execute Stage4 Memory #imm How Would We Like to Handle Branches? BPB Jump Add Decision
  • 70. 1-Bit Branch Prediction (Not Taken) 0x00000040 : CMP R0, #0 0x00000044 : BEQ ZERO 0x00000048 : ONE ADD R1, R1, #1 0x0000004C : ZERO ADD R2, R2, #1 0x44 = 0b01000100 0001 0 = Not Taken 1 = Taken 0 * * BPB 0x0000004C * * Jump Address
  • 71. 1-Bit Branch Prediction FETCH FETCH EXE DEC EXE MEM MEM WB WB 1. CMP R0, #0 2. BEQ ZERO 3. ONE ADD R1, R1, #1 FETCH Clock Cycle 1 2 3 4 5 6 7 8 Fetch the next Instruction before the prediction DEC EXE MEM DEC 1. CMP R0, #0 2. BEQ ZERO 3. ZERO ADD R2, R2, #1 If Prediction is Wrong FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH DEC EXE MEM DEC PC+4 B_addr FETCH DEC PC+4
  • 72. 1-Bit Branch Prediction (Taken) 0x00000040 : CMP R0, #0 0x00000044 : BEQ ZERO 0x00000048 : ONE ADD R1, R1, #1 0x0000004C : ZERO ADD R2, R2, #1 0x44 = 0b01000100 0001 0 = Not Taken 1 = Taken 1 * * BPB 0x0000004C * * Jump Address
  • 73. 1-Bit Branch Prediction FETCH FETCH EXE DEC EXE MEM MEM WB WB 1. CMP R0, #0 2. BEQ ZERO 3. ZERO ADD R2, R2, #1 Clock Cycle 1 2 3 4 5 6 7 8 Fetch the next Instruction after the prediction DEC 1. CMP R0, #0 2. BEQ ZERO 3. ONE ADD R1, R1, #1 If Prediction is Wrong FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH DEC EXE MEM DEC FETCH 0x4C PC+4 EXE MEM WB DEC FETCH 0x4C DEC
  • 74. Exercise 5.4 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall using Branch Not-Taken method to avoid branch hazard. MOV R0, #0x41000000 MOV R1, #0x0 MOV R2, #0x82 B1 ADD R1, R1, #1 ADD R3, R3, R0 SUB R4, R4, R2 CMP R1, #0x2 BNE B1 NOP
  • 76. Exercise 5.5 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall using Branch Prediction to avoid branch hazard. MOV R0, #0x41000000 MOV R1, #0x0 MOV R2, #0x82 B1 ADD R1, R1, #1 ADD R3, R3, R0 SUB R4, R4, R2 CMP R1, #0x2 BNE B1 NOP
  • 77. Exercise 5.6 ▪ Calculate the stall cycles caused by the B1 branch in Exercise 5.4. MOV R0, #0x41000000 MOV R1, #0x0 MOV R2, #0x82 B1 ADD R1, R1, #1 ADD R3, R3, R0 SUB R4, R4, R2 CMP R1, #0x10 BNE B1 NOP
  • 78. Exercise 5.7 ▪ Calculate the prediction success rate (hit rate) of the B1 branch in this program, assuming the initial prediction is Not-Taken, and calculate the stall cycles caused by the B1 branch in Exercise 5.5. MOV R0, #0x41000000 MOV R1, #0x0 MOV R2, #0x82 B1 ADD R1, R1, #1 ADD R3, R3, R0 SUB R4, R4, R2 CMP R1, #0x10 BNE B1 NOP