Asst. Prof. Dr.Sumek Wisayataksin
Computer Architectures
Chapter 5 : Pipelining Processor
Increasing Process Speed
▪ Cooking Process
Process 1 : Material Preparing (Meat/Veggie Chop), 5 mins
Process 2 : Cooking, 10 mins
Process 3 : Garnishing Food, 5 mins
Process 4 : Serving, 3 mins
It takes 23 mins to finish 1 dish!!!
Parallel Processing
3 chefs working in parallel (Chef 1, Chef 2, Chef 3): 3 times speed up.
Average: 23/3 = 7.67 min/dish
Parallel Processing Limitation
▪ The processors (chefs) must be experts in every process from 1 to 4.
▪ You need many expert chefs to speed up the process.
▪ Every processor (chef) must have the same skills.
Pipelining Process
[Timeline diagram: Dish1 to Dish4 flow through the 4 processes; a new dish starts every 10 mins, the time of the longest stage.]
4 Processes
Average: 10 mins/dish
Balancing Pipeline stages
▪ Since the bottleneck is Process 2 (10 mins), we can break it into two subprocesses.
[Timeline diagram: Dish1 to Dish5 flow through the 5 balanced processes; a new dish starts every 5 mins.]
5 Processes
Average: 5 mins/dish
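The averages on these slides can be checked with a short script. This is a minimal Python sketch assuming every stage takes the same time (the balanced 5-stage case above); `finish_time` is a made-up helper for illustration, not part of any real tool:

```python
# Throughput of the cooking pipeline: with balanced stages, dish k
# finishes at (number_of_stages + k - 1) * stage_time, so the average
# time per dish approaches the stage time as k grows.
def finish_time(dish_index, n_stages, stage_time):
    """1-based dish index; all stages take stage_time."""
    return (n_stages + dish_index - 1) * stage_time

n_stages, stage_time = 5, 5  # balanced pipeline from the slide
times = [finish_time(k, n_stages, stage_time) for k in range(1, 101)]
avg_per_dish = times[-1] / 100
print(avg_per_dish)  # 5.2: already close to the 5 min/dish stage time
```

The first dish still takes the full 25 mins to pass through all five stages; only the steady-state average approaches 5 mins/dish.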
What is pipelining?
▪ Pipelining is a set of data processing elements connected in series, where
the output of one element is the input of the next one.
▪ The elements of a pipeline are often executed in parallel or in time-sliced
fashion.
▪ Some amount of buffer storage is often inserted between elements.
Pipelining and Maintaining Correctness
• We can break the execution of instructions into stages and overlap the
execution of different instructions.
• We need to ensure that the results produced by our new pipelined processor
are no different from those of the unpipelined one.
• What sort of situations could cause problems once we pipeline our
processor?
Basic Processor Architecture
[Diagram: Program Counter (PC), register file REG 0 ... REG N, ALU with data registers, Instruction Register (IR), FLAG register (Z, C, N, V), and control logic (CTRL), connected by the data bus, address bus, and control bus.]
5-stage Pipelined Processor
[Datapath diagram. Stage 1 Fetch: the PC and a +4 adder drive the Program Memory (Addr in, Read Inst out). Stage 2 Decode: the CTRL Box decodes the instruction and drives the Regfile (R0-R14) read ports Rn, Rm, Rs, producing the values Vn, Vm, Vs, plus the immediate #imm. Stage 3 Execute: the Shift unit and ALU compute the result and update the Flag register; a +offset adder forms branch targets. Stage 4 Memory: the ALU result addresses the Data Memory (Addr, Read Data, Write Data ports). Stage 5 Writeback: the result Vd, with write index Vidx, returns to the Regfile. Control signals for every stage come from the CTRL Box.]
Stage 1 : Fetch (Read Instructions)
• The next instruction is fetched from the memory address
that is currently stored in the program counter (PC) and
stored into the instruction register.
• At the end of the fetch operation, the PC points to the next
instruction that will be read at the next cycle.
• The next instruction might come from a branch address.
Stage 1 : Fetch
1. The CPU sends the contents of the PC to the address bus
and issues a read command on the control bus.
2. In response to the read command (with the address equal to
the PC), the program memory returns the data stored at that
memory location on the data bus.
3. The CPU copies the data from the data bus into its
Instruction Register (IR).
4. The PC is incremented so that it points to the next
instruction. This step prepares the CPU for the next cycle.
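The four fetch steps can be sketched in a few lines of Python. This is a toy model: `program_memory` and `fetch` are hypothetical names, instructions are plain strings, and the buses are just local variables:

```python
# Minimal sketch of the fetch cycle: 1) PC -> address bus, 2) the
# program memory returns the word, 3) data bus -> IR, 4) PC += 4
# (fixed-size 32-bit instructions assumed).
program_memory = {0x00: "ADD R2, R0, R1", 0x04: "ORR R3, R0, #3"}

def fetch(pc):
    address_bus = pc                        # step 1: PC on the address bus
    data_bus = program_memory[address_bus]  # step 2: memory read
    ir = data_bus                           # step 3: copy into the IR
    pc = pc + 4                             # step 4: PC points to the next inst
    return ir, pc

ir, pc = fetch(0x00)
print(ir, hex(pc))  # ADD R2, R0, R1 0x4
```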
Stage 2 : Decode (Inst Interpretation)
▪ During this stage, the encoded instruction held in the instruction
register is interpreted by the decoder.
▪ Register values and immediate constants are also read in this
stage.
Stage 3 : Execute (Calculation)
▪ The CPU sends the decoded instruction
as a set of control signals to the
corresponding computer components.
▪ If the instruction involves arithmetic or
logic, the ALU is used to compute the data.
▪ This is the only stage of the instruction
cycle that computes data. Flags are
updated in this stage.
▪ The branch address (for a jump) is also
computed here.
Stage 4 : Memory (Access RAM)
▪ This stage accesses the data memory
(LDR/STR).
▪ It must come after the Execute stage
because the address can be pre-indexed.
▪ The address from the ALU calculation is
sent to the address bus.
▪ At the end of the stage, the data memory
returns/stores 32-bit data to/from the data
bus.
Stage 5 : Writeback (Write Output)
▪ In this stage, the result is written back
to the data registers (R0-R14).
▪ Results from the Execute and
Memory stages are stored in the
register selected by the
instruction (the destination register).
[Diagram: the Regfile (R0-R14), with read ports Rn, Rm, Rs (values Vn, Vm, Vs) used in Decode and the write port Vd/Vidx driven from Writeback; results flow through Decode, Execute, Memory, and Writeback.]
Data Processing Execution Process
0x00 : ADD R2, R0, R1 // R2 = R0+R1
0x04 : ORR R3, R0, #3 // R3 = R0 | 0x03
0x08 : SUB R4, R1, R7 LSR #4 // R4 = R1 - (R7>>4)
0x0C : EOR R5, R0, R6 LSL R7 // R5 = R0 ^ (R6<<R7)
ADD R2, R0, R1
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x00; Read Inst returns ADD R2, R0, R1 into the Instruction Register; PC becomes 0x04.
Stage 2 Decode: ADD R2, R0, R1 gives Rn=0, Rm=1; the Regfile returns Vn=R0, Vm=R1.
Stage 3 Execute: the ALU performs ADD on R0 and R1, producing R0+R1.
Stage 4 Memory: R0+R1 passes through unchanged.
Stage 5 Writeback: R2 = R0+R1.
ORR R3, R0, #3
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x04; Read Inst returns ORR R3, R0, #3; PC becomes 0x08.
Stage 2 Decode: Rn=0, imm=3; the Regfile returns Vn=R0.
Stage 3 Execute: the ALU performs ORR on R0 and imm=3, producing R0|3.
Stage 4 Memory: R0|3 passes through unchanged.
Stage 5 Writeback: R3 = R0|3.
SUB R4, R1, R7 LSR #4
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x08; Read Inst returns SUB R4, R1, R7 LSR #4; PC becomes 0x0C.
Stage 2 Decode: Rn=1, Rm=7, imm=4; the Regfile returns Vn=R1, Vm=R7.
Stage 3 Execute: the Shift unit computes R7>>4 (shift amount imm=4); the ALU performs SUB, producing R1-(R7>>4).
Stage 4 Memory: R1-(R7>>4) passes through unchanged.
Stage 5 Writeback: R4 = R1-(R7>>4).
EOR R5, R0, R6 LSL R7
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x0C; Read Inst returns EOR R5, R0, R6 LSL R7; PC becomes 0x10.
Stage 2 Decode: Rn=0, Rm=6, Rs=7; the Regfile returns Vn=R0, Vm=R6, Vs=R7.
Stage 3 Execute: the Shift unit computes R6<<R7 (shift amount from R7); the ALU performs EOR, producing R0^(R6<<R7).
Stage 4 Memory: R0^(R6<<R7) passes through unchanged.
Stage 5 Writeback: R5 = R0^(R6<<R7).
Memory Access Execution Process
0x10 : LDR R2, [R0,#4] // R2 = mem[R0+4]
0x14 : LDR R4, [R0], #4 // R4 = mem[R0], then R0 = R0+4
0x18 : STR R3, [R0,R2]! // mem[R0+R2] = R3, then R0 = R0+R2
0x1C : STR R5, [R1], R9 // mem[R1] = R5, then R1 = R1+R9
LDR R2, [R0,#4]
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x10; Read Inst returns LDR R2, [R0,#4]; PC becomes 0x14.
Stage 2 Decode: Rn=0, imm=4; the Regfile returns Vn=R0.
Stage 3 Execute: the ALU performs ADD on R0 and imm=4, producing the address R0+4.
Stage 4 Memory: the data memory is read at Addr = R0+4, returning Mem[R0+4].
Stage 5 Writeback: R2 = Mem[R0+4].
LDR R4, [R0], #4
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x14; Read Inst returns LDR R4, [R0], #4; PC becomes 0x18.
Stage 2 Decode: Rn=0, imm=4; the Regfile returns Vn=R0.
Stage 3 Execute: the ALU is bypassed; the address is R0 itself (post-indexing).
Stage 4 Memory: the data memory is read at Addr = R0, returning Mem[R0]; the +offset adder computes the updated base R0+4.
Stage 5 Writeback: R4 = Mem[R0] and R0 = R0+4.
STR R3, [R0,R2]!
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x18; Read Inst returns STR R3, [R0,R2]!; PC becomes 0x1C.
Stage 2 Decode: Rn=0, Rm=2, Rs=3; the Regfile returns Vn=R0, Vm=R2, and the store data Vs=R3.
Stage 3 Execute: the ALU performs ADD on R0 and R2, producing the address R0+R2.
Stage 4 Memory: R3 is written to the data memory at Addr = R0+R2 (offset 0, since the base was pre-indexed).
Stage 5 Writeback: R0 = R0+R2 (base register writeback).
STR R5, [R1], R9
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x1C; Read Inst returns STR R5, [R1], R9; PC becomes 0x20.
Stage 2 Decode: Rn=1, Rm=9, Rs=5; the Regfile returns Vn=R1, Vm=R9, and the store data Vs=R5.
Stage 3 Execute: the ALU is bypassed; the address is R1 itself (post-indexing).
Stage 4 Memory: R5 is written to the data memory at Addr = R1; the adder computes the updated base R1+R9.
Stage 5 Writeback: R1 = R1+R9.
The pipeline flow
Stage:  FE         DE         EX         ME         WB
CLK 1:  INST 0x00
CLK 2:  INST 0x04  INST 0x00
CLK 3:  INST 0x08  INST 0x04  INST 0x00
CLK 4:  INST 0x0C  INST 0x08  INST 0x04  INST 0x00
CLK 5:  INST 0x10  INST 0x0C  INST 0x08  INST 0x04  INST 0x00
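The table above follows one rule: the instruction issued in slot i (0-based) occupies stage s on clock cycle i + s + 1. A small Python sketch (illustrative names only) reproduces the CLK 5 row:

```python
# Which stage an instruction occupies on a given clock cycle, for an
# ideal 5-stage pipeline with no stalls.
STAGES = ["FE", "DE", "EX", "ME", "WB"]
insts = ["0x00", "0x04", "0x08", "0x0C", "0x10"]

def stage_of(inst_index, clk):
    s = clk - 1 - inst_index          # 0-based stage index on this cycle
    return STAGES[s] if 0 <= s < len(STAGES) else None

# On CLK 5 every instruction is in a different stage:
print([(i, stage_of(k, 5)) for k, i in enumerate(insts)])
# 0x00 is in WB, 0x04 in ME, 0x08 in EX, 0x0C in DE, 0x10 in FE
```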
The pipeline in Action
[Pipeline diagram, clock cycles 1-9: five instructions flow through FETCH, DEC, EXE, MEM, WB, each starting one cycle after the previous.]
On clock cycle 5, the pipeline contains all five of the instructions listed below, each in a different pipeline stage:
1. LDR R1, [R2]
2. ADD R2, R2, R3
3. STR R4, [R5], #4
4. SUB R0, R0, #1
5. LDR R5, [R3]
Branch Execution Process
0x20 : CMP R3, #0 // R3-0 and Update Flag
0x24 : BEQ JUMP1
0x28 : ADD R7, R7, #1
0x2C : . . . .
0x30 : . . . .
0x34 : . . . .
0x38 : JUMP1: SUB R7, R7, #1
BEQ JUMP1 (Condition Met)
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x24; Read Inst returns BEQ JUMP1; PC becomes 0x28.
Stage 2 Decode: BEQ JUMP1 is decoded; the branch target is formed from the PC and the instruction offset ((Offset*4)+4).
Stage 3 Execute: the target address 0x38 is ready.
Stage 4 Memory: the condition is evaluated (MET), so the PC is loaded with 0x38 instead of 0x28.
Next cycle: Addr = PC = 0x38; Read Inst returns SUB R7, R7, #1; PC becomes 0x3C.
BEQ JUMP1 (Condition Not Met)
[Datapath diagram, annotated per stage:]
Stage 1 Fetch: Addr = PC = 0x24; Read Inst returns BEQ JUMP1; PC becomes 0x28.
Stage 2 Decode: BEQ JUMP1 is decoded; the candidate target is formed from the PC and the instruction offset ((Offset*4)+4).
Stage 3 Execute: the target address 0x38 is ready.
Stage 4 Memory: the condition is evaluated (NOT MET), so the PC keeps 0x28.
Next cycle: Addr = PC = 0x28; Read Inst returns ADD R7, R7, #1; PC becomes 0x2C.
Data Hazard
[Pipeline diagram, clock cycles 1-7:]
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC EXE MEM WB
R1 is updated by the LDR only in its WB stage, but R1 = R1+R3 is executed before the LDR's result is written back.
The result of ADD R1, R1, R3 is wrong. This is called a "Data Hazard".
Data Dependencies - True Data Dependencies
[Pipeline diagram, clock cycles 1-8, with stalls inserted:]
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated)
3. ADD R2, R2, R3               FETCH DEC STALL STALL EXE MEM WB
Instruction 2 must stall until the LDR writes R1 back; instruction 3 is delayed behind it.
Data Dependencies
True Data Dependencies
A true data dependence is also known as a Read-After-Write (RAW)
dependence.
If we imagine two instructions, i followed by j: if j uses the result
of i, we say that j is data dependent on i. This is an example of
a true data dependence (a read after a write).
This type of data dependency causes pipeline hazards.
Example (ADD R1, R1, R3 reads the R1 written by the SUB):
SUB R1, R1, #2
ADD R1, R1, R3
ADD R2, R2, R3
Data dependencies
Name dependencies may also exist when two instructions refer to
the same register. Unlike true data dependencies, they do not by
themselves cause pipeline stalls.
▪ Output dependence (write after write)
▪ Anti-dependence (write after read)
Example (R1 is written twice; R3 is read by the ADD and later written by the LDR):
SUB R1, R1, #2
ADD R1, R1, R3
LDR R3, [R2]
SUB R2, R3, #1
Data dependencies
▪ Output dependence: also known as a Write-After-Write (WAW) dependence.
▪ We need to ensure we don't reorder writes to the same register;
otherwise subsequent instructions could receive the wrong data value.
▪ This type doesn't cause a data hazard if the execution order
doesn't change.
Example (both instructions write R1):
SUB R1, R1, #2
ADD R1, R1, R3
Data dependencies
▪ Anti-dependence: also known as a Write-After-Read (WAR) dependence.
▪ We need to be careful not to overwrite a register whose current
value is still required by an earlier instruction.
▪ This type doesn't cause a data hazard if the execution order
doesn't change.
Example (R3 is read by the ADD, then written by the LDR; R2 is read by the LDR, then written by the SUB):
ADD R1, R1, R3
LDR R3, [R2]
SUB R2, R3, #1
Data Hazards
▪ Data hazards occur when an instruction depends on the result of a previous
instruction and that result has not yet been computed. Whenever two different
instructions use the same storage location, execution must appear to proceed
in sequential order.
[Pipeline diagram, with stalls inserted:]
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated)
3. ADD R2, R2, R3               FETCH DEC STALL STALL EXE MEM WB
An ideal Pipeline
In the ideal case, when our pipeline never stalls, our CPI will equal 1 (and IPC = 1).
If we need to stall the pipeline, our CPI will increase, e.g.:
If we must stall for 1 cycle for 20% of instructions, and 3 cycles for 5% of instructions, our new
CPI would be:
Pipeline CPI = ideal pipeline CPI + pipeline stalls per instruction
= 1 + 1*0.20 + 3*0.05 = 1.35
Note: to make the best use of pipelining, we should avoid stalling as much as possible. (without
increasing our clock period!) Remember:
Time = instructions executed × Clocks Per Instruction (CPI) × clock period
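The CPI arithmetic above can be written out as a tiny Python helper (`pipeline_cpi` is a made-up name for illustration):

```python
# CPI with stalls: ideal CPI plus the expected stall cycles per
# instruction, summed over each (stall_cycles, fraction) pair.
def pipeline_cpi(ideal_cpi, stalls):
    """stalls: list of (stall_cycles, fraction_of_instructions)."""
    return ideal_cpi + sum(c * f for c, f in stalls)

cpi = pipeline_cpi(1.0, [(1, 0.20), (3, 0.05)])
print(cpi)  # approximately 1.35, as on the slide

# Time = instructions executed * CPI * clock period, e.g. for
# 1M instructions at a (hypothetical) 1 ns clock:
time_s = 1_000_000 * cpi * 1e-9
```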
Hazard Elimination Methods
1. Pipeline bubbling / stall
As instructions are fetched, control logic determines whether a hazard
could or will occur. If so, the control logic inserts a bubble ("STALL")
into the pipeline. Thus, before the next instruction (which would
cause the hazard) executes, the prior one has had sufficient
time to finish, preventing the hazard.
[Pipeline diagram, with stalls inserted:]
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated)
3. ADD R2, R2, R3               FETCH DEC STALL STALL EXE MEM WB
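As a rough model of pipeline bubbling (handy for checking exercises like 5.1), the sketch below schedules decode cycles in program order, stalling until every source register has been written back. It assumes the 5-stage pipeline above with no forwarding, and that a register is readable in the same cycle its writeback happens; all names are hypothetical:

```python
# A rough scheduler for pipeline bubbling (no forwarding): every source
# register must have reached WB before the consumer's DEC may proceed,
# and instructions decode in program order.
def issue_cycles(insts):
    """insts: list of (dest_register, set_of_source_registers).
    Returns the clock cycle in which each instruction enters DEC."""
    dec_cycles = []
    avail = {}        # register -> first cycle its new value is readable
    dec = 2           # the first instruction decodes in cycle 2
    for dest, srcs in insts:
        need = max([avail.get(r, 0) for r in srcs], default=0)
        dec = max(dec, need)          # stall until all sources are ready
        dec_cycles.append(dec)
        avail[dest] = dec + 3         # EX, ME, then WB writes the result
        dec += 1                      # the next instruction decodes later
    return dec_cycles

# LDR R1,[R2] / ADD R1,R1,R3 / ADD R2,R2,R3 -> two stalls for the ADD:
print(issue_cycles([("R1", {"R2"}), ("R1", {"R1", "R3"}), ("R2", {"R2", "R3"})]))
# [2, 5, 6]
```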
Hazard Elimination Methods
2. Out-of-Order Execution
We can avoid data hazards by reordering instructions. However, the reordered
program must not break the data dependencies:
▪ True data dependencies – Read after Write (RAW)
▪ Output Dependencies – Write after Write (WAW)
▪ Anti-Dependencies – Write After Read (WAR)
Note: Read-after-Read (RAR) reordering does not affect data
dependencies.
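The three dependence types can be checked mechanically before reordering. The sketch below uses a simplified instruction model (one destination register and a set of source registers; `classify` is a hypothetical helper):

```python
# Classify the dependencies between two instructions i and j, where
# i runs before j. Each instruction is (dest_register, {source_registers}).
def classify(i, j):
    deps = set()
    if i[0] in j[1]:
        deps.add("RAW")   # true dependence: j reads what i wrote
    if i[0] == j[0]:
        deps.add("WAW")   # output dependence: both write the same register
    if j[0] in i[1]:
        deps.add("WAR")   # anti-dependence: j overwrites a register i reads
    return deps

sub = ("R1", {"R1"})          # SUB R1, R1, #2
add = ("R1", {"R1", "R3"})    # ADD R1, R1, R3
print(classify(sub, add))     # RAW, WAW and WAR are all present
```

Only a RAW dependence forces a stall; WAW and WAR merely restrict which reorderings are legal.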
Hazard Elimination Methods
3. Data forwarding
A technique to resolve data hazards by passing a value directly
from one stage back to an earlier stage, avoiding a stall.
[Forwarding-path diagram, FE DE EX ME WB: data can be forwarded from ME back to DE.]
No data forwarding (Memory), clock cycles 1-8:
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated in WB)
With data forwarding from ME to DE (Memory):
1. LDR R1, [R2]      FETCH DEC EXE MEM WB
2. ADD R1, R1, R3          FETCH DEC STALL EXE MEM WB   (one stall: waits only for the MEM read)
No data forwarding (Execute), clock cycles 1-8:
1. MVN R1, R1        FETCH DEC EXE MEM WB
2. ADD R1, R1, #1          FETCH DEC STALL STALL EXE MEM WB   (waits until R1 is updated in WB)
With data forwarding (Execute):
1. MVN R1, R1        FETCH DEC EXE MEM WB
2. ADD R1, R1, #1          FETCH DEC EXE MEM WB   (no stall: the ALU result is forwarded from EX back to DE)
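The stall counts in these diagrams follow one pattern: the consumer's DEC must wait until the producer's result is available, and forwarding moves that availability to an earlier stage. A hedged Python sketch of that pattern (hypothetical names; same-cycle write-then-read assumed, as in the diagrams):

```python
# Cycle at the end of which a producer in issue slot 1 has its result,
# depending on the earliest stage the result can be forwarded from
# ('WB' means no forwarding at all).
READY = {"EX": 3, "ME": 4, "WB": 5}

def stall_cycles(gap, ready_stage):
    """gap = issue-slot distance between producer and consumer
    (1 = back to back). Returns the stall cycles the consumer needs."""
    producer_ready = READY[ready_stage]
    consumer_dec = (1 + gap) + 1      # consumer would enter DEC here
    return max(0, producer_ready - consumer_dec)

print(stall_cycles(1, "WB"))  # 2: no forwarding, wait for writeback
print(stall_cycles(1, "ME"))  # 1: load data forwarded from ME
print(stall_cycles(1, "EX"))  # 0: ALU result forwarded from EX
```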
Data forwarding Disadvantage
▪ Data forwarding reduces pipeline stalls. However, the forwarding path increases the logic
delay significantly.
▪ A badly chosen forwarding path reduces the overall clock speed because of the longer
critical path.
▪ The processor designer must compromise between data forwarding and pipeline hazards to
optimize processor performance.
[Diagram: the forwarding path between the DE and EX stages adds delay to both stages.]
Exercise 5.1
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls to avoid data hazards.
MOV R0, #0x30
MOV R1, #0x54
MOV R2, #0xAA
ADD R3, R1, R2
SUB R4, R0, #1
Exercise 5.1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
MOV R0, #0x30
MOV R1, #0x54
MOV R2, #0xAA
ADD R3, R1, R2
SUB R4, R0, #1
Exercise 5.2
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls to avoid data hazards.
MOV R0, #0x41000000
MOV R1, #0x12
MOV R2, #0x45
STRB R1, [R0], #1
STRB R2, [R0], #1
ADD R3, R3, R1
SUB R4, R4, R2
Exercise 5.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
MOV R0, #0x41000000
MOV R1, #0x12
MOV R2, #0x45
STRB R1, [R0], #1
STRB R2, [R0], #1
ADD R3, R3, R1
SUB R4, R4, R2
Exercise 5.3
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls, after reordering the instructions to avoid data hazards.
MOV R0, #0x41000000
MOV R1, #0x12
MOV R2, #0x45
STRB R1, [R0], #1
STRB R2, [R0], #1
ADD R3, R3, R1
SUB R4, R4, R2
Exercise 5.3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Control Hazard (Branching Hazard)
▪ A control hazard occurs whenever the pipeline must decide whether to
jump or not.
CMP R0, #0
BEQ ZERO
ONE: ADD R1, R1, #1
ZERO: ADD R2, R2, #1
[Flowchart: if R0==0, jump to ZERO; otherwise fall through to ONE.]
Control Hazard (Branching Hazard)
[Pipeline diagram, clock cycles 1-8:]
CASE R0==0 (branch taken):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1             FETCH STALL STALL DEC EXE MEM WB
CASE R0!=0 (branch not taken):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1              FETCH STALL STALL DEC EXE MEM WB
The condition and the jump address are only resolved in the MEM stage,
so we have to wait until then.
Control Hazard
What happens immediately after we fetch a conditional branch
instruction?
1. We must determine if the branch is taken or not.
2. If the branch is taken, we must compute and branch to the target
address.
[Flowchart: if R0==0, jump to ZERO; otherwise fall through to ONE.]
How Would We Like to Handle Branches?
- In a pipelined processor, we would like to calculate the next
value of the program counter (PC) in parallel with reading our
instruction memory.
- In practice, this is non-trivial to support as branches may not be
decoded and evaluated until later in the pipeline.
[Datapath diagram, FE DE EX ME WB: the next PC is needed already in the Fetch stage, but the branch is only decoded in Decode, and its target (+offset adder) and condition (Flag) are only resolved in the Execute/Memory stages.]
How Would We Like to Handle Branches?
How to speed up the branch instruction
• Option 1: Assume a branch is not taken.
• Option 2: Evaluate the branch earlier in the pipeline.
• Option 3: Branch prediction.
Assume the Branch Is Not Taken
[Pipeline diagram, clock cycles 1-8:]
Branch not taken (assumption correct):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1              FETCH DEC EXE MEM WB   (the next instruction is fetched before the branch decision)
Branch taken (assumption wrong):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1             FETCH FLUSH FLUSH DEC EXE MEM WB   (the wrongly fetched instructions are flushed from the pipeline)
Assume a Branch Is Not Taken
If we evaluate the branch in the execute stage, we would lose two
cycles every time we encountered a taken branch.
If we assume 20% of instructions are branches, and 60% are taken,
and an otherwise perfect CPI of 1:
CPI contribution from taken branches = 0.2*2 * 0.6 = 0.24
Our new CPI = 1.0 + branch stalls = 1.24
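The same arithmetic as a small Python helper (`cpi_with_branches` is an illustrative name):

```python
# CPI contribution of taken branches: fraction of branches, times the
# fraction of those that are taken, times the per-branch penalty.
def cpi_with_branches(base_cpi, branch_frac, taken_frac, penalty):
    return base_cpi + branch_frac * taken_frac * penalty

print(cpi_with_branches(1.0, 0.20, 0.60, 2))  # approximately 1.24
```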
Why don't we assume the branch is taken?
MOV R3, #0
LOOP  ADD R3, R3, #1
      ADD R1, R1, #2
      SUB R2, R2, #4
      CMP R3, #10
      BNE LOOP
- From the code above, the branch is taken far more often than not taken (9:1).
- So why don't we assume the branch is taken instead?
[Pipeline diagram:]
CMP R3, #10          FETCH DEC EXE MEM WB
BNE LOOP                  FETCH DEC EXE MEM WB
LOOP: SUB R1, R1, R0           FETCH STALL STALL DEC EXE MEM WB
- Because we don't know where to jump to until the offset address is calculated.
Evaluate the branch earlier in the pipeline
▪ Using the same idea as data forwarding, the branch is evaluated early,
in the DEC stage, with the Flag bypassed to it.
[Diagram, FE DE EX ME WB: the Flag is forwarded back to DE, where the branch address is calculated.]
Evaluate the branch earlier in the pipeline
[Diagram, FE DE EX ME WB: the Flag is forwarded to DE and the branch address is calculated there.]
CMP R3, #10          FETCH DEC EXE MEM WB
BNE LOOP                  FETCH DEC EXE MEM WB
LOOP: ADD R3, R3, R1           FETCH STALL DEC EXE MEM WB   (only one stall)
Evaluate the branch earlier in the pipeline
- This technique puts a longer delay into the circuit. It affects the clock speed
of the whole system and is often considered not worth it.
[Diagram, FE DE EX: Flag evaluation and branch-address calculation lengthen the Decode stage's critical path.]
Dynamic Branch Prediction
▪ Predict the branch direction before the branch is executed.
▪ Base the prediction on branch history.
▪ Start fetching instructions in the predicted direction.
Branch Prediction Buffer (BPB)
▪ A simple dynamic branch prediction algorithm is to predict a branch
taken only if it was taken the last time we executed it. We can use a
simple table of 1-bit entries to store our predictions
(0 = Not Taken, 1 = Taken), alongside the jump address.
Example: for 0x00000064 : BNE LOOP, the low-order bits of the address
0x64 = 0b01100100 index the BPB; that entry holds the prediction 1 (Taken)
and the jump address 0x0000004C.
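A 1-bit BPB can be sketched in a few lines of Python. This is a toy model: the class name, the table size, and indexing by low-order word-address bits are all assumptions for illustration:

```python
# A 1-bit branch prediction buffer: each entry remembers only the last
# outcome of the branch that maps to it (1 = taken, 0 = not taken).
class OneBitBPB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [0] * entries    # initial prediction: not taken

    def index(self, pc):
        return (pc >> 2) % self.entries   # drop the byte-offset bits

    def predict(self, pc):
        return self.table[self.index(pc)] == 1

    def update(self, pc, taken):
        self.table[self.index(pc)] = 1 if taken else 0

# A loop branch taken 9 times, then falling through once (like BNE LOOP):
bpb = OneBitBPB()
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    hits += bpb.predict(0x64) == taken
    bpb.update(0x64, taken)
print(hits, "/", len(outcomes))  # 8 / 10
```

The 1-bit scheme mispredicts exactly twice per loop run: on the first iteration (the table still says not taken) and on the loop exit.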
[Datapath diagram: the Fetch stage consults the BPB in parallel with the program memory read; the BPB supplies the predicted Jump Address and a taken/not-taken Decision that selects the next PC (predicted target or PC+4).]
1-Bit Branch Prediction (Not Taken)
0x00000040 : CMP R0, #0
0x00000044 : BEQ ZERO
0x00000048 : ONE  ADD R1, R1, #1
0x0000004C : ZERO ADD R2, R2, #1
The branch at 0x44 = 0b01000100 indexes BPB entry 0001, which holds 0 (Not Taken) and the jump address 0x0000004C.
1-Bit Branch Prediction
[Pipeline diagram, clock cycles 1-8. The next instruction is fetched from PC+4 before the prediction is resolved:]
Predicted not taken, actually not taken:
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1              FETCH DEC EXE MEM WB   (fetched from PC+4)
Predicted not taken, actually taken (prediction wrong):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1        (the wrongly fetched PC+4 instructions are flushed and fetch restarts at B_addr)
1-Bit Branch Prediction (Taken)
0x00000040 : CMP R0, #0
0x00000044 : BEQ ZERO
0x00000048 : ONE  ADD R1, R1, #1
0x0000004C : ZERO ADD R2, R2, #1
The branch at 0x44 = 0b01000100 indexes BPB entry 0001, which now holds 1 (Taken) and the jump address 0x0000004C.
1-Bit Branch Prediction
[Pipeline diagram, clock cycles 1-8. The predicted target 0x4C is fetched right after the prediction:]
Predicted taken, actually taken:
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1             FETCH DEC EXE MEM WB   (fetched from 0x4C)
Predicted taken, actually not taken (prediction wrong):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                    FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1         (the wrongly fetched 0x4C instructions are flushed and fetch restarts at PC+4)
Exercise 5.4
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls, using the branch-not-taken method to avoid the branch hazard.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x2
BNE B1
NOP
Exercise 5.5
▪ Draw the pipeline stage diagram of these instructions, with pipeline
stalls, using branch prediction to avoid the branch hazard.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x2
BNE B1
NOP
Exercise 5.6
▪ Calculate the stall cycles caused by the B1 branch in Exercise 5.4.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x10
BNE B1
NOP
Exercise 5.7
▪ Calculate the prediction success rate (hit rate) of the B1 branch in this
program, assuming the initial prediction is Not-Taken, and calculate the
stall cycles caused by the B1 branch in Exercise 5.5.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x10
BNE B1
NOP
More Related Content

Similar to Maximizing CPU Efficiency: A Comprehensive Exploration of Pipelining in Computer Architecture.pdf (20)

PPTX
PROCESSOR AND CONTROL UNIT
Amirthavalli Senthil
 
PPT
basic-processing-unit computer organ.ppt
ssuser702532
 
PPT
Computer Organization for third semester Vtu SyllabusModule 4.ppt
ShilpaKc3
 
PPT
chapter3 - Basic Processing Unit.ppt
PoliceNiranjanReddy
 
PDF
Pipeline and data hazard
Waed Shagareen
 
PPTX
Multicycle Datapath and Control: Enhancing CPU Efficiency through Sequential ...
khadeeja786khadeeja1
 
PPT
Control unit-implementation
WBUTTUTORIALS
 
PDF
Basic_Processing_Unit.pdf
UmamaheswariV4
 
PDF
BCS302-DDCO-basic processing unit-Module 5- VTU 2022 scheme-DDCO-pdf
ciyamala kushbu
 
PPT
Unit2 control unit
Ashim Saha
 
PPT
UNIT-3(Processor Organization)_computer organization.ppt
anushachalla14
 
PPT
chapter3 - Basic Processing base Unit.ppt
meghaasha6700
 
PPTX
heuring_jordan_----------ch02(1) (1).pptx
christinamary2620
 
PPT
Basic processing unit by aniket bhute
Aniket Bhute
 
PPTX
Pipelining And Vector Processing
TheInnocentTuber
 
PDF
Computer Organization
JasmineShaik14
 
PDF
Pipelining
PriyanshiPanwar14
 
PDF
COMPUTER ORGANIZATION NOTES Unit 7
Dr.MAYA NAYAK
 
PDF
310471266 chapter-7-notes-computer-organization
srinoni
 
PPTX
BASICS OF MIPS ARCHITECTURE AND THEIR INSTRUCTION SET
Saritha Sri
 
PROCESSOR AND CONTROL UNIT
Amirthavalli Senthil
 
basic-processing-unit computer organ.ppt
ssuser702532
 
Computer Organization for third semester Vtu SyllabusModule 4.ppt
ShilpaKc3
 
chapter3 - Basic Processing Unit.ppt
PoliceNiranjanReddy
 
Pipeline and data hazard
Waed Shagareen
 
Multicycle Datapath and Control: Enhancing CPU Efficiency through Sequential ...
khadeeja786khadeeja1
 
Control unit-implementation
WBUTTUTORIALS
 
Basic_Processing_Unit.pdf
UmamaheswariV4
 
BCS302-DDCO-basic processing unit-Module 5- VTU 2022 scheme-DDCO-pdf
ciyamala kushbu
 
Unit2 control unit
Ashim Saha
 
UNIT-3(Processor Organization)_computer organization.ppt
anushachalla14
 
chapter3 - Basic Processing base Unit.ppt
meghaasha6700
 
heuring_jordan_----------ch02(1) (1).pptx
christinamary2620
 
Basic processing unit by aniket bhute
Aniket Bhute
 
Pipelining And Vector Processing
TheInnocentTuber
 
Computer Organization
JasmineShaik14
 
Pipelining
PriyanshiPanwar14
 
COMPUTER ORGANIZATION NOTES Unit 7
Dr.MAYA NAYAK
 
310471266 chapter-7-notes-computer-organization
srinoni
 
BASICS OF MIPS ARCHITECTURE AND THEIR INSTRUCTION SET
Saritha Sri
 

Recently uploaded (20)

PPTX
原版一样(Acadia毕业证书)加拿大阿卡迪亚大学毕业证办理方法
Taqyea
 
PPTX
UNIT DAA PPT cover all topics 2021 regulation
archu26
 
PPTX
Break Statement in Programming with 6 Real Examples
manojpoojary2004
 
PDF
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
PDF
International Journal of Information Technology Convergence and services (IJI...
ijitcsjournal4
 
PPTX
ISO/IEC JTC 1/WG 9 (MAR) Convenor Report
Kurata Takeshi
 
PPTX
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
PDF
Unified_Cloud_Comm_Presentation anil singh ppt
anilsingh298751
 
PPTX
MPMC_Module-2 xxxxxxxxxxxxxxxxxxxxx.pptx
ShivanshVaidya5
 
PPTX
NEUROMOROPHIC nu iajwojeieheueueueu.pptx
knkoodalingam39
 
PDF
MOBILE AND WEB BASED REMOTE BUSINESS MONITORING SYSTEM
ijait
 
PDF
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
PPTX
site survey architecture student B.arch.
sri02032006
 
PPTX
EC3551-Transmission lines Demo class .pptx
Mahalakshmiprasannag
 
PPTX
Electron Beam Machining for Production Process
Rajshahi University of Engineering & Technology(RUET), Bangladesh
 
PPTX
Thermal runway and thermal stability.pptx
godow93766
 
PDF
BioSensors glucose monitoring, cholestrol
nabeehasahar1
 
PPTX
Benefits_^0_Challigi😙🏡💐8fenges[1].pptx
akghostmaker
 
PPTX
REINFORCEMENT AS CONSTRUCTION MATERIALS.pptx
mohaiminulhaquesami
 
PDF
ARC--BUILDING-UTILITIES-2-PART-2 (1).pdf
IzzyBaniquedBusto
 
原版一样(Acadia毕业证书)加拿大阿卡迪亚大学毕业证办理方法
Taqyea
 
UNIT DAA PPT cover all topics 2021 regulation
archu26
 
Break Statement in Programming with 6 Real Examples
manojpoojary2004
 
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
International Journal of Information Technology Convergence and services (IJI...
ijitcsjournal4
 
ISO/IEC JTC 1/WG 9 (MAR) Convenor Report
Kurata Takeshi
 
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
Unified_Cloud_Comm_Presentation anil singh ppt
anilsingh298751
 
MPMC_Module-2 xxxxxxxxxxxxxxxxxxxxx.pptx
ShivanshVaidya5
 
NEUROMOROPHIC nu iajwojeieheueueueu.pptx
knkoodalingam39
 
MOBILE AND WEB BASED REMOTE BUSINESS MONITORING SYSTEM
ijait
 
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
site survey architecture student B.arch.
sri02032006
 
EC3551-Transmission lines Demo class .pptx
Mahalakshmiprasannag
 
Electron Beam Machining for Production Process
Rajshahi University of Engineering & Technology(RUET), Bangladesh
 
Thermal runway and thermal stability.pptx
godow93766
 
BioSensors glucose monitoring, cholestrol
nabeehasahar1
 
Benefits_^0_Challigi😙🏡💐8fenges[1].pptx
akghostmaker
 
REINFORCEMENT AS CONSTRUCTION MATERIALS.pptx
mohaiminulhaquesami
 
ARC--BUILDING-UTILITIES-2-PART-2 (1).pdf
IzzyBaniquedBusto
 
Ad

Maximizing CPU Efficiency: A Comprehensive Exploration of Pipelining in Computer Architecture.pdf

  • 1. Asst. Prof. Dr.Sumek Wisayataksin Computer Architectures Chapter 5 : Pipelining Processor
  • 2. Increasing Process Speed ▪ Cooking Process Process 1 : Material Preparing Meat/Veggie Chop 5 mins Process 3 : Garnishing Food 5 mins Process 4 : Serving 3 mins Process 2 : Cooking 10 mins It takes 23 mins to finish 1 dish !!!
  • 3. Parallel Processing 3 times speed up Average 23/3 = 7.67 min/dish Chef 1 Chef 2 Chef 3
  • 4. Parallel Processing Limitation ▪ The processors (Chefs) must be an expert of all process from 1-4. ▪ You need to have many expert chefs to speed up the process. ▪ Every processor (Chef) must have the same skills
  • 5. Pipelining Process Time Dish2 Dish1 Dish3 Dish4 10 mins 10 mins 10 mins 10 mins 10 mins 10 mins 4 Processes Average 10 mins/dish
  • 6. Balancing Pipeline stages ▪ Since the bottleneck is Process 2 (10 mins), we can break it into 2 subprocess Time Dish2 Dish1 Dish3 Dish4 Dish5 5 mins 5 Processes Average 5 mins/dish
  • 7. What is pipelining ▪ Pipelining is a set of data processing elements connected in series, where the output of one element is the input of the next one. ▪ The elements of a pipeline are often executed in parallel or in time-sliced fashion. ▪ Some amount of buffer storage is often inserted between elements.
  • 8. Pipelining and Maintaining Correctness • We can break the execution of instructions into stages and overlap the execution of different instructions. • We need to ensure that the results produced by our new pipelined processor are no different to the unpipelined one. • What sort of situations could cause problems once we pipeline our processor?
  • 9. Basic Processor Architecture Program Counter (PC) R E G 0 R E G N …… C T R L Data bus Address bus ALU Data Registers I R Z C N V FLAG Control Bus
  • 10. 5-stage Pipelined Processor PC + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data Stage 1 Fetch Stage 2 Decode Stage 3 Execute Stage 4 Memory Stage 5 Writeback #imm Signals from CTRL Box
  • 11. Stage 1 : Fetch (Read Instructions) • The next instruction is fetched from the memory address that is currently stored in the program counter (PC) and stored into the instruction register. • At the end of the fetch operation, the PC points to the next instruction that will be read at the next cycle. • The next instruction might be from branch address.
  • 12. Stage 1 : Fetch 1.The CPU sends the contents of the PC to address bus and sends a read command on the control bus 2.In response to the read command (with address equal to PC), the program memory returns the data stored at the memory location indicated by the PC on the data bus 3.The CPU copies the data from the data bus into its Instruction Register (IR) 4.The PC is incremented, so that it points to the next instruction. This step prepares the CPU for the next cycle.
  • 13. Stage 2 : Decode (Inst Interpretation) ▪ During this stage, the encoded instruction presented in the instruction register is interpreted by the decoder. ▪ Register values and immediate constants are also read in this stage
  • 14. Stage 3 : Execute (Calculation) ▪ The CPU sends the decoded instruction as a set of control signals to the corresponding computer components. ▪ If the instruction involves arithmetic or logic, the ALU is utilized and data to be computed. ▪ This is the only stage of the instruction cycle that is compute data. Flags are updated in this stage. ▪ The branch address (for jump) is also processed here.
  • 15. Stage 4 : Memory (Access RAM) ▪ This stage access the data memory (LDR/STR) ▪ This process must be after Execute stage because address can be pre-indexed. ▪ The address from ALU calculation will be sent to the address bus. ▪ At the end of stage, the data memory return/store 32-bit data to/from the data bus
  • 16. Stage 5 : Writeback (Write Output) ▪ This stage, it writes the data register (R0-R12). ▪ Results from Execute and Memory stage are stored in the register selected by the instruction (Destination register) Regfile (R0-R14) Rn Rm Vn Vm Rs Vs Vd Vidx Decode Execute Memory Writeback
  • 17. Data Processing Execution Process 0x00 : ADD R2, R0, R1 // R2 = R0+R1 0x04 : ORR R3, R0, #3 // R3 = R0 | 0x03 0x08 : SUB R4, R1, R7 LSR #4 // R4 = R1 – (R7>>4) 0x0C : EOR R5, R0, R6 LSL R7 // R5 = R0 ^ (R6<<R7) PC Instruction
  • 18. ADD R2, R0, R1 + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x00 0x00 ADD R2, R0, R1 0x04 0x00 Stage 3 Execute R0 R1 ADD R0+R1 Stage 4 Memory R0+R1 Stage 2 Decode Rn=0 Rm=1 ADD R2, R0, R1 Vn=R0 Vm=R1 Stage 5 Writeback R0+R1 R2=R0+R1 PC
  • 19. ORR R3, R0, #3 + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x04 0x04 ORR R3, R0, #3 0x08 0x04 Stage 3 Execute R0 imm=3 ORR R0|3 Stage 4 Memory R0|3 Stage 2 Decode Rn=0 imm=3 ORR R3, R0, #3 Vn=R0 Stage 5 Writeback R0|3 R3=R0|3 PC
  • 20. SUB R4, R1, R7 LSR #4 + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x08 0x08 SUB R4, R1, R7 LSR #4 0x0C 0x08 Stage 4 Memory R1-(R7>>4) Stage 5 Writeback R1-(R7>>4) R4=R1-(R7>>4) Stage 2 Decode Rn=1 Rm=7 SUB R4, R1, R7 LSR #4 Vn=R1 imm=4 Vm=R7 Stage 3 Execute R1 imm=4 SUB R1-(R7>>4) R7 R7>>4 PC
  • 21. EOR R5, R0, R6 LSL R7 + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x0C 0x0C EOR R5, R0, R6 LSL R7 0x10 0x0C Stage 4 Memory R0^(R6<<R7) Stage 5 Writeback R0^(R6<<R7) R5=R0^(R6<<R7) Stage 3 Execute R0 R7 EOR R0^(R6<<R7) R6 R6<<R7 Stage 2 Decode Rn=0 Rm=6 EOR R5, R0, R6 LSL R7 Vn=R0 Rs=7 Vm=R6 Vs=R7 PC
  • 22. Memory Access Execution Process 0x10 : LDR R2, [R0,#4] // R2 = mem[R0+4] 0x14 : LDR R4, [R0], #4 // R4 = mem[R0] R0 = R0+4 0x18 : STR R3, [R0,R2]! // mem[R0+R2] = R3 R0 = R0+R2 0x1C : STR R5, [R1], R9 // mem[R1] = R5 R1 = R1+R9 PC Instruction
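The three index modes listed above differ only in which address is used and when the base register is updated. The sketch below is illustrative (the mode names and function signature are my own, not part of any real ISA model):

```python
# Hedged sketch of the LDR/STR index modes from the slide above.
def access(mode, regs, rn, offset):
    """Return the memory address used, updating the base register as needed."""
    if mode == "pre":            # LDR R2, [R0, #4]  -> address = R0+4, R0 unchanged
        return regs[rn] + offset
    if mode == "pre_wb":         # STR R3, [R0, R2]! -> address = R0+R2, R0 = R0+R2
        regs[rn] += offset
        return regs[rn]
    if mode == "post":           # LDR R4, [R0], #4  -> address = R0, then R0 = R0+4
        addr = regs[rn]
        regs[rn] += offset
        return addr
    raise ValueError(mode)

regs = {"R0": 0x100}
print(hex(access("pre", regs, "R0", 4)), hex(regs["R0"]))    # 0x104 0x100
print(hex(access("post", regs, "R0", 4)), hex(regs["R0"]))   # 0x100 0x104
```

Note that pre-indexing is why the Memory stage must follow Execute: the effective address has to be computed by the ALU first.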
  • 23. LDR R2, [R0,#4] + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x10 0x10 LDR R2, [R0,#4] 0x14 0x10 Stage 3 Execute R0 imm=4 ADD R0+4 Stage 2 Decode Rn=0 imm=4 LDR R2, [R0,#4] Vn=R0 Stage 5 Writeback Mem[R0+4] R2=Mem[R0+4] Stage 4 Memory Mem[R0+4] R0+4 PC
  • 24. LDR R4, [R0], #4 + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x14 0x14 LDR R4, [R0],#4 0x18 0x14 Stage 3 Execute R0 imm=4 BYPASS R0 Stage 2 Decode Rn=0 imm=4 LDR R4, [R0],#4 Vn=R0 Stage 4 Memory Mem[R0] R0 R0+4 Stage 5 Writeback Mem[R0] R4=Mem[R0] R0+4 R0=R0+4 PC
  • 25. STR R3, [R0,R2]! + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x18 0x18 STR R3, [R0,R2]! 0x1C 0x18 Stage 5 Writeback R0+R2 R0=R0+R2 Stage 2 Decode Rn=0 Rm=2 STR R3, [R0,R2]! Vn=R0 Rs=3 Vm=R2 Vs=R3 Stage 3 Execute R0 R3 ADD R0+R2 R2 Stage 4 Memory R3 R0+R2 R0+R2 0 PC
  • 26. STR R5, [R1], R9 + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x1C 0x1C STR R5, [R1], R9 0x20 0x1C Stage 5 Writeback R1+R9 R1=R1+R9 Stage 2 Decode Rn=1 Rm=9 STR R5, [R1], R9 Vn=R1 Rs=5 Vm=R9 Vs=R5 Stage 3 Execute R1 R5 BYPASS R1 R9 Stage 4 Memory R5 R1 R1+R9 R9 PC
  • 27. The pipeline flow FE DE EX ME WB CLK 1 INST 0x00 CLK 2 INST 0x04 INST 0x00 CLK 3 INST 0x08 INST 0x04 INST 0x00 CLK 4 INST 0x0C INST 0x08 INST 0x04 INST 0x00 CLK 5 INST 0x10 INST 0x0C INST 0x08 INST 0x04 INST 0x00
  • 28. The pipeline in Action Clock Cycle 1 2 3 4 5 6 7 8 9 FETCH FETCH FETCH FETCH EXE DEC DEC DEC DEC EXE EXE EXE MEM MEM MEM MEM WB WB WB WB FETCH DEC EXE MEM WB On clock cycle 5, the pipeline contains all the instructions listed in different pipeline stages 1. LDR R1, [R2] 2. ADD R2, R2, R3 3. STR R4, [R5], #4 4. SUB R0, R0, #1 5. LDR R5, [R3]
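The staircase pattern in the two slides above can be generated mechanically: in an ideal 5-stage pipeline, instruction i occupies stage (cycle − 1 − i). A small sketch (stage names and layout are my own shorthand):

```python
# Print the stage occupied by each instruction in each clock cycle of an
# ideal (no-stall) 5-stage pipeline, matching the tables on the slides.
STAGES = ["FE", "DE", "EX", "ME", "WB"]

def pipeline_diagram(n_inst, n_cycles):
    rows = []
    for i in range(n_inst):
        row = []
        for cycle in range(1, n_cycles + 1):
            stage = cycle - 1 - i          # stage index for instruction i
            row.append(STAGES[stage] if 0 <= stage < 5 else "..")
        rows.append(" ".join(row))
    return rows

for line in pipeline_diagram(5, 9):
    print(line)
```

On clock cycle 5 every stage is busy: instruction 1 is in WB while instruction 5 is in FE, exactly as the slide shows.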
  • 29. Branch Execution Process 0x20 : CMP R3, #0 // R3-0 and Update Flag 0x24 : BEQ JUMP1 0x28 : ADD R7, R7, #1 0x2c : . . . . 0x30 : . . . . 0x34 : . . . . 0x38 : JUMP1: SUB R7, R7, #1 PC Instruction
  • 30. BEQ JUMP1 (Condition Meet) + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x24 0x24 BEQ JUMP1 0x28 0x24 Stage 2 Decode BEQ JUMP1 0x28 (Offset*4)+4 Stage 3 Execute 0x38 PC Stage 4 Memory MEET 0x38 0x38 MEET 0x38 0x28 0x38 SUB R7, R7, #1 0x3C
  • 31. BEQ JUMP1 (Condition Not Meet) + +4 Program Memory CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU Data Memory + +offset Flag Vd + Vidx Addr Read Inst Addr Read Data Write Data #imm Instruction Register Stage 1 Fetch 0x24 0x24 BEQ JUMP1 0x28 0x24 Stage 2 Decode BEQ JUMP1 0x28 (Offset*4)+4 Stage 3 Execute 0x38 PC Stage 4 Memory NOT MEET 0x38 0x38 NOT MEET 0x28 0x28 0x28 ADD R7, R7, #1 0x2C
  • 32. Data Hazard Clock Cycle 1 2 3 4 5 6 7 FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 is updated here R1 = R1+R3 is executed before the LDR result is written back The result of ADD R1, R1, R3 is wrong. This is called a “Data Hazard”
  • 33. Data Dependencies - True Data Dependencies Clock Cycle 1 2 3 4 5 6 7 8 FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 updated here STALL STALL Wait R1 Updated FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 3. ADD R1, R1, R3 R1 updated here STALL Wait R1 Updated 2. ADD R2, R2, R3 FETCH DEC EXE MEM WB 3. ADD R2, R2, R3 FETCH DEC STALL EXE MEM STALL
  • 34. Data Dependencies True Data Dependencies True data dependence (green arrow), also known as a Read-After-Write (RAW) dependence. If we imagine two instructions, i followed by j: if j uses the result of i, we say that j is data dependent on i. This is an example of a true data dependence (a read after a write). This type of data dependency causes pipeline hazards SUB R1, R1, #2 ADD R1, R1, R3 ADD R2, R2, R3 Write Read
  • 35. Data dependencies Name dependencies may also exist when two instructions refer to the same register. Unlike true data dependencies, they do not cause pipeline stalls. ▪ Output dependence (red arrow) ▪ Anti-dependence (green arrow) SUB R1, R1, #2 ADD R1, R1, R3 LDR R3, [R2] SUB R2, R3, #1 Write Write
  • 36. Data dependencies ▪ Output dependence (red arrow), also known as a Write-After-Write (WAW) dependence ▪ We need to ensure we don't reorder writes to the same register; otherwise subsequent instructions could receive the wrong data value. ▪ This type doesn't cause a data hazard if the execution order doesn't change SUB R1, R1, #2 ADD R1, R1, R3 Write Write
  • 37. Data dependencies ▪ Anti-dependence (green arrows), also known as a Write-After-Read (WAR) dependence ▪ We need to be careful not to overwrite a register whose current value is still required by an earlier instruction. ▪ This type doesn't cause a data hazard if the execution order doesn't change ADD R1, R1, R3 LDR R3, [R2] SUB R2, R3, #1 Write Read Write Read
  • 38. Data Hazards ▪ Data hazards occur when an instruction depends on the result of a previous instruction and that result has not yet been computed. Whenever two different instructions use the same storage location, it must appear as if they executed in sequential order. FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 updated here STALL STALL Wait R1 Updated 3. ADD R2, R2, R3 FETCH DEC STALL EXE MEM STALL
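The RAW condition described above is simple to state as a check. In this illustrative sketch each instruction is reduced to a (destination register, source registers) pair; the representation is my own, not part of any decoder:

```python
# Check for a RAW hazard between two adjacent instructions: the second
# instruction reads a register that the first one writes.
def raw_hazard(first, second):
    """Each argument is (dest_register, [source_registers])."""
    dest, _ = first
    _, srcs = second
    return dest in srcs

ldr = ("R1", ["R2"])          # LDR R1, [R2]
add = ("R1", ["R1", "R3"])    # ADD R1, R1, R3
print(raw_hazard(ldr, add))   # True
```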
  • 39. An ideal Pipeline In the ideal case, when our pipeline never stalls, our CPI will equal 1 (and IPC = 1). If we need to stall the pipeline, our CPI will increase, e.g.: If we must stall for 1 cycle for 20% of instructions, and 3 cycles for 5% of instructions, our new CPI would be: Pipeline CPI = ideal pipeline CPI + pipeline stalls per instruction = 1 + 1*0.20 + 3*0.05 = 1.35 Note: to make the best use of pipelining, we should avoid stalling as much as possible. (without increasing our clock period!) Remember: Time = instructions executed × Clocks Per Instruction (CPI) × clock period
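The CPI arithmetic on the slide above can be computed directly:

```python
# Pipeline CPI = ideal pipeline CPI + pipeline stalls per instruction.
ideal_cpi = 1.0
stalls_per_inst = 1 * 0.20 + 3 * 0.05   # 1-cycle stall for 20% of instructions,
                                        # 3-cycle stall for 5% of instructions
cpi = ideal_cpi + stalls_per_inst
print(round(cpi, 2))                    # 1.35
```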
  • 40. Hazard Elimination Methods 1. Pipeline bubbling / Stall As instructions are fetched, control logic determines whether a hazard could/will occur. If so, the control logic inserts a “STALL” into the pipeline. Thus, before the next instruction (which would cause the hazard) executes, the prior one will have had sufficient time to finish and prevent the hazard. FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 updated here STALL STALL Wait R1 Updated 3. ADD R2, R2, R3 FETCH DEC STALL EXE MEM STALL
  • 41. Hazard Elimination Methods 2. Out-of-Order Execution We can avoid data hazards by reordering instructions. However, the reordered program must not break the data dependencies: ▪ True data dependencies – Read After Write (RAW) ▪ Output dependencies – Write After Write (WAW) ▪ Anti-dependencies – Write After Read (WAR) Note – Read-After-Read (RAR) reordering does not affect data dependencies.
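A reorder is legal only if swapping the two instructions breaks none of the three dependence types just listed. A hedged sketch, using the same (destination, sources) shorthand as before (the representation is illustrative, not a real scheduler):

```python
# Legality check for swapping instruction i (earlier) with j (later):
# the swap must not violate RAW, WAW, or WAR dependences between them.
def can_reorder(i, j):
    """Each instruction is (dest_register, [source_registers])."""
    di, si = i
    dj, sj = j
    raw = di in sj          # j reads what i writes  (true dependence)
    waw = di == dj          # both write the same register (output dependence)
    war = dj in si          # j writes what i reads  (anti-dependence)
    return not (raw or waw or war)

print(can_reorder(("R1", ["R2"]), ("R4", ["R5", "R6"])))   # True (independent)
print(can_reorder(("R1", ["R2"]), ("R3", ["R1"])))         # False (RAW)
```

RAR is deliberately absent from the check: two reads of the same register can always be swapped.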
  • 42. Hazard Elimination Methods 3. Data forwarding A technique to resolve data hazards by passing a value directly from one stage to an earlier stage in order to avoid stalling. FE DE EX ME WB Data forwarding from ME to DE
  • 43. No Data forwarding (Memory) FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 R1 updated here STALL STALL Wait R1 Updated FE DE EX ME WB Clock Cycle 1 2 3 4 5 6 7 8
  • 44. Data forwarding (Memory) FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. LDR R1, [R2] 2. ADD R1, R1, R3 STALL Wait MEM Read Clock Cycle 1 2 3 4 5 6 7 8 FE DE EX ME WB Data forwarding from ME to DE
  • 45. No Data forwarding (Execute) FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. MVN R1, R1 2. ADD R1, R1, #1 R1 updated here STALL STALL Wait R1 Updated FE DE EX ME WB Clock Cycle 1 2 3 4 5 6 7 8
  • 46. Data forwarding (Execute) FETCH FETCH EXE DEC DEC EXE MEM MEM WB WB 1. MVN R1, R1 2. ADD R1, R1, #1 FE DE EX ME WB Data forwarding from EX to DE
  • 47. Data forwarding Disadvantage ▪ Data forwarding reduces pipeline stalls. However, the forwarding path increases the logic delay significantly. ▪ A poorly chosen forwarding path will reduce the overall clock speed due to the longer path. ▪ Processor designers must compromise between data forwarding and pipeline hazards to optimize processor performance. DE EX DE EX Delay Delay Delay
  • 48. Exercise 5.1 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall to avoid data hazard. MOV R0, #0x30 MOV R1, #0x54 MOV R2, #0xAA ADD R3, R1, R2 SUB R4, R0, #1
  • 49. Exercise 5.1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 MOV R0, #0x30 MOV R1, #0x54 MOV R2, #0xAA ADD R3, R1, R2 SUB R4, R0, #1
  • 50. Exercise 5.2 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall to avoid data hazard. MOV R0, #0x41000000 MOV R1, #0x12 MOV R2, #0x45 STRB R1, [R0], #1 STRB R2, [R0], #1 ADD R3, R3, R1 SUB R4, R4, R2
  • 51. Exercise 5.2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 MOV R0, #0x41000000 MOV R1, #0x12 MOV R2, #0x45 STRB R1, [R0], #1 STRB R2, [R0], #1 ADD R3, R3, R1 SUB R4, R4, R2
  • 52. Exercise 5.3 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall after reordering instructions to avoid data hazard. MOV R0, #0x41000000 MOV R1, #0x12 MOV R2, #0x45 STRB R1, [R0], #1 STRB R2, [R0], #1 ADD R3, R3, R1 SUB R4, R4, R2
  • 54. Exercise 5.3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
  • 55. Control Hazard (Branching Hazard) ▪ A control hazard occurs whenever the pipeline must decide whether to jump or not. CMP R0, #0 BEQ ZERO ONE: ADD R1, R1, #1 ZERO: ADD R2, R2, #1 (R0==0) Jump to ZERO Jump to ONE Yes No
  • 56. Control Hazard (Branching Hazard) FETCH FETCH EXE DEC EXE MEM MEM WB WB 1. CMP R0, #0 2. BEQ ZERO 3. ZERO ADD R2, R2, #1 FETCH Clock Cycle 1 2 3 4 5 6 7 8 CASE R0==0 1. CMP R0, #0 2. BEQ ZERO 3. ONE ADD R1, R1, #1 CASE R0!=0 STALL STALL DEC EXE MEM DEC FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH STALL STALL DEC EXE MEM DEC The condition as well as the jump address is resolved in the MEM stage, so we have to wait until that happens.
  • 57. Control Hazard What happens immediately after we fetch a conditional branch instruction? 1. We must determine if the branch is taken or not. 2. If the branch is taken, we must compute and branch to the target address. (R0==0) Jump to ZERO Jump to ONE Yes No
  • 58. How Would We Like to Handle Branches? - In a pipelined processor, we would like to calculate the next value of the program counter (PC) in parallel with reading our instruction memory. - In practice, this is non-trivial to support as branches may not be decoded and evaluated until later in the pipeline. FE DE EX ME WB
  • 59. PC + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU + +offset Flag Vd Vidx Addr Read Inst Stage 1 Fetch Stage 2 Decode Stage 3 Execute Stage 4 Memory #imm How Would We Like to Handle Branches?
  • 60. How to speed up the branch instruction • Option 1: Assume a branch is not taken. • Option 2: Evaluate the branch earlier in the pipeline. • Option 3: Branch prediction.
  • 61. Assume the Branch Is Not Taken FETCH FETCH EXE DEC EXE MEM MEM WB WB 1. CMP R0, #0 2. BEQ ZERO 3. ONE ADD R1, R1, #1 FETCH Clock Cycle 1 2 3 4 5 6 7 8 Fetch the next Instruction before branch decision 1. CMP R0, #0 2. BEQ ZERO 3. ZERO ADD R2, R2, #1 If Branch is taken, Flush the pipeline DEC EXE MEM DEC FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH FLUSH FLUSH DEC EXE MEM DEC
  • 62. Assume a Branch Is Not Taken If we evaluate the branch in the execute stage, we lose two cycles every time we encounter a taken branch. If we assume 20% of instructions are branches, 60% of branches are taken, and an otherwise perfect CPI of 1: CPI contribution from taken branches = 0.2 × 2 × 0.6 = 0.24 Our new CPI = 1.0 + branch stalls = 1.24
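The branch-penalty arithmetic above, computed directly:

```python
# CPI contribution of taken branches under "assume not taken":
# 2 lost cycles per taken branch, 20% branches, 60% of them taken.
branch_frac, taken_frac, penalty = 0.20, 0.60, 2
cpi = 1.0 + penalty * branch_frac * taken_frac
print(round(cpi, 2))   # 1.24
```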
  • 63. Why don’t we assume the branch is taken? MOV R3, #0 LOOP ADD R3, R3, #1 ADD R1, R1, #2 SUB R2, R2, #4 CMP R3, #10 BNE LOOP CMP R3, #10 BNE LOOP LOOP SUB R1, R1, R0 FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH STALL STALL DEC EXE MEM DEC We don’t know where to jump to until the offset address is calculated. - From the code on the left, the branch is taken far more often than not (9:1). - So why don’t we assume the branch is taken instead?
  • 64. Evaluate the branch earlier in the pipeline ▪ The same idea as “data forwarding”: the branch is evaluated early, in the DEC stage, and the flag is bypassed FE DE EX ME WB Flag Calculate Branch Address
  • 65. Evaluate the branch earlier in the pipeline FE DE EX ME WB Flag Calculate CMP R3, #10 BNE LOOP LOOP ADD R3, R3, R1 FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH STALL DEC EXE MEM DEC Branch Address
  • 66. Evaluate the branch earlier in the pipeline - This technique causes a longer delay in the circuit. It affects the clock speed of the whole system and is often considered not worthwhile. FE DE EX Flag Calculate Branch Address
  • 67. Dynamic Branch Prediction ▪ Predict the branch direction before the branch is executed. ▪ Base the prediction on the branch history. ▪ Start fetching instructions in the predicted direction.
  • 68. Branch Prediction Buffer (BPB) ▪ A simple dynamic branch prediction algorithm is to predict a branch taken only if it was taken the last time we executed it. We can use a simple table of 1-bit entries to store our predictions 0x00000064 : BNE LOOP 0x64 = 0b01100100 0001 1 * * BPB 0x0000004C * * Jump Address 0 = Not Taken 1 = Taken
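The 1-bit BPB above can be sketched as a small table indexed by low-order PC bits. This is an illustrative model: the table size and exact index bits are assumptions (the slide selects a 4-bit field from the branch address; real designs vary).

```python
# Hedged sketch of a 1-bit Branch Prediction Buffer: each entry stores
# whether the branch at that (hashed) PC was taken last time.
class OneBitBPB:
    def __init__(self, bits=4):
        self.table = [0] * (1 << bits)   # 0 = predict not taken
        self.mask = (1 << bits) - 1

    def index(self, pc):
        return (pc >> 2) & self.mask     # drop word-alignment bits (assumed scheme)

    def predict(self, pc):
        return bool(self.table[self.index(pc)])

    def update(self, pc, taken):
        self.table[self.index(pc)] = int(taken)

bpb = OneBitBPB()
pc = 0x64                                # BNE LOOP at 0x00000064
print(bpb.predict(pc))                   # False (initial prediction: not taken)
bpb.update(pc, True)
print(bpb.predict(pc))                   # True (it was taken last time)
```

Note the weakness a 1-bit scheme has on loops: it mispredicts twice per loop, once on loop exit and once on the next loop entry.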
  • 69. PC + +4 Program Memory Regfile (R0-R14) CTRL Box Rn Rm Vn Vm Rs Vs Shift ALU + +offset Flag Vd Vidx Addr Read Inst Stage1 Fetch Stage2 Decode Stage3 Execute Stage4 Memory #imm How Would We Like to Handle Branches? BPB Jump Add Decision
  • 70. 1-Bit Branch Prediction (Not Taken) 0x00000040 : CMP R0, #0 0x00000044 : BEQ ZERO 0x00000048 : ONE ADD R1, R1, #1 0x0000004C : ZERO ADD R2, R2, #1 0x44 = 0b01000100 0001 0 = Not Taken 1 = Taken 0 * * BPB 0x0000004C * * Jump Address
  • 71. 1-Bit Branch Prediction FETCH FETCH EXE DEC EXE MEM MEM WB WB 1. CMP R0, #0 2. BEQ ZERO 3. ONE ADD R1, R1, #1 FETCH Clock Cycle 1 2 3 4 5 6 7 8 Fetch the next Instruction before the prediction DEC EXE MEM DEC 1. CMP R0, #0 2. BEQ ZERO 3. ZERO ADD R2, R2, #1 If Prediction is Wrong FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH DEC EXE MEM DEC PC+4 B_addr FETCH DEC PC+4
  • 72. 1-Bit Branch Prediction (Taken) 0x00000040 : CMP R0, #0 0x00000044 : BEQ ZERO 0x00000048 : ONE ADD R1, R1, #1 0x0000004C : ZERO ADD R2, R2, #1 0x44 = 0b01000100 0001 0 = Not Taken 1 = Taken 1 * * BPB 0x0000004C * * Jump Address
  • 73. 1-Bit Branch Prediction FETCH FETCH EXE DEC EXE MEM MEM WB WB 1. CMP R0, #0 2. BEQ ZERO 3. ZERO ADD R2, R2, #1 Clock Cycle 1 2 3 4 5 6 7 8 Fetch the next Instruction after the prediction DEC 1. CMP R0, #0 2. BEQ ZERO 3. ONE ADD R1, R1, #1 If Prediction is Wrong FETCH FETCH EXE DEC EXE MEM MEM WB WB FETCH DEC EXE MEM DEC FETCH 0x4C PC+4 EXE MEM WB DEC FETCH 0x4C DEC
  • 74. Exercise 5.4 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall using Branch Not-Taken method to avoid branch hazard. MOV R0, #0x41000000 MOV R1, #0x0 MOV R2, #0x82 B1 ADD R1, R1, #1 ADD R3, R3, R0 SUB R4, R4, R2 CMP R1, #0x2 BNE B1 NOP
  • 76. Exercise 5.5 ▪ Draw Pipeline stage diagram of these instructions with pipeline stall using Branch Prediction to avoid branch hazard. MOV R0, #0x41000000 MOV R1, #0x0 MOV R2, #0x82 B1 ADD R1, R1, #1 ADD R3, R3, R0 SUB R4, R4, R2 CMP R1, #0x2 BNE B1 NOP
  • 77. Exercise 5.6 ▪ Calculate the stall cycles caused by the B1 branch in Exercise 5.4. MOV R0, #0x41000000 MOV R1, #0x0 MOV R2, #0x82 B1 ADD R1, R1, #1 ADD R3, R3, R0 SUB R4, R4, R2 CMP R1, #0x10 BNE B1 NOP
  • 78. Exercise 5.7 ▪ Calculate the prediction success rate (hit rate) of the B1 branch in this program, assuming the initial prediction is Not-Taken, and calculate the stall cycles caused by the B1 branch in Exercise 5.5. MOV R0, #0x41000000 MOV R1, #0x0 MOV R2, #0x82 B1 ADD R1, R1, #1 ADD R3, R3, R0 SUB R4, R4, R2 CMP R1, #0x10 BNE B1 NOP