lec04-pipelining-intro&hazards.ppt

Lecture 4: Pipelining
Basics & Hazards
Kai Bu
kaibu@zju.edu.cn

Lab Opening Hours:
Mon – Thu 13:00 – 16:00
Thu 9:00 – 12:00 Sun 14:00 – 17:00
Assignment 1 Submission

Outline
• Part 1 Basics
what’s pipelining
pipelining principles
RISC and its five-stage pipeline
• Part 2 Challenges: Pipeline Hazards
structural hazard
data hazard
control hazard

What’s Pipelining
You already knew!
Try the laundry example:

Laundry Example
Ann, Brian, Cathy, Dave
Each has one load of clothes to
wash, dry, fold.
washer
30 mins
dryer
40 mins
folder
20 mins

Sequential Laundry
What would you do?
Task
Order
A
B
C
D
Time
30 40 20 30 40 20 30 40 20 30 40 20
6 Hours

Pipelined Laundry
Observations
• A task has a series
of stages;
• Stage dependency:
e.g., wash before
dry;
• Multi tasks with
overlapping stages;
• Simultaneously use
diff resources to
speed up;
• Slowest stage
determines the
finish time;
Task
Order
A
B
C
D
Time
30 40 40 40 40 20
3.5 Hours

Pipelined Laundry
Observations
• No speed up for
individual task;
e.g., A still takes
30+40+20=90
• But speed up for
average task
execution time;
e.g.,
3.5*60/4=52.5 <
30+40+20=90
Task
Order
A
B
C
D
Time
30 40 40 40 40 20
3.5 Hours
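
The laundry numbers above can be reproduced with a short sketch. It assumes each machine handles one load at a time and that a finished load may wait between machines; the function names are illustrative, not from the slides.

```python
def sequential_time(stage_minutes, n_tasks):
    """Total minutes when each task finishes every stage before the next task starts."""
    return n_tasks * sum(stage_minutes)


def pipelined_time(stage_minutes, n_tasks):
    """Total minutes when tasks overlap, with one task per stage at a time."""
    # finish[j] = time at which the most recent task left stage j
    finish = [0] * len(stage_minutes)
    for _ in range(n_tasks):
        ready = 0  # time at which this task has left its previous stage
        for j, minutes in enumerate(stage_minutes):
            start = max(ready, finish[j])  # wait for the task and for the machine
            finish[j] = start + minutes
            ready = finish[j]
    return finish[-1]


stages = [30, 40, 20]  # wash, dry, fold (minutes)
print(sequential_time(stages, 4))     # 360 minutes = 6 hours
print(pipelined_time(stages, 4))      # 210 minutes = 3.5 hours
print(pipelined_time(stages, 4) / 4)  # 52.5 minutes per load on average
```

The 40-minute dryer is the slowest stage, so every load after the first finishes 40 minutes after the one before it.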

Pipelining
• An implementation technique
whereby multiple instructions are
overlapped in execution.
e.g., B wash while A dry
• Essence: Start executing one
instruction before completing the
previous one.
• Significance: Make fast CPUs.
A
B

Balanced Pipeline
• Equal-length pipe stages
e.g., wash, dry, fold = 40 mins each
unpipelined time per laundry load = 40 x 3 mins
3 pipe stages – wash, dry, fold
Time slots T1 to T4, 40 min each:
Task   T1     T2     T3     T4
A      wash   dry    fold
B             wash   dry    fold
C                    wash   dry
D                           wash

Balanced Pipeline
• Performance
one task/instruction completes every 40 mins
Time per instruction by pipeline
= (Time per instr on unpipelined machine) / (Number of pipe stages)
Speedup by pipeline
= Number of pipe stages
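
A small sketch of the balanced-pipeline relations, assuming equal-length stages and no stalls. The helper names are illustrative. It shows that the speedup approaches the number of pipe stages only when many tasks flow through the pipeline.

```python
def balanced_pipeline_time(stage_time, n_stages, n_tasks):
    """Total time for n_tasks through n_stages equal-length stages, no stalls."""
    # The first task needs n_stages slots; each later task finishes one slot later.
    return (n_stages + n_tasks - 1) * stage_time


def speedup(stage_time, n_stages, n_tasks):
    """Unpipelined total time divided by pipelined total time."""
    unpipelined = n_tasks * n_stages * stage_time
    return unpipelined / balanced_pipeline_time(stage_time, n_stages, n_tasks)


print(balanced_pipeline_time(40, 3, 4))  # 240 minutes for 4 loads
print(speedup(40, 3, 4))                 # 2.0 with only 4 loads
print(round(speedup(40, 3, 10_000), 3))  # 2.999, close to the 3 pipe stages
```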

Pipelining Terminology
• Latency: the time for an instruction to
complete.
• Throughput of a CPU: the number of
instructions completed per second.
• Clock cycle: everything in CPU moves in
lockstep; synchronized by the clock.
• Processor Cycle: time required between
moving an instruction one step down the
pipeline;
= time required to complete a pipe stage;
= max(times for completing all stages);
= one or two clock cycles, but rarely more.
• CPI: clock cycles per instruction

RISC: Reduced Instruction Set Computer
Properties:
• All operations on data apply to data in
registers and typically change the entire
register (32 or 64 bits per reg);
• Only load and store operations affect
memory;
load: move data from mem to reg;
store: move data from reg to mem;
• Only a few instruction formats; all
instructions typically being one size.

32 registers
3 classes of instructions - 1
• ALU (Arithmetic Logic Unit) instructions
operate on two regs or a reg + a sign-
extended immediate;
store the result into a third reg;
e.g., add (DADD), subtract (DSUB)
logical operations AND, OR

• Load (LD) and store (SD) instructions
operands: base register + offset;
the sum (called effective address) is used as
a memory address;
Load: use a second reg operand as the
destination for the data loaded from memory;
Store: use a second reg operand as the
source of the data stored into memory.

• Branches and jumps
conditional transfers of control;
Branch:
specify the branch condition with a set of
condition bits or comparisons between two
regs or between a reg and zero;
decide the branch destination by adding a
sign-extended offset to the current PC
(program counter);
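
A minimal sketch of the branch-destination rule above: sign-extend the offset field, then add it to the PC. The 16-bit field width is an assumption, and ISA-specific details (such as scaling the offset or using the address of the following instruction) are left out on purpose.

```python
def sign_extend(value, bits=16):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def branch_target(pc, offset_field, bits=16):
    """Branch destination = current PC + sign-extended offset (simplified)."""
    return pc + sign_extend(offset_field, bits)


print(sign_extend(0x0008))            # 8
print(sign_extend(0xFFF8))            # -8
print(branch_target(0x1000, 0xFFF8))  # 4088, i.e. 8 bytes before 0x1000
```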

at most 5 clock cycles per instruction – 1
IF ID EX MEM WB
• Instruction Fetch cycle
send the PC to memory;
fetch the current instruction from mem;
PC = PC + 4; //each instr is 4 bytes

IF ID EX MEM WB
• Instruction Decode/register fetch cycle
decode the instruction;
read the registers (corresponding to
register source specifiers);

IF ID EX MEM WB
• Execution/effective address cycle
ALU operates on the operands from ID:
3 functions depending on the instr type - 1
-Memory reference: ALU adds base register
and offset to form effective address;

IF ID EX MEM WB
• Execution/effective address cycle
-Register-Register ALU instruction: ALU
performs the operation specified by opcode
on the values read from the register file;

IF ID EX MEM WB
• EXecution/effective address cycle
-Register-Immediate ALU instruction: ALU
operates on the first value read from the
register file and the sign-extended
immediate.

IF ID EX MEM WB
• MEMory access
for load instr: the memory does a read
using the effective address;
for store instr: the memory writes the
data from the second register using the
effective address.

IF ID EX MEM WB
• Write-Back cycle
for Register-Register ALU or load instr;
write the result into the register file,
whether it comes from the memory (for
load) or from the ALU (for ALU instr).

at most 5 clock cycles per instruction
IF ID EX MEM WB

RISC: Five-Stage Pipeline
Simply start a new instruction
on each clock cycle;
Ideal speedup = 5 (the number of pipe stages).
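
A sketch of the ideal five-stage schedule, assuming no hazards, so instruction i enters IF in cycle i + 1. The function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(n_instructions):
    """Map each instruction index to {clock cycle: stage} for a hazard-free pipeline."""
    return {
        i: {i + 1 + s: name for s, name in enumerate(STAGES)}
        for i in range(n_instructions)
    }


def total_cycles(n_instructions):
    """Cycles until the last instruction leaves WB."""
    return len(STAGES) + n_instructions - 1


def render(n_instructions):
    """Text diagram: one row per instruction, one column per clock cycle."""
    schedule = pipeline_schedule(n_instructions)
    rows = []
    for i in range(n_instructions):
        cells = [schedule[i].get(c, "") for c in range(1, total_cycles(n_instructions) + 1)]
        rows.append(f"i+{i}: " + " ".join(f"{cell:<3}" for cell in cells).rstrip())
    return "\n".join(rows)


print(render(3))
print(total_cycles(5))  # 9 cycles for 5 instructions, versus 25 unpipelined
```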

• How it works
separate instruction and data mems
to eliminate conflicts for a single
memory between instruction fetch
and data memory access.
IF MEM
Instr mem Data mem

• How it works
use the register file in two stages;
each stage uses half a clock cycle (CC);
in one clock cycle, write before read
ID WB
read write

• How it works
introduce pipeline registers between
successive stages;
pipeline registers store the results of
a stage and use them as the input of
the next stage.

• How it works

• How it works - omit pipeline regs
for simplicity
but required in implementation

• Example
Consider an unpipelined processor.
1 ns clock cycle;
4 cycles for ALU operations and branches;
5 cycles for memory operations;
relative frequencies: ALU 40%, branches 20%, memory 40%;
0.2 ns pipeline overhead (e.g., due to
stage imbalance, pipeline register setup,
clock skew)
Question: How much speedup by pipeline?

• Answer
speedup by pipelining
= (Avg instr time unpipelined) / (Avg instr time pipelined)
= ?

• Answer
Avg instr time unpipelined
= clock cycle x avg CPI
= 1 ns x [(0.4+0.2)x4 + 0.4x5]
= 4.4 ns
Avg instr time pipelined
= clock cycle + pipeline overhead
= 1+0.2
= 1.2 ns

• Answer
speedup by pipelining
= (Avg instr time unpipelined) / (Avg instr time pipelined)
= 4.4 ns / 1.2 ns
= 3.7 times
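
The same arithmetic as a short sketch. Pairing 40%, 20%, 40% with ALU, branch, and memory operations follows the order in the example and is an assumption; the variable names are illustrative.

```python
clock_ns = 1.0
overhead_ns = 0.2

# instruction class -> (relative frequency, cycles when unpipelined)
mix = {"ALU": (0.4, 4), "branch": (0.2, 4), "memory": (0.4, 5)}

avg_cpi = sum(freq * cycles for freq, cycles in mix.values())
unpipelined_ns = clock_ns * avg_cpi    # average instruction time, unpipelined
pipelined_ns = clock_ns + overhead_ns  # one instruction per (slower) clock cycle
speedup = unpipelined_ns / pipelined_ns

print(round(unpipelined_ns, 2), round(pipelined_ns, 2), round(speedup, 2))  # 4.4 1.2 3.67
```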

When Pipeline Is Stuck
LD R1, 0(R2)
DSUB R4, R1, R5
R1
R1

Pipeline Hazards
• Hazards: situations that prevent the
next instruction from executing in the
designated clock cycle.
• 3 classes of hazards:
structural hazard – resource conflicts
data hazard – data dependency
control hazard – PC changes
(e.g., branches)

Structural Hazard
• Root Cause: resource conflicts
e.g., a processor with 1 reg write port
that needs to perform two writes in one CC
• Solution
stall one of the instructions
until required unit is available

Structural Hazard
• Example
1 mem port
mem conflict: data access (Load in MEM) vs instr fetch (Instr i+3 in IF)
instructions in flight: Load, Instr i+1, Instr i+2, Instr i+3

Structural Hazard
Stall Instr i+3
till CC 5

Structural Hazard
• Example
ideal CPI is 1;
40% of instructions are data references;
the processor with the structural hazard has a
clock rate 1.05 times higher than the ideal one;
Question:
which pipeline is faster, with or without the hazard?
by how much?

Structural Hazard
• Answer
avg instr time w/o hazard
= CPI x clock cycle time_ideal
= 1 x clock cycle time_ideal
avg instr time w/ hazard (stall for one clock cycle)
= (1 + 0.4x1) x (clock cycle time_ideal / 1.05)
= 1.3 x clock cycle time_ideal
So, w/o hazard is 1.3 times faster.
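
A short sketch of this comparison, assuming each data reference stalls for exactly one clock cycle. The ideal clock cycle is set to an arbitrary unit of 1.0; the names are illustrative.

```python
def avg_instr_time(cpi, clock_cycle):
    """Average instruction time = CPI x clock cycle time."""
    return cpi * clock_cycle


ideal_cycle = 1.0         # arbitrary time unit
data_ref_fraction = 0.4   # 40% of instructions are data references
stall_cycles = 1          # one-cycle stall per data reference

no_hazard = avg_instr_time(1.0, ideal_cycle)
with_hazard = avg_instr_time(
    1.0 + data_ref_fraction * stall_cycles,  # CPI grows to 1.4
    ideal_cycle / 1.05,                      # 1.05x higher clock rate
)

print(round(with_hazard / no_hazard, 2))  # 1.33, which the slide rounds to 1.3
```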

Data Hazard
• Root Cause: data dependency
when the pipeline changes the order
of read/write accesses to operands;
so that the order differs from the
order seen by sequentially executing
instructions on an unpipelined
processor.

Data Hazard
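DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
R1 is written by DADD and read by the four instructions that follow.
OR: register file is written in the 1st half cycle and read in the 2nd half cycle
XOR: no hazard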

Data Hazard
• Solution: forwarding
directly feed back EX/MEM&MEM/WB
pipeline regs’ results to the ALU inputs;
if forwarding hw detects that previous
ALU has written the reg corresponding
to a source for the current ALU,
control logic selects the forwarded
result as the ALU input.

Data Hazard: Forwarding
DADD R1, R2, R3
DSUB R4, R1, R5 (R1 forwarded from EX/MEM)
AND R6, R1, R7 (R1 forwarded from MEM/WB)
OR R8, R1, R9
XOR R10, R1, R11
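
A rough sketch of how the forwarding choice for the DADD sequence can be modeled. It assumes every producer is an ALU instruction and that the path depends only on how many instructions separate producer and consumer. The tuple format and names are illustrative, not real hardware control logic.

```python
# Assumed mapping from producer-consumer distance to the path supplying the value.
PATH_BY_DISTANCE = {
    1: "EX/MEM",
    2: "MEM/WB",
    3: "register file (write 1st half, read 2nd half)",
}


def forwarding_paths(program):
    """program: list of (op, dest, src1, src2). Returns one {src: path} dict per instruction."""
    paths = []
    for i, (_op, _dest, *sources) in enumerate(program):
        needed = {}
        for src in sources:
            # Search the three preceding instructions, nearest first, for a writer of src.
            for distance in range(1, min(i, 3) + 1):
                if program[i - distance][1] == src:
                    needed[src] = PATH_BY_DISTANCE[distance]
                    break
        paths.append(needed)
    return paths


program = [
    ("DADD", "R1", "R2", "R3"),
    ("DSUB", "R4", "R1", "R5"),
    ("AND", "R6", "R1", "R7"),
    ("OR", "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]
for instr, needed in zip(program, forwarding_paths(program)):
    print(instr[0], needed)
```

An empty dict means the value is read from the register file with no special handling, as for XOR.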

• Generalized forwarding
pass a result directly to the functional
unit that requires it;
forward results to not only ALU inputs
but also other types of functional units;

• Generalized forwarding
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
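(figure: R1 is forwarded from DADD to LD and SD for address calculation; R4 is forwarded from LD to SD as the store data)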

Data Hazard
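• Sometimes stall is necessary
LD R1, 0(R2)
DSUB R4, R1, R5
LD produces R1 only at the end of its MEM stage (MEM/WB),
but DSUB needs R1 at the start of its EX stage, in that same cycle.
Forwarding cannot go backward in time.
The pipeline has to stall.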
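
A minimal sketch of load-use stall detection, assuming full forwarding so that only a load followed immediately by a consumer of its result costs a one-cycle stall. The tuple format (op, dest, sources...) is illustrative.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by a load directly followed by a user of its result."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, *_ = prev
        _, _, *sources = curr
        if op == "LD" and dest in sources:
            stalls += 1  # forwarding cannot deliver the loaded value in time
    return stalls


program = [
    ("LD", "R1", "R2"),          # LD R1, 0(R2)
    ("DSUB", "R4", "R1", "R5"),  # DSUB R4, R1, R5
]
print(count_load_use_stalls(program))  # 1
```

Placing an independent instruction between the load and its consumer would bring the count to 0 in this model.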

Control Hazard
• branches and jumps
• Branch hazard
a branch may or may not change PC
to a value other than PC+4;
taken branch: changes PC to its
target address;
untaken branch: falls through;
PC is not changed till the end of ID;

Branch Hazard
• Redo IF
essentially a stall;
if the branch is untaken,
the stall is unnecessary.

Branch Hazard: Solutions
4 simple compile time schemes – 1
• Freeze or flush the pipeline
hold or delete any instructions after the
branch till the branch dst is known;
i.e., Redo IF w/o the first IF

• Predicted-untaken
simply treat every branch as untaken;
when the branch is untaken,
pipelining as if no hazard.

• Predicted-untaken
but if the branch is taken:
turn fetched instr into a no-op (idle);
restart the IF at the branch target addr

• Predicted-taken
simply treat every branch as taken;
does not apply to the five-stage pipeline;
applies when the branch target
addr is known before the branch outcome.

• Delayed branch
delay the branch execution until after the
next instruction;
pipelining sequence:
branch instruction
sequential successor (the next instruction, in the branch delay slot)
branch target if taken

• Delayed branch

Branch Hazard: Performance
• Example
a deeper pipeline (e.g., in MIPS R4000)
with the following branch penalties:
and the following branch frequencies:
Question: find the effective addition to
the CPI arising from branches.

Branch Hazard: Performance
• Answer
find each CPI addition by
relative frequency x respective penalty,
then sum the additions.
e.g., 0.04x2 = 0.08 and 0.10x3 = 0.30
0.08+0.30 = 0.38
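
A small sketch of the frequency-times-penalty sum. Only the two products visible in the answer are used; the penalty and frequency tables themselves are not reproduced here, so the input list is an assumption, not the full example.

```python
def branch_cpi_addition(terms):
    """Sum of (relative frequency x penalty cycles) over branch classes."""
    return sum(freq * penalty for freq, penalty in terms)


# (relative frequency, penalty in cycles), taken from the two products shown above
terms = [(0.04, 2), (0.10, 3)]
print(round(branch_cpi_addition(terms), 2))  # 0.38
```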

Conclusion
• Pipelining promises fast CPU by
starting the execution of one
instruction before completing the
previous one.
• Classic five-stage pipeline for RISC
IF – ID – EX – MEM – WB
• Pipeline hazards limit ideal pipelining
structural/data/control hazard

Further Readings
• RISC wiki
http://en.wikipedia.org/wiki/Reduced_instruction_set_computing
• MIPS wiki
http://en.wikipedia.org/wiki/MIPS_architecture
• RISC Processors
http://www.scs.carleton.ca/sivarama/org_book/org_book_web/solution_manual/org_soln_one/arch_book_solution_ch14.pdf
• …
