Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction Fetch

ECE 4100/6100
Advanced Computer Architecture
Lecture 6 Instruction Fetch
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Instruction Supply Issues
• Fetch throughput defines max performance that can be achieved in later
stages
• Superscalar processors need to supply more than 1 instruction per cycle
• Instruction Supply limited by
– Misalignment of multiple instructions in a fetch group
– Change of Flow (interrupting instruction supply)
– Memory latency and bandwidth
[Figure: Instruction Fetch Unit → Instruction buffer → Execution Core]

Aligned Instruction Fetching (4 instructions)
[Figure: one 64B I-cache line holding A0 to A15, spread over four banks (00, 01, 10, 11), each with its own row decoder. Row ..00 holds A0 to A3, row ..01 holds A4 to A7, row ..10 holds A8 to A11, row ..11 holds A12 to A15.]
• PC=..xx000000
• Assume one fetch group = 16B
• Can pull out one row at a time
• Cycle n: inst 1, inst 2, inst 3, inst 4

Misaligned Fetch
[Figure: the same 64B I-cache line (A0 to A15, four banks 00 to 11, rows ..00 to ..11), with a rotating network between the banks and inst 1 to inst 4.]
• PC=..xx001000
• Cycle n: inst 1, inst 2, inst 3, inst 4 pass through the rotating network
• IBM RS/6000
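A minimal Python sketch of how a misaligned fetch group might be assembled from the banked layout on this slide. It assumes 4-byte instructions (so a 16B fetch group is 4 instructions, one per bank) and that each bank can be given its own row address; `fetch_group` and its arguments are hypothetical names, not taken from the lecture.

```python
INSTS_PER_ROW = 4   # assumption: four banks, one 4-byte instruction per bank

def fetch_group(line, pc_offset):
    """Return the 4 instructions starting at byte offset pc_offset in one line.

    line is a list of 16 instruction names (line[0] is A0).
    Only groups that stay inside the cache line are handled here.
    """
    start = pc_offset // 4
    assert start + INSTS_PER_ROW <= len(line), "group would cross the cache line"
    start_row, start_bank = divmod(start, INSTS_PER_ROW)
    # Banks before the starting bank read the next row; the rest read the starting row.
    bank_out = []
    for bank in range(INSTS_PER_ROW):
        row = start_row + 1 if bank < start_bank else start_row
        bank_out.append(line[row * INSTS_PER_ROW + bank])
    # Rotating network: rotate left by start_bank to restore program order.
    return bank_out[start_bank:] + bank_out[:start_bank]

line_a = [f"A{i}" for i in range(16)]
aligned = fetch_group(line_a, 0b000000)     # A0 A1 A2 A3
misaligned = fetch_group(line_a, 0b001000)  # A2 A3 A4 A5
```

With PC offset 001000 the banks deliver A4, A5, A2, A3 in bank order, and the rotation turns that back into A2, A3, A4, A5.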

Split Cache Line Access
[Figure: cache line A (A0 to A15, four banks 00 to 11, rows ..00 to ..11) followed by cache line B (B0 to B7 shown).]
• PC=..xx111000
• Cycle n: inst 1, inst 2
• Cycle n+1: inst 1, inst 2
• The fetch is broken down into 2 physical accesses
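A rough Python sketch of why this fetch needs two physical accesses. It assumes 64B lines, a 16B fetch group, and that cache line B is the next sequential line after A; `physical_accesses` is a hypothetical helper for illustration only.

```python
LINE_BYTES = 64
GROUP_BYTES = 16

def physical_accesses(pc):
    """Split one 16B fetch starting at pc into per-line accesses.

    Returns a list of (line_address, byte_offset, num_bytes) tuples.
    """
    offset = pc % LINE_BYTES
    line_addr = pc - offset
    in_first = min(GROUP_BYTES, LINE_BYTES - offset)
    accesses = [(line_addr, offset, in_first)]
    if in_first < GROUP_BYTES:
        # The rest of the group lives at the start of the next line.
        accesses.append((line_addr + LINE_BYTES, 0, GROUP_BYTES - in_first))
    return accesses

split = physical_accesses(0b111000)   # offset 56: 8 bytes from line A, 8 from line B
```

Under these assumptions, offset 111000 leaves only 8 bytes (two instructions) in line A, so the remaining two instructions come from line B in a second access.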

Split Cache Line Access Miss
[Figure: cache line A (A0 to A15, four banks 00 to 11, rows ..00 to ..11) followed by cache line C (C0 to C7 shown).]
• PC=..xx111000
• Cache line B misses
• Cycle n, then Cycle n+X

High Bandwidth Instruction Fetching
[Figure: control-flow graph of basic blocks BB1 to BB7]
• Wider issue → more instruction feed
• Major challenge: to fetch more than one non-contiguous basic block per cycle
• Enabling technique?
– Predication
– Branch alignment based on profiling
– Other hardware solutions (branch prediction is a given)

Predication Example
• Convert control dependency into data dependency
• Enlarge basic block size
– More room for scheduling
– No fetch disruption
Source code:
if (a[i+1] > a[i])
    a[i+1] = 0
else
    a[i] = 0
Typical assembly:
lw r2, [r1+4]        # r2 = a[i+1]
lw r3, [r1]          # r3 = a[i]
blt r3, r2, L1       # branch if a[i] < a[i+1]
sw r0, [r1]          # else path: a[i] = 0
j L2
L1:
sw r0, [r1+4]        # then path: a[i+1] = 0
L2:
Assembly w/ predication:
lw r2, [r1+4]        # r2 = a[i+1]
lw r3, [r1]          # r3 = a[i]
sgt p4, r2, r3       # p4 = (a[i+1] > a[i])
(p4) sw r0, [r1+4]   # takes effect only if p4 is true
(!p4) sw r0, [r1]    # takes effect only if p4 is false
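To make the control-to-data dependency conversion concrete, here is a small Python sketch that mirrors the two assembly versions. It only models the semantics: Python has no predicated stores, so the conditional expressions stand in for them, and the function names are made up for this illustration.

```python
def branchy(a, i):
    """Mirrors the 'typical assembly': one of two paths is fetched."""
    if a[i + 1] > a[i]:
        a[i + 1] = 0
    else:
        a[i] = 0
    return a

def predicated(a, i):
    """Mirrors the predicated assembly: both stores are in the instruction stream."""
    p4 = a[i + 1] > a[i]                   # sgt p4, r2, r3
    a[i + 1] = 0 if p4 else a[i + 1]       # (p4)  sw r0, [r1+4]
    a[i] = 0 if not p4 else a[i]           # (!p4) sw r0, [r1]
    return a

same = all(
    branchy(list(v), 0) == predicated(list(v), 0)
    for v in ([1, 2], [2, 1], [3, 3])
)
```

Both versions leave memory in the same state; the predicated one simply has no branch for the fetch unit to follow.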

Collapsing Buffer [ISCA 95]
• To fetch multiple (often non-contiguous)
instructions
• Use interleaved BTB to enable multiple
branch predictions
• Align instructions in the predicted sequential
order
• Use banked I-cache for multiple line access

Collapsing Buffer
[Figure: Fetch PC, Interleaved BTB, Cache Bank 1 and Cache Bank 2 → Interchange Switch → Collapsing Circuit]

Collapsing Buffer Mechanism
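[Figure: data flow through the collapsing buffer]
• Interleaved BTB: A, E
• Bank Routing: E, A
• Cache bank outputs: E F G H and A B C D
• Interchange Switch: A B C D E F G H
• Collapsing Circuit (using Valid Instruction Bits): A B C E G; D, F, H are squeezed out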
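The data flow in this figure can be sketched in a few lines of Python. This is only an illustration: the valid bits are taken as given inputs (in hardware they would be derived from the BTB predictions), cache lines are keyed by their first instruction, and `collapsing_fetch` is a hypothetical name.

```python
def collapsing_fetch(bank_lines, fetch_order, valid):
    """Order two cache lines, then drop instructions whose valid bit is 0."""
    # Interchange switch: put the bank outputs into predicted fetch order.
    ordered = [inst for line_id in fetch_order for inst in bank_lines[line_id]]
    # Collapsing circuit: squeeze out the invalid instructions.
    return [inst for inst in ordered if valid[inst]]

bank_lines = {"E": ["E", "F", "G", "H"], "A": ["A", "B", "C", "D"]}
valid = {inst: inst not in ("D", "F", "H") for inst in "ABCDEFGH"}
fetched = collapsing_fetch(bank_lines, ["A", "E"], valid)  # A B C E G
```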

High Bandwidth Instruction Fetching
[Figure: control-flow graph of basic blocks BB1 to BB7]
• To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
• Multiple branch predictions

Multiple Branch Predictor [Yeh, Marr, Patt ICS’93]
• Pattern History Table (PHT) design to support MBP
• Based on global history only
[Figure: the Branch History Register (BHR, bits bk … b1) indexes the Pattern History Table (PHT); the PHT outputs the primary (p1), secondary (p2), and tertiary predictions; p1 and p2 feed the BHR update]
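A minimal sketch of making three predictions from global history alone, assuming a PHT of 2-bit saturating counters and a hypothetical history length `K`. Hardware would read the candidate PHT entries in parallel and select among them with p1 and p2; this sketch computes the equivalent indices one after another, so treat it as a functional model rather than the design in the paper.

```python
K = 6                    # history length, chosen arbitrarily for the example
MASK = (1 << K) - 1

def predict_three(pht, bhr):
    """pht: list of 2**K two-bit counters (0..3); bhr: K-bit global history."""
    def taken(index):
        return pht[index] >= 2
    p1 = taken(bhr)                        # primary prediction
    h2 = ((bhr << 1) | int(p1)) & MASK     # history after speculatively shifting in p1
    p2 = taken(h2)                         # secondary prediction
    h3 = ((h2 << 1) | int(p2)) & MASK      # history after shifting in p2 as well
    p3 = taken(h3)                         # tertiary prediction
    return p1, p2, p3

pht = [0] * (1 << K)
pht[0b000000] = 3    # strongly taken
pht[0b000010] = 3    # strongly taken
preds = predict_three(pht, 0b000000)
```

Here the primary prediction is taken, which moves the speculative history to 000001 (not taken), then to 000010 (taken).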

Multiple Branch Prediction
• Fetch address could be retrieved from BTB
• Predicted path: BB1 → BB2 → BB5
• How to fetch BB2 and BB5? BTB?
– Can’t. Branch PCs of br1 and br2 are not available when the MBP is made
– Use a BAC design
[Figure: binary tree of basic blocks. BB1 (ending in br1) leads to BB2 (ending in br2) and BB3, which lead to BB4, BB5, BB6, BB7. Edges are labeled T/F; the predicted path follows T (2nd prediction) and then F (3rd prediction). The fetch address (br0, primary prediction) comes from a BTB entry.]
Branch Address Cache
• Use a Branch Address Cache (BAC): Keep 6 possible fetch addresses for
2 more predictions
• br: 2 bits for branch type (cond, uncond, return)
• V: single valid bit (to indicate if hits a branch in the sequence)
• To make one more level prediction
– Need to cache another 8 more addresses (i.e. total=14 addresses)
– 464 bits per entry = (23+3)*1 + (30+3) * (2+4) + 30*8
[Figure: BAC entry layout, indexed by Fetch Addr (from BTB). Fields: Tag (23 bits); three V/br pairs; Taken Target Address and Not-Taken Target Address (30 bits each, marked 1); T-T, T-N, N-T, N-N Addresses (marked 2). 212 bits per fetch address entry.]
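The bit counts on this slide can be checked with a short calculation. The field widths (23-bit tag, 30-bit addresses, 1-bit V plus 2-bit br) come from the slide; the generalization to an arbitrary number of levels in `bac_entry_bits` is an assumption made for this sketch.

```python
TAG, ADDR, V_BR = 23, 30, 3   # V is 1 bit, br is 2 bits

def bac_entry_bits(levels):
    """Bits per BAC entry when predicting `levels` branches beyond the primary one."""
    bits = TAG + V_BR
    for level in range(1, levels + 1):
        addresses = 2 ** level
        # Assumed: only the last level stores bare addresses without V/br.
        per_address = ADDR if level == levels else ADDR + V_BR
        bits += addresses * per_address
    return bits

two_more = bac_entry_bits(2)     # (23+3) + (30+3)*2 + 30*4
three_more = bac_entry_bits(3)   # (23+3) + (30+3)*(2+4) + 30*8
```

This reproduces the 212 bits for two more predictions and the 464 bits for one further level, which shows how quickly the entry grows with prediction depth.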

Caching Non-Consecutive Basic Blocks
• High Fetch Bandwidth + Low Latency
[Figure: BB1 to BB5 scattered across separate locations. Caption: Fetch in Conventional Instruction Cache]
[Figure: BB1 BB2 BB3 BB4 BB5 stored back to back. Caption: Fetch in Linear Memory Location]

Trace Cache
• Cache dynamic non-contiguous instructions (traces)
• Cross multiple basic blocks
• Need to predict multiple branches (MBP)
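[Figure: the dynamic sequence A B C D E F G H I J fetched three ways; in the I$ the instructions sit in separate lines (A B), (C), (D), (E F G), (H I J K)]
• I$ Fetch (5 cycles): A B, then C, then D, then E F G, then H I J
• Collapsing Buffer Fetch (3 cycles): A B C, then D E F G, then H I J
• T$ Fetch (1 cycle): A B C D E F G H I J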

Trace Cache [Rotenberg, Bennett, Smith MICRO’96]
• Each trace caches at most (in original paper)
– M branches (M = 3 in all follow-up TC studies due to MBP), OR
– N instructions (N = 16 in all follow-up TC studies)
• Fall-thru address if last branch is predicted not taken
[Figure: a trace cache entry, looked up with the Fetch Addr, with fields Tag, Br flag, Br mask, Fall-thru Address, Taken Address, holding BB1 BB2 BB3. The MBP supplies predictions for Branch 1, Branch 2, Branch 3. On T.C. hits the entry supplies N instructions or M branches; a line fill buffer is used for a T.C. miss.]
• Br flag example: 10 = 1st Br taken, 2nd Br not taken
• Br mask example: 11, 1
– 11: 3 branches
– 1: the trace ends w/ a branch

Trace Hit Logic
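[Figure: trace hit logic]
• Example entry: Tag = A, BF = 10, Mask = 11,1, Fall-thru = X, Target = Y; Fetch: A
• Fetch address compared (=) with Tag: Match 1st Block
• Multi-BPred outputs (T, N) compared with BF: Match Remaining Block(s)
• The two matches are ANDed: Trace hit
• The last prediction (N) selects the Next Fetch Address: 0 = Fall-thru, 1 = Target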
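The hit check can be sketched in Python as follows. This is a simplified model under stated assumptions: the branch flags cover every branch except a trailing one, predictions arrive as a tuple of booleans from the multiple branch predictor, and the `TraceLine` class and `lookup` function are names invented for this example.

```python
from dataclasses import dataclass

@dataclass
class TraceLine:
    tag: str
    br_flags: tuple        # recorded outcomes, True = taken
    num_branches: int      # first part of the branch mask
    ends_in_branch: bool   # second part of the branch mask
    fall_thru: str
    target: str

def lookup(line, fetch_addr, predictions):
    """Return (hit, next_fetch_address)."""
    match_first = line.tag == fetch_addr
    # A trailing branch is not part of the match; it only picks the next address.
    n = line.num_branches - (1 if line.ends_in_branch else 0)
    match_rest = tuple(predictions[:n]) == tuple(line.br_flags[:n])
    if not (match_first and match_rest):
        return False, None
    if line.ends_in_branch and predictions[n]:
        return True, line.target
    return True, line.fall_thru

line = TraceLine("A", (True, False), 3, True, "X", "Y")
hit_fall_thru = lookup(line, "A", (True, False, False))   # hit, continue at X
hit_target = lookup(line, "A", (True, False, True))       # hit, continue at Y
miss = lookup(line, "A", (False, False, False))           # predictions disagree
```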

Trace Cache Example
[Figure: control-flow graph. A branches to B or C, both lead to D, and D goes back to A or to Exit.]
• Block sizes: A = 5 insts, B = 6 insts, C = 12 insts, D = 4 insts
• BB Traversal Path: ABDABDACDABDACDABDAC
• A trace ends on
– Cond 1: 3 branches
– Cond 2: Fill a trace cache line (16 instructions)
– Cond 3: Exit
Trace Cache (5 lines):
A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
C12 D1 D2 D3 D4 A1 A2 A3 A4 A5
B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4
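The five lines on this slide can be reproduced with a small Python model of trace construction. It assumes the last instruction of every basic block is its branch and ignores trace cache hits during the fill, so it only segments the dynamic instruction stream by the three conditions; `build_traces` is a hypothetical helper.

```python
SIZES = {"A": 5, "B": 6, "C": 12, "D": 4}
MAX_INSTS, MAX_BRANCHES = 16, 3

def build_traces(path):
    """Cut the dynamic instruction stream of `path` into traces."""
    # Assumed: the last instruction of each block is its branch.
    stream = [(f"{bb}{k}", k == SIZES[bb])
              for bb in path for k in range(1, SIZES[bb] + 1)]
    traces, line, branches = [], [], 0
    for name, is_branch in stream:
        line.append(name)
        branches += is_branch
        # Cond 1: 3 branches. Cond 2: the 16-instruction line is full.
        if branches == MAX_BRANCHES or len(line) == MAX_INSTS:
            traces.append(tuple(line))
            line, branches = [], 0
    if line:                     # Cond 3: Exit ends the last, partial trace
        traces.append(tuple(line))
    return traces

traces = build_traces("ABDABDACDABDACDABDAC")
unique = list(dict.fromkeys(traces))     # distinct traces in first-seen order
lengths = [len(t) for t in unique[:5]]   # 15, 16, 10, 15, 16
```

Under these assumptions the first five distinct traces match the five lines shown, and the final pass through C leaves a sixth, one-instruction trace (C12) once the path exits.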

Trace Cache Example (continued)
• Trace Cache is Full
• How many hits?
• What is the utilization?

Redundancy
• Duplication
– Note that instructions only appear
once in I-Cache
– Same instruction appears many times
in TC
• Fragmentation
– If 3 BBs < 16 instructions
– If multiple-target branch (e.g. return,
indirect jump or trap) is encountered,
stop “trace construction”.
– Empty slots ⇒ wasted resources
• Example
– A single BB is broken up to (ABC),
(BCD), (CDA), (DAB)
– Duplicating each instruction 3 times
– (ABC) = 16 inst
– (BCD) = 13 inst
– (CDA) = 15 inst
– (DAB) = 13 inst
[Figure: loop A → B → C → D → A with block sizes A = 6, B = 4, C = 6, D = 3; the Trace Cache holds (A B C), (B C D), (C D A), (D A B)]
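A quick Python check of the duplication in this example. The block sizes A = 6, B = 4, C = 6, D = 3 are the only ones consistent with the four trace lengths listed above; the variable names are made up for the sketch.

```python
SIZES = {"A": 6, "B": 4, "C": 6, "D": 3}
TRACES = ["ABC", "BCD", "CDA", "DAB"]

trace_len = {t: sum(SIZES[b] for b in t) for t in TRACES}
stored = sum(trace_len.values())          # instructions held in the trace cache
unique = sum(SIZES.values())              # instructions held once in the I-cache
copies = {b: sum(b in t for t in TRACES) for b in SIZES}
```

The trace cache stores 57 instruction slots for 19 distinct instructions, because every block appears in three of the four traces.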

Indexability
[Figure: control-flow graph of basic blocks A, B, C, D, E]
• TC saved traces (EAC) and (BCD)
• Path: (EAC) to (D)
– Cannot index interior block (D)
• Can cause duplication
• Need partial matching
– (BCD) is cached, if (BC) is needed
[Figure: Trace Cache holding the traces (E A C) and (B C D)]
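A small Python sketch of the indexing limitation and of partial matching. It assumes the trace cache is indexed only by the first block of each trace; `lookup` and the off-trace block "G" used in the example path are hypothetical.

```python
def lookup(trace_cache, start_block, predicted_path):
    """Return the cached blocks usable for predicted_path (possibly a prefix)."""
    trace = trace_cache.get(start_block)
    if trace is None:
        return []                 # interior blocks cannot be indexed
    matched = []
    for cached, wanted in zip(trace, predicted_path):
        if cached != wanted:
            break                 # partial match: stop at the first divergence
        matched.append(cached)
    return matched

tc = {"E": ["E", "A", "C"], "B": ["B", "C", "D"]}
interior = lookup(tc, "D", ["D"])             # D only exists inside (BCD)
partial = lookup(tc, "B", ["B", "C", "G"])    # only (BC) is needed
```

Without partial matching, the second lookup would be treated as a miss and (BC) would have to be cached again as part of another trace.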

Pentium 4 (NetBurst) Trace Cache
[Figure: Front-end BTB, iTLB and Prefetcher, L2 Cache → Decoder → Trace $ (with Trace $ BTB) → Rename, execute, etc.]
• No I$ !!
• Trace $ holds Decoded Instructions
• Trace-based prediction (predict next-trace, not next-PC)
