ECE 4100/6100
Advanced Computer Architecture
Lecture 6 Instruction Fetch
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Instruction Supply Issues
• Fetch throughput defines the maximum performance that can be achieved in later stages
• Superscalar processors need to supply more than 1 instruction per cycle
• Instruction supply is limited by
– Misalignment of multiple instructions in a fetch group
– Change of flow (interrupting instruction supply)
– Memory latency and bandwidth
[Figure: Instruction Fetch Unit feeding the Execution Core through an instruction buffer]
Aligned Instruction Fetching (4 instructions)
• One 64B I-cache line holds 16 instruction words (A0-A15), laid out as 4 rows (..00 to ..11) by 4 banks (00 to 11), each bank with its own row decoder
• Assume one fetch group = 16B
• Can pull out one row at a time
• PC=..xx000000: inst 1 to inst 4 = A0, A1, A2, A3 (row ..00), all delivered in cycle n
Misaligned Fetch
• Same 64B I-cache line layout (A0-A15, 4 rows by 4 banks, one row decoder per bank)
• PC=..xx001000: the fetch group starts at A2, so inst 1 to inst 4 = A2, A3, A4, A5
• Banks 10 and 11 read row ..00 (A2, A3); banks 00 and 01 read row ..01 (A4, A5)
• A rotating network restores program order, so all 4 instructions are delivered in cycle n
• Example: IBM RS/6000
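The per-bank row selection and rotation for the misaligned case (PC=..xx001000) can be sketched as follows. This is only an illustration under stated assumptions: 4-byte instructions, a 64B line arranged as 4 rows by 4 banks, and a hypothetical helper name `misaligned_fetch`. It is not a description of the actual RS/6000 circuitry.

```python
WORD_BYTES = 4   # assumed instruction size
BANKS = 4        # one bank per column of the line
ROWS = 4         # rows per 64B line


def misaligned_fetch(offset):
    """Return (row select per bank, words in program order) for a fetch
    group that starts at byte `offset` inside one 64B line.

    Assumes the group does not run past the end of the line.
    """
    start = offset // WORD_BYTES                  # first word index, 0..15
    assert start + BANKS <= ROWS * BANKS, "group would split across lines"
    start_row, start_bank = divmod(start, BANKS)
    # Banks to the left of the starting bank must read the next row.
    row_select = [start_row + 1 if bank < start_bank else start_row
                  for bank in range(BANKS)]
    # Words as they come out of the banks, in bank order 00, 01, 10, 11.
    raw = [row_select[bank] * BANKS + bank for bank in range(BANKS)]
    # Rotating network: rotate left by start_bank to restore program order.
    ordered = raw[start_bank:] + raw[:start_bank]
    return row_select, [f"A{word}" for word in ordered]
```

Calling `misaligned_fetch(0b001000)` selects row 1 in banks 00 and 01 and row 0 in banks 10 and 11, and the rotated result is A2, A3, A4, A5.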
Split Cache Line Access
• PC=..xx111000: the fetch group starts at A14 in cache line A and spills into cache line B
• Cycle n: inst 1, inst 2 = A14, A15 (cache line A)
• Cycle n+1: inst 3, inst 4 = B0, B1 (cache line B)
• The fetch is broken down into 2 physical accesses
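A small sketch of how one fetch group breaks into physical accesses when it crosses a line boundary. The constants (64B lines, 4-byte instructions, 4 instructions per group) follow the slides; the function name `split_accesses` is hypothetical.

```python
LINE_BYTES = 64
WORD_BYTES = 4    # assumed instruction size
GROUP_WORDS = 4   # 16B fetch group


def split_accesses(pc):
    """Split one fetch group into per-cache-line physical accesses.

    Returns a list of (line base address, [word indices inside that line]),
    ordered by line address.
    """
    accesses = {}
    for i in range(GROUP_WORDS):
        addr = pc + i * WORD_BYTES
        line_base = addr // LINE_BYTES * LINE_BYTES
        word = (addr % LINE_BYTES) // WORD_BYTES
        accesses.setdefault(line_base, []).append(word)
    return sorted(accesses.items())
```

For a PC whose line offset is 111000, the result has two entries: words 14 and 15 of the first line, then words 0 and 1 of the next line, matching the cycle n / cycle n+1 split on the slide.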
Split Cache Line Access Miss
• PC=..xx111000: the fetch group again starts at A14 in cache line A
• Cycle n: inst 1, inst 2 = A14, A15 (cache line A hits)
• The cache holds line C (C0-C7) rather than line B, so cache line B misses
• Cycle n+X: inst 3, inst 4 arrive only after the miss is serviced
High Bandwidth Instruction Fetching
[Figure: control-flow graph of basic blocks BB1-BB7]
• Wider issue → more instruction feed
• Major challenge: to fetch more than one non-contiguous basic block per cycle
• Enabling techniques?
– Predication
– Branch alignment based on profiling
– Other hardware solutions (branch prediction is a given)
Predication Example
• Convert control dependency into data dependency
• Enlarge basic block size
– More room for scheduling
– No fetch disruption
Source code:
    if (a[i+1] > a[i])
        a[i+1] = 0
    else
        a[i] = 0
Typical assembly (r1 = &a[i], r0 = 0):
    lw  r2, [r1+4]        # r2 = a[i+1]
    lw  r3, [r1]          # r3 = a[i]
    blt r3, r2, L1        # if a[i] < a[i+1] goto L1
    sw  r0, [r1]          # else path: a[i] = 0
    j   L2
L1:
    sw  r0, [r1+4]        # then path: a[i+1] = 0
L2:
Assembly w/ predication:
    lw  r2, [r1+4]        # r2 = a[i+1]
    lw  r3, [r1]          # r3 = a[i]
    sgt p4, r2, r3        # p4 = (a[i+1] > a[i])
    (p4)  sw r0, [r1+4]   # takes effect only if p4 is true
    (!p4) sw r0, [r1]     # takes effect only if p4 is false
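To make the control-to-data-dependency conversion concrete, here is a sketch that mirrors the two assembly versions in Python. The function names `branchy` and `predicated` are hypothetical, and Python's conditional expression only stands in for a predicated store; real predication is an ISA feature, not something this code demonstrates at the hardware level.

```python
def branchy(a, i):
    """Control-dependent version: which store runs depends on a branch."""
    if a[i + 1] > a[i]:
        a[i + 1] = 0
    else:
        a[i] = 0
    return a


def predicated(a, i):
    """Data-dependent version: both stores are issued in one straight-line
    block, and the predicate p4 decides which one takes effect."""
    p4 = a[i + 1] > a[i]                    # sgt p4, r2, r3
    a[i + 1] = 0 if p4 else a[i + 1]        # (p4)  sw r0, [r1+4]
    a[i] = 0 if not p4 else a[i]            # (!p4) sw r0, [r1]
    return a
```

Both functions leave the array in the same state for any input; the second has no branch between the compare and the stores, which is the property that avoids fetch disruption.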
Collapsing Buffer [ISCA 95]
• To fetch multiple (often non-contiguous) instructions
• Use interleaved BTB to enable multiple branch predictions
• Align instructions in the predicted sequential order
• Use banked I-cache for multiple line access
Collapsing Buffer
[Figure: the fetch PC indexes an interleaved BTB and two I-cache banks (Cache Bank 1, Cache Bank 2); the bank outputs pass through an interchange switch and then a collapsing circuit]
Collapsing Buffer Mechanism
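• Interleaved BTB: identifies the two lines to fetch, starting at A and at E
• Bank routing: sends the two line addresses to the cache banks, which return (E F G H) and (A B C D)
• Interchange switch: puts the lines in fetch order, giving A B C D E F G H
• Collapsing circuit: uses the valid instruction bits to squeeze out D, F, H, leaving A B C E G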
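The interchange and collapse steps for the A..H example can be sketched as below. The function names and the dictionary used to model bank outputs are hypothetical; the valid bits are the ones implied by the example (D, F, H dropped).

```python
def interchange(bank_outputs, fetch_order):
    """Concatenate the lines read from the banks in predicted fetch order."""
    merged = []
    for first_inst in fetch_order:
        merged.extend(bank_outputs[first_inst])
    return merged


def collapse(instructions, valid_bits):
    """Drop instructions whose valid bit is 0, keeping the rest in order."""
    return [inst for inst, valid in zip(instructions, valid_bits) if valid]


# Bank outputs keyed by the first instruction of each line (illustrative).
banks = {"E": list("EFGH"), "A": list("ABCD")}
ordered = interchange(banks, ["A", "E"])
fetched = collapse(ordered, [1, 1, 1, 0, 1, 0, 1, 0])
```

Here `ordered` is A B C D E F G H and `fetched` is A B C E G, the same sequence the collapsing circuit delivers in the example.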
High Bandwidth Instruction Fetching
[Figure: control-flow graph of basic blocks BB1-BB7]
• To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
• Multiple branch predictions
Multiple Branch Predictor [Yeh, Marr, Patt ICS’93]
• Pattern History Table (PHT) design to support MBP
• Based on global history only
[Figure: the Branch History Register (BHR, bits bk ... b1) indexes the Pattern History Table (PHT), which supplies the primary, secondary, and tertiary predictions; p1 and p2 feed the selection of the later predictions, and the predictions update the BHR]
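One way to read the figure is that each later prediction is looked up with a history that has been speculatively extended by the earlier predictions. The sketch below models that logically; it assumes a k-bit global BHR and a PHT of 2-bit saturating counters (both assumptions, as are all names), and it does a sequential lookup where hardware would read the candidate entries in parallel and select among them.

```python
def predict_three(bhr, pht, k):
    """Return (primary, secondary, tertiary) predictions, 1 = taken.

    bhr: global branch history as an integer (newest outcome in bit 0)
    pht: list of 2**k two-bit counters (0..3); values 2 and 3 mean taken
    """
    mask = (1 << k) - 1

    def taken(index):
        return 1 if pht[index] >= 2 else 0

    p1 = taken(bhr & mask)
    # Secondary lookup: drop the oldest history bit and append p1.
    history2 = ((bhr << 1) | p1) & mask
    p2 = taken(history2)
    # Tertiary lookup: append p2 as well.
    history3 = ((history2 << 1) | p2) & mask
    p3 = taken(history3)
    return p1, p2, p3
```

With k = 3, a BHR of 101, and strongly-taken counters at PHT entries 101 and 011 (all others at 0), the three lookups hit entries 101, 011, and 111 in turn.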
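Multiple Branch Prediction
• Fetch address could be retrieved from BTB
• Predicted path: BB1 → BB2 → BB5
• How to fetch BB2 and BB5? BTB?
– Can’t. Branch PCs of br1 and br2 are not available when the MBP is made
– Use a BAC design
[Figure: binary tree of basic blocks. The BTB entry supplies the fetch address of BB1 (br0, primary prediction). BB1 ends in br1: taken (T, 2nd prediction) leads to BB2, not taken (F) leads to BB3. BB2 ends in br2: taken (T) leads to BB4, not taken (F, 3rd prediction) leads to BB5. BB3 leads to BB6 (T) or BB7 (F).]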
Branch Address Cache
• Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions
• br: 2 bits for branch type (cond, uncond, return)
• V: single valid bit (to indicate if hits a branch in the sequence)
• Entry layout, indexed by the fetch address (from BTB): 212 bits per fetch address entry
– Tag: 23 bits, plus V and br (3 bits)
– Level 1: Taken target address and Not-taken target address, 30 bits each, each with V and br (3 bits)
– Level 2: T-T, T-N, N-T, N-N addresses, 30 bits each
– 212 bits = (23+3)*1 + (30+3)*2 + 30*4
• To make one more level of prediction
– Need to cache another 8 addresses (i.e. total = 14 addresses)
– 464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8
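The 212-bit and 464-bit entry sizes follow one pattern, which this short sketch spells out. The field widths come from the slide; the function name and the `levels` parameter are hypothetical, and the rule that the deepest level carries no V/br bits is inferred from the slide's formulas.

```python
TAG_BITS = 23
ADDR_BITS = 30
V_BR_BITS = 3   # V (1 bit) + br (2 bits)


def bac_entry_bits(levels):
    """Bits per BAC entry when it covers `levels` predictions beyond the
    primary one (2 on the slide, 3 for one more level)."""
    bits = TAG_BITS + V_BR_BITS            # tag plus V/br of the first branch
    for level in range(1, levels + 1):
        addresses = 2 ** level             # 2, 4, 8 possible fetch addresses
        # Only addresses that lead to a further prediction need V/br bits.
        extra = V_BR_BITS if level < levels else 0
        bits += addresses * (ADDR_BITS + extra)
    return bits
```

`bac_entry_bits(2)` reproduces the 212 bits per entry and `bac_entry_bits(3)` the 464 bits, which shows how quickly the entry grows with each extra prediction level.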
Caching Non-Consecutive Basic Blocks
• High Fetch Bandwidth + Low Latency
[Figure: basic blocks BB1-BB5 scattered across a conventional instruction cache ("Fetch in Conventional Instruction Cache") versus the same blocks stored back to back ("Fetch in Linear Memory Location")]
Trace Cache
• Cache dynamic non-contiguous instructions (traces)
• Cross multiple basic blocks
• Need to predict multiple branches (MBP)
[Figure: fetching the dynamic instruction sequence A B C D E F G H I J, whose blocks are non-contiguous in the I$]
– I$ fetch: 5 cycles (A B | C | D | E F G | H I J)
– Collapsing buffer fetch: 3 cycles (A B C | D E F G | H I J)
– Trace cache (T$) fetch: 1 cycle (A B C D E F G H I J)
Trace Cache [Rotenberg, Bennett, Smith MICRO’96]
• Cache at most (in original paper)
– M branches OR (M = 3 in all follow-up TC studies due to MBP)
– N instructions (N = 16 in all follow-up TC studies)
• Fall-thru address if last branch is predicted not taken
• Trace cache line fields: Tag, Br flag, Br mask, Fall-thru Address, Taken Address
• Example line holding BB1, BB2, BB3 (Branch 1, Branch 2, Branch 3)
– Br flag = 10: 1st branch taken, 2nd branch not taken
– Br mask = 11, 1: "11" means 3 branches; "1" means the trace ends with a branch
[Figure: the fetch address indexes the trace cache while the MBP supplies predictions; on a T.C. hit the line delivers up to N instructions and M branches; on a T.C. miss the line fill buffer builds the trace]
Trace Hit Logic
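• Example line: Tag = A, BF (branch flags) = 10, Mask = 11,1, Fall-thru = X, Target = Y; fetch address = A
• Match 1st block: compare the fetch address with the tag
• Match remaining block(s): compare the multiple-branch predictions (T, N) with the branch flags
• Trace hit = (match 1st block) AND (match remaining block(s))
• Next fetch address: the target address if the last branch is predicted taken, otherwise the fall-thru address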
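The hit condition and next-address selection can be sketched with a small data structure. Field and function names are hypothetical, addresses are plain strings for readability, and the line layout is simplified to the fields named on the slide.

```python
from dataclasses import dataclass


@dataclass
class TraceLine:
    tag: str                 # fetch address of the first block
    branch_flags: tuple      # recorded outcomes of the non-final branches
    ends_in_branch: bool     # last bit of the branch mask
    fall_thru: str           # next address if the last branch is not taken
    target: str              # next address if the last branch is taken


def trace_lookup(line, fetch_addr, predictions):
    """Return (hit, next_fetch_address); predictions use 1 = taken."""
    match_first = line.tag == fetch_addr
    inner = len(line.branch_flags)
    match_rest = tuple(predictions[:inner]) == line.branch_flags
    if not (match_first and match_rest):
        return False, None
    # The prediction for the final branch picks the next fetch address.
    if line.ends_in_branch and predictions[inner]:
        return True, line.target
    return True, line.fall_thru


example = TraceLine(tag="A", branch_flags=(1, 0), ends_in_branch=True,
                    fall_thru="X", target="Y")
```

With the example line, predictions (T, N, T) hit and continue at Y, predictions (T, N, N) hit and continue at X, and predictions (T, T, N) miss because the second prediction disagrees with the stored flag.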
Trace Cache Example
• Control flow: A (5 insts) branches to B (6 insts) or C (12 insts); both continue to D (4 insts); D loops back to A or goes to Exit
• BB Traversal Path: ABDABDACDABDACDABDAC
• Trace Cache (5 lines), 16 instructions per line
• A trace ends when any of these holds
– Cond 1: 3 branches
– Cond 2: Fill a trace cache line
– Cond 3: Exit
• Contents when the Trace Cache is full
– Line 1: A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4
– Line 2: A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
– Line 3: C12 D1 D2 D3 D4 A1 A2 A3 A4 A5
– Line 4: B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5
– Line 5: C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4
• How many hits?
• What is the utilization?
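The trace lines in this example can be reproduced by cutting the dynamic instruction stream whenever a trace reaches 3 branches or 16 instructions. The sketch below does that under simplifying assumptions that are mine, not the slides': every basic block ends in a branch at its last instruction, prediction is perfect, and each new trace starts right where the previous one ended. Function names are hypothetical.

```python
def build_traces(path, sizes, max_insts=16, max_branches=3):
    """Cut the dynamic stream for `path` into traces.

    path:  string of basic-block names, e.g. "ABDAC"
    sizes: dict mapping block name to instruction count
    """
    traces, current, branches = [], [], 0
    for block in path:
        for i in range(1, sizes[block] + 1):
            current.append(f"{block}{i}")
            if i == sizes[block]:          # assumed: block ends in a branch
                branches += 1
            if len(current) == max_insts or branches == max_branches:
                traces.append(current)
                current, branches = [], 0
    if current:                            # stream ended (Exit)
        traces.append(current)
    return traces


def distinct(traces):
    """Traces in first-seen order with repeats removed."""
    seen = []
    for trace in traces:
        if trace not in seen:
            seen.append(trace)
    return seen
```

For the traversal path ABDABDACDABDACDABDAC with sizes A=5, B=6, C=12, D=4, the first five distinct traces are the five lines listed above; a repeated trace in the `build_traces` output marks a fetch that could be served from an already-filled line.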
Redundancy
• Duplication
– Note that instructions only appear once in I-Cache
– Same instruction appears many times in TC
• Fragmentation
– If 3 BBs < 16 instructions
– If multiple-target branch (e.g. return, indirect jump or trap) is encountered, stop “trace construction”.
– Empty slots ⇒ wasted resources
• Example
– A loop of basic blocks A (6 insts), B (4 insts), C (6 insts), D (3 insts) is broken up into traces (ABC), (BCD), (CDA), (DAB)
– Duplicating each instruction 3 times
– (ABC) = 16 inst, (BCD) = 13 inst, (CDA) = 15 inst, (DAB) = 13 inst
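The duplication numbers in the example can be checked with a few lines. This sketch assumes the four blocks form a loop A, B, C, D with sizes 6, 4, 6, 3 (the sizes that satisfy the four trace totals) and that every trace holds exactly 3 blocks; the function name is hypothetical.

```python
def loop_traces(order, sizes, blocks_per_trace=3):
    """Cut repeated iterations of a loop into traces of a fixed number of
    blocks and return {trace blocks: instruction count} for each distinct
    trace, in the order the traces are first built."""
    n = len(order)
    traces = {}
    pos = 0
    while True:
        blocks = tuple(order[(pos + j) % n] for j in range(blocks_per_trace))
        if blocks in traces:       # the pattern repeats from here on
            return traces
        traces[blocks] = sum(sizes[b] for b in blocks)
        pos += blocks_per_trace


sizes = {"A": 6, "B": 4, "C": 6, "D": 3}
traces = loop_traces("ABCD", sizes)
copies = {b: sum(b in t for t in traces) for b in sizes}
```

The result contains the four traces ABC, DAB, CDA, BCD with 16, 13, 15, and 13 instructions, and every block appears in 3 of them, so 19 unique instructions occupy 57 trace cache slots.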
Indexability
• TC saved traces (EAC) and (BCD)
• Path: (EAC) to (D)
– Cannot index interior block (D)
• Can cause duplication
• Need partial matching
– (BCD) is cached, if (BC) is needed
[Figure: control-flow graph over basic blocks A, B, C, D, E, G next to a trace cache holding the traces (E A C) and (B C D)]
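Partial matching can be pictured as delivering the longest prefix of a cached trace that agrees with the current predictions. The sketch below is one hypothetical formulation; the branch outcomes stored for trace (BCD) are made up for illustration, since the slide does not give them.

```python
def partial_match(trace_blocks, branch_flags, predictions):
    """Return the leading blocks of a cached trace that the predictions
    allow us to use.

    trace_blocks: blocks in the trace, e.g. ["B", "C", "D"]
    branch_flags: branch_flags[i] is the recorded outcome (1 = taken) of the
                  branch ending trace_blocks[i], which led to block i+1
    predictions:  current predictions for those branches, in order
    """
    matched = [trace_blocks[0]]            # the tag already matched block 0
    for i, flag in enumerate(branch_flags):
        if i >= len(predictions) or predictions[i] != flag:
            break                          # diverges here: stop delivering
        matched.append(trace_blocks[i + 1])
    return matched


# Illustrative outcomes only: B -> C recorded as taken, C -> D as taken.
cached_blocks, cached_flags = ["B", "C", "D"], (1, 1)
```

If the predictor says taken then not taken, the lookup still supplies B and C from the cached (BCD) trace instead of treating the access as a full miss.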
Pentium 4 (NetBurst) Trace Cache
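• No I$ !!
• The trace cache holds decoded instructions
• Trace-based prediction (predict next-trace, not next-PC)
[Figure: the front-end BTB, iTLB and prefetcher fetch from the L2 cache into the decoder; decoded instructions fill the trace cache (Trace $), which has its own Trace $ BTB and feeds rename, execute, etc.]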
