Topics Left
• Superscalar machines
• IA64 / EPIC architecture
• Multithreading (explicit and implicit)
• Multicore Machines
• Clusters
• Parallel Processors
• Hardware implementation vs microprogramming
Chapter 14
Superscalar Processors
• Definition of Superscalar
• Design Issues:
- Instruction Issue Policy
- Register renaming
- Machine parallelism
- Branch Prediction
- Execution
• Pentium 4 example
What is Superscalar?
• “Common” instructions (arithmetic, load/store, conditional branch)
can be executed independently.
• Equally applicable to RISC & CISC, but more straightforward in
RISC machines.
• The order of execution is usually assisted by the compiler.
A Superscalar machine executes multiple independent
instructions in parallel.
They are pipelined as well.
Example of Superscalar Organization
• 2 Integer ALU pipelines,
• 2 FP ALU pipelines,
• 1 memory pipeline (?)
Superscalar v Superpipelined
Limitations of Superscalar
• Dependent upon:
- Instruction level parallelism possible
- Compiler based optimization
- Hardware support
• Limited by
— Data dependency
— Procedural dependency
— Resource conflicts
(Recall) True Data Dependency
(Must W before R)
ADD r1, r2     r1 + r2 → r1
MOVE r3, r1    r1 → r3
• Can fetch and decode second instruction in parallel with
first
LOAD r1, X     X (memory) → r1
MOVE r3, r1    r1 → r3
• Can NOT execute second instruction until first is
finished
Second instruction is dependent on first (R after W)
(recall) Antidependency (Must R before W)
ADD R4, R3, 1    R3 + 1 → R4
ADD R3, R5, 1    R5 + 1 → R3
• Cannot complete the second instruction before the first has
read R3
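To make the recalled hazard types concrete, here is a minimal Python sketch (my own illustration, not from the slides) that classifies the dependency between an older and a younger instruction by comparing the register sets they read and write.

```python
# Hypothetical illustration: classify the hazard between two instructions,
# each described by the sets of registers it writes and reads.

def classify_dependency(older, younger):
    """Return the hazard types between an older and a younger instruction."""
    hazards = []
    if older["writes"] & younger["reads"]:
        hazards.append("true data (RAW): must Write before Read")
    if older["reads"] & younger["writes"]:
        hazards.append("anti (WAR): must Read before Write")
    if older["writes"] & younger["writes"]:
        hazards.append("output (WAW): writes must complete in order")
    return hazards or ["none"]

# LOAD/MOVE pair from the true-dependency slide: MOVE reads r1, which LOAD writes.
load = {"writes": {"r1"}, "reads": set()}
move = {"writes": {"r3"}, "reads": {"r1"}}
print(classify_dependency(load, move))   # ['true data (RAW): must Write before Read']

# ADD pair from this slide: the second ADD writes R3, which the first still reads.
i1 = {"writes": {"R4"}, "reads": {"R3"}}
i2 = {"writes": {"R3"}, "reads": {"R5"}}
print(classify_dependency(i1, i2))       # ['anti (WAR): must Read before Write']
```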
(Recall) Procedural Dependency
• Can’t execute instructions after a branch in parallel
with instructions before a branch, because?
Note: Also, if instruction length is not fixed,
instructions have to be decoded to find out how many
fetches are needed
(recall) Resource Conflict
• Two or more instructions requiring access to the
same resource at the same time
— e.g. two arithmetic instructions need the ALU
• Solution - Can possibly duplicate resources
— e.g. have two arithmetic units
Effect of Dependencies on Superscalar Operation
Notes:
1) Superscalar operation is doubly impacted by a stall.
2) CISC machines typically have different length instructions and need to be at least
partially decoded before the next can be fetched – not good for superscalar operation
Instruction-level Parallelism – degree of
• Consider:
LOAD R1, R2
ADD R3, 1
ADD R4, R2
These can be handled in parallel.
• Consider:
ADD R3, 1
ADD R4, R3
STO (R4), R0
These cannot be handled in parallel.
The “degree” of instruction-level parallelism is determined by the
number of instructions that can be executed in parallel without
stalling for dependencies
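As a rough way to make this "degree" measurable, the following sketch (my own, using simplified instruction descriptions) counts how many instructions at the front of a sequence can issue together before one of them depends on another in the same group.

```python
# Hypothetical sketch: how many instructions at the front of a sequence can
# issue in the same cycle? An instruction joins the group only if it does not
# read or write a register already written, or write one already read,
# by an earlier member of the group.

def parallel_group_size(instrs):
    written, read = set(), set()
    for i, (dst, srcs) in enumerate(instrs):
        srcs = set(srcs)
        if (srcs & written) or (dst in written) or (dst in read):
            return i                       # dependency within the group -> stop here
        written.add(dst)
        read |= srcs
    return len(instrs)

# First example from the slide as (destination, sources), implicit reads included:
seq1 = [("R1", ["R2"]), ("R3", ["R3"]), ("R4", ["R4", "R2"])]
# Second example: each instruction needs the one before it.
seq2 = [("R3", ["R3"]), ("R4", ["R4", "R3"]), ("MEM", ["R4", "R0"])]

print(parallel_group_size(seq1))   # 3 -> all three can be handled in parallel
print(parallel_group_size(seq2))   # 1 -> they must go one at a time
```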
Instruction Issue Policies
• Order in which instructions are fetched
• Order in which instructions are executed
• Order in which instructions update registers and
memory values (order of completion)
Standard Categories:
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion
In-Order Issue -- In-Order Completion
Issue instructions in the order they occur:
• Not very efficient
• Instructions must stall if necessary (and stalling in
superpipelining is expensive)
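As a rough illustration of why this policy wastes issue slots (a simplified sketch of my own, not a model of any real machine), a two-wide in-order pipeline holds back the younger instruction of each issue pair for as long as the older one is still executing.

```python
# Hypothetical two-way in-order issue / in-order completion sketch.
# Instructions are (name, execute_cycles); a pair occupies the execute stage
# until its slowest member finishes, so one slow instruction stalls its partner.

def in_order_schedule(instrs, width=2):
    cycle, schedule = 0, []
    for i in range(0, len(instrs), width):
        group = instrs[i:i + width]
        schedule.append((cycle, [name for name, _ in group]))
        cycle += max(cycles for _, cycles in group)   # whole pair waits for the slowest
    return schedule, cycle

prog = [("I1", 2), ("I2", 1), ("I3", 1), ("I4", 1)]
print(in_order_schedule(prog))
# ([(0, ['I1', 'I2']), (2, ['I3', 'I4'])], 3)  -- I2's result is held back behind I1
```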
In-Order Issue -- In-Order Completion
(Example)
Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
In-Order Issue -- Out-of-Order Completion
(Example)
How does this affect interrupts?
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
Out-of-Order Issue -- Out-of-Order Completion
• Decouple decode pipeline from execution pipeline
• Can continue to fetch and decode until the “window”
is full
• When a functional unit becomes available, an instruction
can be executed (usually keeping as close to program
order as possible)
• Since instructions have been decoded, processor can
look ahead
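Here is a minimal sketch of the window idea (my own illustration with made-up structures): decoded instructions wait in a window, and each cycle any entry whose operands are ready is issued to a free functional unit, regardless of program order.

```python
# Hypothetical instruction-window sketch: out-of-order issue / completion.
# Each window entry: (name, destination, source registers, functional unit).

def window_issue(program, units, ready_regs):
    window = list(program)
    cycle = 0
    while window:
        free = dict(units)                     # per-cycle functional-unit availability
        issued, produced = [], []
        for entry in list(window):
            name, dest, srcs, unit = entry
            if all(r in ready_regs for r in srcs) and free.get(unit, 0) > 0:
                free[unit] -= 1
                issued.append(name)
                produced.append(dest)
                window.remove(entry)
        ready_regs.update(produced)            # simplification: results ready next cycle
        print(f"cycle {cycle}: issued {issued or 'nothing'}")
        cycle += 1

program = [
    ("I1", "R1", [],     "alu"),
    ("I2", "R2", ["R1"], "alu"),   # must wait for I1's result
    ("I3", "R3", [],     "mem"),   # independent, so it overtakes I2
]
window_issue(program, units={"alu": 1, "mem": 1}, ready_regs=set())
# cycle 0: issued ['I1', 'I3']
# cycle 1: issued ['I2']
```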
Out-of-Order Issue -- Out-of-Order Completion
(Example)
Note: I5 depends upon I4, but I6 does not
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
Register Renaming
to avoid hazards
• Output and antidependencies occur because register
contents may not reflect the correct ordering from the
program
• Can require a pipeline stall
• One solution: Allocate Registers dynamically
(renaming registers)
Register Renaming example
Add R3, R3, R5 R3b:=R3a + R5a (I1)
Add R4, R3, 1 R4b:=R3b + 1 (I2)
Add R3, R5, 1 R3c:=R5a + 1 (I3)
Add R7, R3, R4 R7b:=R3c + R4b (I4)
• Without a subscript: the logical register named in the
instruction
• With a subscript: the hardware register actually allocated:
R3a R3b R3c
Note: R3c avoids the antidependency on I2 and the
output dependency with I1
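The renaming on this slide can be reproduced with a small sketch (my own illustration; the data structures are made up): every write to a logical register allocates the next hardware instance, and readers pick up whichever instance is currently mapped. Immediate operands such as the "+ 1" are omitted for brevity.

```python
# Hypothetical register-renaming sketch reproducing the slide's example.
# Subscript "a" denotes a register's value on entry; each write allocates
# the next instance (b, c, ...).
import string
from collections import defaultdict

def rename(instrs):
    version = defaultdict(int)     # logical register -> number of instances so far
    mapping = {}                   # logical register -> current renamed instance

    def current(reg):
        if reg not in mapping:     # first read: use the "a" (entry) instance
            mapping[reg], version[reg] = reg + "a", 1
        return mapping[reg]

    renamed = []
    for dest, srcs in instrs:
        src_names = [current(s) for s in srcs]     # read current instances first
        if version[dest] == 0:
            version[dest] = 1                      # the "a" instance existed on entry
        version[dest] += 1
        mapping[dest] = dest + string.ascii_lowercase[version[dest] - 1]
        renamed.append((mapping[dest], src_names))
    return renamed

# The four ADDs from the slide, as (destination, source registers):
prog = [("R3", ["R3", "R5"]), ("R4", ["R3"]), ("R3", ["R5"]), ("R7", ["R3", "R4"])]
for dest, srcs in rename(prog):
    print(dest, ":=", " + ".join(srcs))
# R3b := R3a + R5a
# R4b := R3b
# R3c := R5a
# R7b := R3c + R4b
```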
Recapping: Machine Parallelism Support
• Duplication of Resources
• Out of order issue hardware
• Windowing to decouple execution from decode
• Register Renaming capability
Speedups of Machine Organizations
(Without Procedural Dependencies)
• Not worth duplication of functional units without register renaming
• Need instruction window large enough (more than 8, probably not more than 32)
Branch Prediction in Superscalar Machines
• Delayed branch not used much. Why?
Multiple instructions need to execute in the delay slot.
This leads to much complexity in recovery.
• Branch prediction should be used - Branch history is
very useful
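One common way to use branch history is a saturating counter per branch; below is a generic 2-bit sketch of my own (the Pentium 4 discussed later keeps its own 4-bit history in the trace-cache BTB). The prediction only flips after two mispredictions in a row.

```python
# Hypothetical 2-bit saturating-counter branch predictor (generic illustration,
# not the Pentium 4's exact mechanism). Counter values 0-1 predict not taken,
# 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}                         # branch address -> counter (0..3)

    def predict(self, addr):
        return self.counters.get(addr, 1) >= 2     # True means "predict taken"

    def update(self, addr, taken):
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)

bp = TwoBitPredictor()
outcomes = [True, True, False, True, True]         # a mostly-taken loop branch
correct = 0
for taken in outcomes:
    correct += (bp.predict(0x400) == taken)
    bp.update(0x400, taken)
print(f"{correct}/{len(outcomes)} predictions correct")   # 3/5
```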
View of Superscalar Execution
Committing or Retiring Instructions
Results need to be put into order (commit or retire)
• Results sometimes must be held in temporary storage
until it is certain they can be placed in “permanent”
storage.
(either committed or retired/flushed)
• Temporary storage requires regular clean up –
overhead – done in hardware.
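A minimal sketch of that commit/retire step (my own simplified model): finished results wait in a reorder buffer and are written to permanent state only when they reach the head in program order; a wrong-path entry at the head causes everything younger to be flushed instead.

```python
# Hypothetical reorder-buffer retire loop: results leave temporary storage in
# program order, or are flushed when an older entry turns out to be wrong-path.
from collections import deque

def retire(rob, registers):
    while rob:
        head = rob[0]                      # only the oldest entry may retire
        if not head["done"]:
            break                          # head still executing: younger results wait
        if head.get("wrong_path"):
            rob.clear()                    # mis-speculated: flush all temporary results
            break
        registers[head["dest"]] = head["value"]   # commit to "permanent" storage
        rob.popleft()

regs = {}
rob = deque([
    {"dest": "R1", "value": 5, "done": True},
    {"dest": "R2", "value": 7, "done": True},
    {"dest": "R3", "value": 9, "done": False},    # still executing, blocks retirement
])
retire(rob, regs)
print(regs, len(rob))   # {'R1': 5, 'R2': 7} 1
```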
Superscalar Hardware Support
• Facilities to simultaneously fetch multiple
instructions
• Logic to determine true dependencies involving
register values, and mechanisms to communicate
these values
• Mechanisms to initiate multiple instructions in
parallel
• Resources for parallel execution of multiple
instructions
• Mechanisms for committing process state in correct order
Example: Pentium 4
A Superscalar CISC Machine
Pentium 4 alternate view
Pentium 4 pipeline
20 stages!
a) Generation of Micro-ops (stages 1 & 2)
• Using the Branch Target Buffer and Instruction Translation
Lookaside Buffer, the x86 instructions are fetched 64 bytes at a
time from the L2 cache
• The instruction boundaries are determined and instructions
decoded into 1-4 118-bit RISC micro-ops
• Micro-ops are stored in the trace cache
b) Trace cache next instruction pointer (stage 3)
• The Trace Cache Branch Target Buffer contains dynamically
gathered history information (4-bit tag)
• If the target is not in the BTB:
- Branch not PC-relative: predict taken if it is a return,
predict not taken otherwise
- PC-relative backward conditional branches: predict taken,
otherwise predict not taken
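The fallback rules on this slide translate directly into a small predicate; the sketch below is my own encoding with made-up field names, not Intel's interface.

```python
# Hypothetical encoding of the static-prediction rules used when a branch has
# no BTB entry. Field names are illustrative only.

def static_predict(branch):
    if not branch["pc_relative"]:
        return branch["is_return"]     # non-PC-relative: taken only if it is a return
    return branch["backward"]          # PC-relative: backward (loop-like) -> taken

loop_branch = {"pc_relative": True,  "backward": True,  "is_return": False}
fwd_branch  = {"pc_relative": True,  "backward": False, "is_return": False}
ret_branch  = {"pc_relative": False, "backward": False, "is_return": True}
print(static_predict(loop_branch), static_predict(fwd_branch), static_predict(ret_branch))
# True False True
```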
c) Trace Cache fetch (stage 4)
• Orders micro-ops in program-ordered sequences called traces
• These are fetched in order, subject to branch prediction
• Some instructions require many micro-ops (complex CISC instructions).
These are encoded in the microcode ROM and fetched from there
d) Drive (stage 5)
• Delivers instructions from the Trace Cache to the
Rename/Allocator module for reordering
e) Allocate: register renaming (stages 6, 7, & 8)
• Allocates resources for execution (3 micro-ops arrive per clock cycle):
- Each micro-op is allocated a slot in the 126-position circular Reorder Buffer (ROB), which
tracks progress of the micro-ops.
Buffer entries include:
- State: scheduled, dispatched, completed, ready for retire
- Address that generated the micro-op
- Operation
- An alias register assigned for one of the 16 architectural registers (128 alias registers),
to remove data dependencies
• The micro-ops are dispatched out of order as resources are available
• Allocates an entry to one of the 2 scheduler queues - memory access or not
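To make the allocation step concrete, here is a rough sketch (my own simplified data layout; only the sizes follow the slide) that gives each incoming micro-op a ROB slot, an alias register for its destination, and a place in one of the two scheduler queues.

```python
# Hypothetical allocation sketch: one ROB slot, one alias (rename) register and
# one scheduler-queue entry per micro-op. Sizes follow the slide (126 ROB
# entries, 128 alias registers); the record layout is made up for illustration.

ROB_SIZE, ALIAS_REGS = 126, 128

rob = [None] * ROB_SIZE
rob_tail = 0
free_alias = list(range(ALIAS_REGS))
mem_queue, alu_queue = [], []

def allocate(micro_op):
    global rob_tail
    entry = {
        "state": "scheduled",            # scheduled / dispatched / completed / ready to retire
        "address": micro_op["pc"],       # address that generated the micro-op
        "operation": micro_op["op"],
        "alias_reg": free_alias.pop(),   # rename the architectural destination register
    }
    rob[rob_tail % ROB_SIZE] = entry
    rob_tail += 1
    (mem_queue if micro_op["is_mem"] else alu_queue).append(entry)
    return entry

allocate({"pc": 0x1000, "op": "load", "is_mem": True})
allocate({"pc": 0x1004, "op": "add",  "is_mem": False})
print(len(mem_queue), len(alu_queue))    # 1 1
```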
f) Micro-op queuing (stage 9)
• Micro-ops are loaded into one of 2 queues:
- one for memory operations
- one for non memory operations
• Each queue operates on a FIFO policy
g) Micro-op scheduling
(stages 10, 11, & 12)
• The 2 schedulers retrieve micro-ops based upon having all
the operands ready and dispatch them to an available unit (up
to 6 per clock cycle)
• If two micro-ops need the same unit, they are dispatched
in sequence.
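A sketch of that scheduling rule (my own simplified model): each cycle, pick micro-ops whose operands are all ready, never more than six, and when two want the same unit the second simply waits for a later cycle.

```python
# Hypothetical one-cycle scheduler: dispatch micro-ops whose operands are all
# ready, at most 6 per cycle and at most one per functional unit per cycle.

def schedule_cycle(queue, ready_operands, max_dispatch=6):
    dispatched, busy_units = [], set()
    for uop in list(queue):
        if len(dispatched) == max_dispatch:
            break
        if set(uop["operands"]) <= ready_operands and uop["unit"] not in busy_units:
            busy_units.add(uop["unit"])          # same unit wanted twice -> second waits
            dispatched.append(uop["name"])
            queue.remove(uop)
    return dispatched

q = [
    {"name": "u1", "operands": ["R1"], "unit": "alu0"},
    {"name": "u2", "operands": ["R2"], "unit": "alu0"},   # conflicts with u1 for alu0
    {"name": "u3", "operands": ["R9"], "unit": "fp"},     # operand R9 not ready yet
]
print(schedule_cycle(q, ready_operands={"R1", "R2"}))   # ['u1']
print(schedule_cycle(q, ready_operands={"R1", "R2"}))   # ['u2']
```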
h) Dispatch
(stages 13 & 14)
i) Register file
(stages 15 & 16)
j) Execute: flags
(stages 17 & 18)
• The register files are the sources for pending fixed-point and FP
operations
• A separate stage is used to compute the flags
k) Branch check
(stage 19)
l) Branch check results
(stage 20)
• Checks flags and compares results with predictions
• If the branch prediction was wrong:
- all incorrect micro-ops must be flushed (don’t want to be wrong!)
- the correct branch destination is provided to the Branch Predictor
- the pipeline is restarted from the new target address