Assembly Language: IA-64
Motaz K. Saad, Dept. of CS
Spring 2007
Background to IA-64
  • Pentium 4 appears to be the last in the x86 line
  • Intel & Hewlett-Packard (HP) jointly developed a new architecture
      - 64 bit architecture
      - Not an extension of x86
      - Not an adaptation of HP's 64 bit RISC architecture
  • Exploits vast circuitry and high speeds
  • Systematic use of parallelism
  • Departure from superscalar
Motivation
  • Instruction level parallelism
      - Explicit in machine instruction
      - Not determined at run time by processor
  • Long or very long instruction words (LIW/VLIW)
  • Branch predication (not the same as branch prediction)
  • Speculative loading
  • Intel & HP call this Explicit Parallel Instruction Computing (EPIC)
  • IA-64 is an instruction set architecture intended for implementation on EPIC
  • Itanium is the first Intel product
Superscalar v IA-64
Why New Architecture?
  • Not hardware compatible with x86
  • Now have tens of millions of transistors available on chip
  • Could build bigger cache
      - Diminishing returns
  • Add more execution units
      - Increase superscaling
      - "Complexity wall"
      - More units make processor "wider"
      - More logic needed to orchestrate
      - Improved branch prediction required
      - Longer pipelines required
      - Greater penalty for misprediction
      - Larger number of renaming registers required
      - At most six instructions per cycle
Explicit Parallelism
  • Instruction parallelism scheduled at compile time
      - Included with machine instruction
  • Processor uses this info to perform parallel execution
  • Requires less complex circuitry
  • Compiler has much more time to determine possible parallel operations
  • Compiler sees whole program
General Organization
Key Features
  • Large number of registers
      - IA-64 instruction format assumes 256
          - 128 * 64 bit integer, logical & general purpose
          - 128 * 82 bit floating point and graphic
      - 64 * 1 bit predicated execution registers (see later)
      - To support high degree of parallelism
  • Multiple execution units
      - Expected to be 8 or more
      - Depends on number of transistors available
      - Execution of parallel instructions depends on hardware available
          - 8 parallel instructions may be split into two lots of four if only four execution units are available
IA-64 Execution Units
  • I-Unit
      - Integer arithmetic
      - Shift and add
      - Logical
      - Compare
      - Integer multimedia ops
  • M-Unit
      - Load and store between register and memory
      - Some integer ALU
  • B-Unit
      - Branch instructions
  • F-Unit
      - Floating point instructions
Instruction Format Diagram
Instruction Format
  • 128 bit bundle
      - Holds three instructions (syllables) plus template
      - Can fetch one or more bundles at a time
      - Template contains info on which instructions can be executed in parallel
          - Not confined to single bundle
          - e.g. a stream of 8 instructions may be executed in parallel
          - Compiler will have re-ordered instructions to form contiguous bundles
          - Can mix dependent and independent instructions in same bundle
  • Instruction is 41 bits long
      - More registers than usual RISC
      - Predicated execution registers (see later)
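
The bundle arithmetic above (three 41 bit instructions leave 128 - 3 * 41 = 5 bits for the template) can be sketched in Python. This is an illustration only: the helper names are hypothetical, and placing the template in the low-order bits with slot 0 directly above it is an assumption the slide does not state.

```python
# Sketch only: helper names are hypothetical, and the bit placement
# (template in the low 5 bits, slots above it) is an assumption.
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5 bit template and three 41 bit instructions into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128 bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots


# Arbitrary example values, not real IA-64 encodings.
bundle = pack_bundle(0x10, [1, 2, SLOT_MASK])
```

Packing the largest possible slot value into slot 2 shows that the three instructions and the template fill the 128 bits exactly, with nothing left over.
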
Assembly Language Format
    [qp] mnemonic [.comp] dest = srcs //
  • qp - predicate register
      - 1 at execution: execute and commit result to hardware
      - 0: result is discarded
  • mnemonic - name of instruction
  • comp - one or more instruction completers used to qualify mnemonic
  • dest - one or more destination operands
  • srcs - one or more source operands
  • // - comment
  • Instruction groups and stops indicated by ;;
      - Sequence without read after write or write after write
      - Do not need hardware register dependency checks
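
The effect of the qualifying predicate can be modelled in a few lines of Python. This is a behavioural sketch, not how the hardware is built: the `execute` helper, the register dictionary and the predicate list are all hypothetical names introduced for illustration.

```python
def execute(predicates, registers, qp, dest, op, *srcs):
    """Run one predicated instruction.

    The result is always computed, but it is committed to the destination
    register only when predicate register qp holds 1.
    """
    result = op(*(registers[name] for name in srcs))
    if predicates[qp]:
        registers[dest] = result
    return registers


def add(a, b):
    return a + b


predicates = [1, 0, 1]  # index 0 stands for the always-1 predicate
registers = {"r1": 5, "r2": 7, "r3": 0, "r4": 0}

execute(predicates, registers, 1, "r3", add, "r1", "r2")  # predicate 0: discarded
execute(predicates, registers, 2, "r4", add, "r1", "r2")  # predicate 1: committed
```

After both calls `r3` is unchanged and `r4` holds the sum, which is the commit-or-discard behaviour described for qp.
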
Assembly Examples
    ld8 r1 = [r5] ;;    // first group
    add r3 = r1, r4     // second group
  • Second instruction depends on value in r1
      - Changed by first instruction
      - Cannot be in same group for parallel execution
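
The grouping rule in this example can be checked mechanically. The Python sketch below splits a sequence at stops and rejects any group that contains a read after write or write after write on a register. It deliberately ignores the special cases the real architecture allows, and the tuple layout `(dests, srcs, stop)` and function names are hypothetical.

```python
def split_groups(instructions):
    """Split (dests, srcs, stop) tuples into instruction groups at each stop."""
    groups, current = [], []
    for dests, srcs, stop in instructions:
        current.append((dests, srcs))
        if stop:  # a ';;' ends the current group
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def group_is_legal(group):
    """True if no register is read or rewritten after being written in the group."""
    written = set()
    for dests, srcs in group:
        if written & set(srcs):   # read after write
            return False
        if written & set(dests):  # write after write
            return False
        written |= set(dests)
    return True


with_stop = [
    (["r1"], ["r5"], True),         # ld8 r1 = [r5] ;;
    (["r3"], ["r1", "r4"], False),  # add r3 = r1, r4
]
without_stop = [
    (["r1"], ["r5"], False),
    (["r3"], ["r1", "r4"], False),
]

groups_ok = split_groups(with_stop)
groups_bad = split_groups(without_stop)
```

With the stop in place the two instructions fall into separate legal groups; without it they form one group that fails the check because `add` reads `r1` after `ld8` writes it.
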
Predication
Speculative Loading
Control & Data Speculation
  • Control
      - AKA speculative loading
      - Load data from memory before needed
  • Data
      - Load moved before store that might alter memory location
      - Subsequent check on value
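
Data speculation can be pictured with a small Python model: a load is hoisted above a store, a table remembers which address the load used, and a later check reloads the value only if an intervening store hit that address. The class and method names are hypothetical, and the tracking table is a stand-in for whatever mechanism the hardware uses; the slide does not describe it.

```python
class SpeculativeMemory:
    """Toy model of a load moved ahead of a store, with a later check."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.tracked = {}  # register name -> address of its early load

    def advanced_load(self, reg, addr):
        """Load early and remember the address for the later check."""
        self.tracked[reg] = addr
        return self.memory[addr]

    def store(self, addr, value):
        """Store, invalidating any early load from the same address."""
        self.memory[addr] = value
        for reg, tracked_addr in list(self.tracked.items()):
            if tracked_addr == addr:
                del self.tracked[reg]

    def check_load(self, reg, addr, speculative_value):
        """Keep the early value if still valid, otherwise reload it."""
        if reg in self.tracked:
            return speculative_value
        return self.memory[addr]


mem = SpeculativeMemory({0x100: 1, 0x200: 2})

early = mem.advanced_load("r4", 0x100)
mem.store(0x200, 9)                       # different address: no conflict
safe = mem.check_load("r4", 0x100, early)

early2 = mem.advanced_load("r6", 0x200)
mem.store(0x200, 42)                      # same address: early value is stale
fixed = mem.check_load("r6", 0x200, early2)
```

The first check keeps the early value, while the second one detects the conflicting store and returns the updated contents.
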
Software Pipelining
    L1: ld4 r4=[r5],4 ;;    // cycle 0 load postinc 4
        add r7=r4,r9 ;;     // cycle 2
        st4 [r6]=r7,4       // cycle 3 store postinc 4
        br.cloop L1 ;;      // cycle 3
  • Adds constant to one vector and stores result in another
  • No opportunity for instruction level parallelism
  • Instructions in iteration x all executed before iteration x+1 begins
  • If no address conflicts between loads and stores, can move independent instructions from loop x+1 to loop x
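
For readers less familiar with the assembly, the loop does roughly the following. This Python equivalent is a sketch with hypothetical names; the comments map each statement to the instruction it stands for.

```python
def add_constant(src, constant):
    """Add a constant to every element of src and return the new vector."""
    dst = []
    for value in src:              # br.cloop L1: one pass per element
        loaded = value             # ld4 r4=[r5],4  (load, post-increment)
        total = loaded + constant  # add r7=r4,r9
        dst.append(total)          # st4 [r6]=r7,4  (store, post-increment)
    return dst


result = add_constant([1, 2, 3], 10)
```

Each pass finishes its load, add and store before the next pass starts, which is why the original loop offers no instruction level parallelism.
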
Unrolled Loop
    ld4 r32=[r5],4 ;;   // cycle 0
    ld4 r33=[r5],4 ;;   // cycle 1
    ld4 r34=[r5],4      // cycle 2
    add r36=r32,r9 ;;   // cycle 2
    ld4 r35=[r5],4      // cycle 3
    add r37=r33,r9      // cycle 3
    st4 [r6]=r36,4 ;;   // cycle 3
    ld4 r36=[r5],4      // cycle 4
    add r38=r34,r9      // cycle 4
    st4 [r6]=r37,4 ;;   // cycle 4
    add r39=r35,r9      // cycle 5
    st4 [r6]=r38,4 ;;   // cycle 5
    add r40=r36,r9      // cycle 6
    st4 [r6]=r39,4 ;;   // cycle 6
    st4 [r6]=r40,4 ;;   // cycle 7
Unrolled Loop Detail
  • Completes 5 iterations in 7 cycles
      - Compared with 20 cycles in original code
  • Assumes two memory ports
      - Load and store can be done in parallel
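
The cycle counts can be reproduced with a short Python sketch. It assumes the timing implied by the listings: iteration i loads in cycle i, its add issues two cycles after the load, and its store one cycle after the add. The constant and function names are hypothetical.

```python
LOAD_TO_ADD = 2   # assumed: add issues two cycles after its load
ADD_TO_STORE = 1  # assumed: store issues one cycle after its add


def pipelined_schedule(iterations):
    """Map each cycle number to the (operation, iteration) pairs issued in it."""
    schedule = {}
    for i in range(iterations):
        schedule.setdefault(i, []).append(("ld4", i))
        schedule.setdefault(i + LOAD_TO_ADD, []).append(("add", i))
        schedule.setdefault(i + LOAD_TO_ADD + ADD_TO_STORE, []).append(("st4", i))
    return schedule


schedule = pipelined_schedule(5)
last_cycle = max(schedule)  # cycle in which the final store issues

# Original loop: each iteration occupies cycles 0..3 before the next begins.
cycles_per_iteration = LOAD_TO_ADD + ADD_TO_STORE + 1
sequential_cycles = 5 * cycles_per_iteration
```

In this model the final store issues in cycle 7, against 20 cycles for five passes of the original loop, and no cycle needs more than one load and one store, which matches the two memory port assumption.
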
Software Pipeline Example Diagram
Support For Software Pipelining
  • Automatic register renaming
      - Fixed size area of predicate and fp register files (p16-p63, fr32-fr127) and programmable size area of gp register file (max r32-r127) capable of rotation
      - Loop using r32 on first iteration automatically uses r33 on second
  • Predication
      - Each instruction in loop predicated on rotating predicate register
          - Determines whether pipeline is in prolog, kernel or epilog
  • Special loop termination instructions
      - Branch instructions that cause registers to rotate and loop counter to decrement
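
Register rotation can be sketched as a renaming offset applied to every register reference. In the Python model below, a value written to r32 in one iteration is visible as r33 after one rotation and as r34 after two. The class name, the `rrb` attribute and the direction of the offset are assumptions made for this illustration, not details taken from the slide.

```python
class RotatingFile:
    """Toy model of a rotating register region starting at r32."""

    def __init__(self, size=96, base=32):
        self.size = size
        self.base = base
        self.rrb = 0  # rename offset, changed by each rotation
        self.physical = [0] * size

    def _index(self, logical):
        """Translate a logical register number to a physical slot."""
        return (logical - self.base + self.rrb) % self.size

    def read(self, logical):
        return self.physical[self._index(logical)]

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def rotate(self):
        """Model the loop branch: every value moves up one register name."""
        self.rrb = (self.rrb - 1) % self.size


regs = RotatingFile()
regs.write(32, 111)   # iteration 1 writes r32
regs.rotate()
regs.write(32, 222)   # iteration 2 writes r32 again, without clobbering 111
regs.rotate()
```

Because the renaming happens on the loop branch, the same loop body can keep writing r32 while earlier results remain available under higher register numbers.
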
IA-64 Register Set
IA-64 Registers (1)
  • General Registers
      - 128 gp 64 bit registers
      - r0-r31 static
          - References interpreted literally
      - r32-r127 can be used as rotating registers for software pipeline or register stack
          - References are virtual
          - Hardware may rename dynamically
  • Floating Point Registers
      - 128 fp 82 bit registers
      - Will hold IEEE 754 double extended format
      - fr0-fr31 static, fr32-fr127 can be rotated for pipeline
  • Predicate registers
      - 64 1 bit registers used as predicates
      - pr0 always 1 to allow unpredicated instructions
      - pr1-pr15 static, pr16-pr63 can be rotated
IA-64 Registers (2)
  • Branch registers
      - 8 64 bit registers
  • Instruction pointer
      - Bundle address of currently executing instruction
  • Current frame marker
      - State info relating to current general register stack frame
      - Rotation info for fr and pr
  • User mask
      - Set of single bit values
      - Alignment traps, performance monitors, fp register usage monitoring
  • Performance monitoring data registers
      - Support performance monitoring hardware
  • Application registers
      - Special purpose registers
Register Stack
  • Avoids unnecessary movement of data at procedure call & return
  • Provides procedure with new frame of up to 96 registers on entry
      - r32-r127
  • Compiler specifies required number
      - Local
      - Output
  • Registers renamed so local registers from previous frame hidden
  • Output registers from calling procedure now have numbers starting r32
  • Physical registers r32-r127 allocated in circular buffer to virtual registers
  • Hardware moves register contents between registers and memory if more registers needed
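
The renaming at call and return can be sketched as a moving frame base over a circular buffer. The Python model below is a simplification under stated assumptions: all names are hypothetical, the callee initially sees only the caller's output registers, and the spilling of registers to memory is not modelled at all.

```python
class RegisterStack:
    """Toy model of the general register stack (no spill to memory)."""

    FIRST = 32  # first stacked register name

    def __init__(self, physical_size=96):
        self.physical = [0] * physical_size
        self.saved = []  # frames of suspended callers
        self.base, self.n_local, self.n_output = 0, 0, 0

    def alloc(self, n_local, n_output):
        """Set the current frame size, as the compiler would request."""
        if n_local + n_output > len(self.physical):
            raise ValueError("frame larger than the stacked register file")
        self.n_local, self.n_output = n_local, n_output

    def _index(self, reg):
        offset = reg - self.FIRST
        if not 0 <= offset < self.n_local + self.n_output:
            raise IndexError("register outside the current frame")
        return (self.base + offset) % len(self.physical)  # circular buffer

    def read(self, reg):
        return self.physical[self._index(reg)]

    def write(self, reg, value):
        self.physical[self._index(reg)] = value

    def call(self):
        """Hide the caller's locals; its outputs become r32 onward."""
        self.saved.append((self.base, self.n_local, self.n_output))
        self.base += self.n_local
        self.n_local = 0

    def ret(self):
        """Restore the caller's frame."""
        self.base, self.n_local, self.n_output = self.saved.pop()


stack = RegisterStack()
stack.alloc(n_local=2, n_output=1)  # caller: r32-r33 local, r34 output
stack.write(32, 10)
stack.write(34, 99)                 # argument for the callee
stack.call()
seen_by_callee = stack.read(32)     # caller's r34 under its new name
stack.alloc(n_local=3, n_output=0)
stack.write(33, 5)                  # callee local, caller's r32 untouched
stack.ret()
```

No data is copied at the call: the argument written to the caller's r34 is simply read by the callee as r32, and the caller's own registers reappear unchanged after the return.
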
Register Stack Behaviour
Register Formats
Itanium Organization
  • Superscalar features
      - Six wide, ten stage deep hardware pipeline
      - Dynamic prefetch
      - Branch prediction
      - Register scoreboard to optimise for compile time nondeterminism
  • EPIC features
      - Hardware support for predicated execution
      - Control and data speculation
      - Software pipelining
Itanium Processor Diagram
