1
PRAGMATIC
OPTIMIZATION
IN MODERN PROGRAMMING
MODERN COMPUTER ARCHITECTURE CONCEPTS
Created by Marina (geek) Kolpakova for UNN / 2015-2016
2
COURSE TOPICS
Ordering optimization approaches
Demystifying a compiler
Mastering compiler optimizations
Modern computer architectures concepts
3
OUTLINE
Three aspects of the computer architecture
Latency vs Throughput architectures
Architecture families
CISC
RISC
VLIW
Vector
Why is it going to be load/store?
Latest trends
Summary
4 . 1
1-ST ASPECT OF COMPUTER ARCHITECTURE
Instruction Set Architecture or ISA (interface)
is a contract between HW and SW,
which specifies rights, possibilities & limitations.
Class of ISA (load-store, register-memory)
Memory addressing modes & rules (base-immediate,
alignment requirements)
Types & sizes of operands (size of byte, short)
Operations (general arithmetic, control, logical)
Control flow instructions (branches, jumps, calls, returns)
Encoding an ISA (fixed or variable length)
All the conceptual aspects of the architecture
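As a minimal sketch of two of the ISA-level rules listed above (base-immediate addressing and alignment requirements), the following Python model computes an effective address and checks it against an alignment rule. The function names and the "natural alignment" rule are illustrative assumptions, not taken from any particular ISA.

```python
def effective_address(base: int, imm: int) -> int:
    """Base-immediate addressing: address = base register value + immediate offset."""
    return base + imm


def is_naturally_aligned(addr: int, size: int) -> bool:
    """Assumed rule: an operand of `size` bytes must start at a multiple of `size`."""
    return addr % size == 0


addr = effective_address(0x1000, 6)
# A 2-byte (short) access at this address is aligned, a 4-byte access is not.
print(hex(addr), is_naturally_aligned(addr, 2), is_naturally_aligned(addr, 4))
```

Whether a misaligned access faults, is split into several accesses, or simply runs slower is exactly the kind of detail an ISA has to specify.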
4 . 2
2-ND ASPECT OF COMPUTER ARCHITECTURE
Microarchitecture (organization) is a concrete
implementation of the ISA, the high-level aspects of a
processor design (memory system, memory interconnect,
design of the processor internals).
Pipeline width
Instruction latencies
Issue width and scheduling
Speculation capabilities
All the concrete aspects of the architecture
4 . 3
3-RD ASPECT OF COMPUTER ARCHITECTURE
Hardware or chip (design) is the specifics of a computer,
including the logic design and packaging. This is a concrete
implementation of the microarchitecture.
Tech-process
Clock rates
On die placement
All the properties of the chip
5 . 1
ARCHITECTURE: ARMV8-A
μarch | IP | Hardware
Cortex-A53 | ARM | Octa Exynos 7 (7580), 1.6 GHz, 28nm HKMG
Cortex-A57 | ARM | Octa Exynos 7 (7420), big.LITTLE 2.1/1.5 GHz, 14FF (LPE) (Samsung)
Cortex-A72 | ARM | Deca MediaTek Helio X20, big.LITTLE 2.5/2.0/1.4 GHz, 20nm HKMG (TSMC)
Cortex-A35 | ARM | -
5 . 2
ARCHITECTURE: ARMV8-A
μarch | IP | Hardware
Denver | NVIDIA | Dual Tegra K1, 2.3 GHz, 28nm HPM
Kryo | Qualcomm | Tetra S820, big.LITTLE 2.2/1.6 GHz, 14FF (LPP) (Samsung)
Exynos M1 | Samsung | Quad Exynos 8890, big.LITTLE 2.6/2.29 GHz, 14FF (LPP) (Samsung)
5 . 3
ARCHITECTURE: ARMV8-A
μarch | IP | Hardware
Cyclone | Apple | Dual A7 (APL0698), 1.4 GHz, 28nm HKMG (Samsung)
Typhoon | Apple | Dual A8 (APL1012), 1.3 GHz, 20nm HKMG (TSMC)
Twister | Apple | Dual A9 (APL0898), 1.85 GHz, 16nm FF+ (TSMC)
Twister | Apple | Dual A9 (APL1022), 1.85 GHz, 14nm FF (LPP) (Samsung)
6 . 1
LATENCY VS THROUGHPUT ARCHITECTURES
Latency oriented architecture
addresses latency hiding issues;
features sophisticated pipelining;
executes instructions out of order;
employs advanced cache hierarchies;
widely uses speculation.
Compute cores occupy only a small part of a die.
6 . 2
LATENCY VS THROUGHPUT ARCHITECTURES
Throughput oriented architecture
keeps a bunch of operations in flight;
features many simple compute units/cores;
employs simple pipelines and a large register file to
provide low-cost thread scheduling;
uses wide buses, tiling, programmable local memory.
Compute cores occupy most of a die.
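The trade-off between the two styles can be illustrated with a deliberately crude timing model: a few fast lanes against many slow ones. The function name and all numbers below are hypothetical, and the model ignores pipelining, memory and scheduling overheads.

```python
import math


def batch_time(n_items: int, latency: int, lanes: int) -> int:
    """Time units needed when `lanes` items are processed at once
    and each pass over the lanes takes `latency` time units."""
    return math.ceil(n_items / lanes) * latency


# Hypothetical machines: few fast lanes vs. many slow lanes.
latency_oriented = dict(latency=1, lanes=4)
throughput_oriented = dict(latency=10, lanes=1024)

for n in (8, 100_000):
    print(n,
          batch_time(n, **latency_oriented),
          batch_time(n, **throughput_oriented))
```

Under these assumed numbers the latency-oriented machine wins on the small batch, while the throughput-oriented one wins once there is enough independent work to fill its lanes.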
7
KEY ARCHITECTURE FAMILIES
RISC
Reduced Instruction Set Computer
CISC
Complex Instruction Set Computer
VLIW
Very Long Instruction Word
Vector architecture
8 . 1
CISC
Complex Instruction Set Computer
Designed in the 1970s, a time when transistors were
expensive and compilers were naive. Additionally,
instruction packing was the main concern due to
shortage of memory. Memory latency was just a
bit higher than register latency.
The goal was to define an instruction set that allows
high-level language constructs to be translated into as few
assembly language instructions as possible, improving
performance as well as code density.
Examples are VAX, x86, AMD64.
Latency-oriented architecture.
8 . 2
CISC
Heritage
Instructions access memory; there are plenty of addressing
modes, many instruction families and a very rich
variable-length ISA (alignment counts!);
consequently, instruction decoding logic is complicated.
Moreover, only a few registers are available to programmers.
Nowadays
1. Instructions are broken down into μops, which are
much easier to pipeline and to process power-efficiently.
2. Transistors are spent on cache hierarchies, out-of-order
execution, large reorder buffers (ROB) and speculation to eliminate stalls.
3. Symmetric multi-processing.
9 . 1
RISC
Reduced Instruction Set Computer
Designed in the 1980s, a time when instruction-level
parallelism (ILP) was the great concern. The memory-processor
gap had already begun to emerge.
The goal was to decrease the number of clocks per
instruction (CPI) while pipelining instructions as much as
possible, employing hardware to help with it. A uniform ISA,
pipelining and a large register file are must-haves.
Examples are MIPS, ARM, PowerPC.
Latency-oriented architecture.
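Why lowering CPI matters follows from the standard performance equation: execution time = instruction count × CPI / clock rate. The sketch below plugs in made-up numbers for two hypothetical machines; none of the figures come from measurements.

```python
def cpu_time(instruction_count: float, cpi: float, clock_hz: float) -> float:
    """Execution time in seconds: instructions * cycles per instruction / cycles per second."""
    return instruction_count * cpi / clock_hz


# Hypothetical program compiled for two hypothetical machines at the same clock:
# fewer but slower instructions vs. more but faster instructions.
dense_code = cpu_time(1.0e9, cpi=4.0, clock_hz=1.0e9)
simple_code = cpu_time(1.5e9, cpi=1.2, clock_hz=1.0e9)
print(dense_code, simple_code)
```

With these assumed figures the machine that executes 50% more instructions still finishes sooner, because each instruction costs far fewer cycles.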
9 . 2
RISC
Heritage
Relatively few instructions, all are the same length.
Only load and store instructions access memory.
Larger register file than typical CISC processors have.
No μcode
Nowadays
Most architectures that come from RISC are
called load-store architectures, though they may employ μops.
They combine concepts of a classic RISC with usage of
modern hardware enhancements:
1. deep pipelines, multi-cycle instructions,
2. out-of-order execution,
3. speculation.
10
THE HARDWARE/SOFTWARE GAP
Compiler
analyzes control flow, analyzes dependencies
schedules instructions
maps variables to a limited register set
Hardware
analyzes control flow, analyzes dependencies
schedules instructions
remaps ISA registers to a large internal register set
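The last hardware step, remapping ISA registers to a larger internal set, can be sketched as a toy register renamer. This is an illustration under simplifying assumptions: physical registers are never freed, every instruction has the hypothetical form (dest, src1, src2), and the register names are made up.

```python
def rename(instructions, num_physical):
    """Map architectural register names to physical register numbers.

    Every write gets a fresh physical register, so a later write to the
    same architectural register no longer conflicts with earlier readers.
    """
    free = list(range(num_physical))  # simplification: registers are never recycled
    mapping = {}                      # architectural name -> current physical register
    renamed = []
    for dest, *srcs in instructions:
        phys_srcs = []
        for s in srcs:
            if s not in mapping:      # first use of a live-in value
                mapping[s] = free.pop(0)
            phys_srcs.append(mapping[s])
        mapping[dest] = free.pop(0)   # fresh register for every write
        renamed.append((mapping[dest], *phys_srcs))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r1", "r1"),  # reads r1
    ("r1", "r5", "r6"),  # overwrites r1: independent of the two instructions above
    ("r7", "r1", "r4"),
]
for line in rename(program, 16):
    print(line)
```

After renaming, the third instruction writes a different physical register than the first, so it does not have to wait for the second instruction to read the old value of r1.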
11
A WORD TOWARDS REGISTERS
Indeed, registers are temporary storage locations inside the
processor that hold data and addresses.
Local variables are not the same as registers in the ISA, since
the compiler uses IR internally and does register allocation
close to the end of the optimization process.
Registers provided by the ISA are not the same as the actual
registers on the processor. Internal reorder buffers, which
hold decoded instruction parameters and intermediate
results, are closer to the classic definition of a register file.
12 . 1
VLIW
Very Long Instruction Word
Designed in the 1980s, a time when instruction-level
parallelism (ILP) was the great concern.
The goal was to pipeline instructions as much as possible,
employing software to help with it, reducing the complexity of
the hardware and mitigating the hardware/software
gap, and to boost the processor clock by simplifying the work per cycle.
Example is Intel/HP Itanium.
Throughput-oriented architecture.
12 . 2
VLIW
Heritage
Compiler determines which instructions can be
performed in parallel,
bundles this information and the instructions,
and passes the bundle(word) to the hardware.
No data dependencies between instructions in a word.
Each operation in a word is assigned to a specific issue slot
(dedicated FU).
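The compiler's job described above can be sketched as a greedy bundler that packs independent operations into fixed-width words. This is a toy model, assuming each register is written only once and ignoring the assignment of operations to dedicated functional units; the operation format (dest, sources) is hypothetical.

```python
def bundle(ops, width):
    """Pack operations into bundles (words) of at most `width` operations.

    ops: list of (dest, sources) in program order; each dest is written once.
    Returns a list of bundles, each a list of indices into `ops`.
    An operation is placed no earlier than the bundle after its producers,
    so operations inside one bundle never depend on each other.
    """
    bundles = []
    ready_at = {}  # register -> index of the first bundle allowed to read it
    for i, (dest, srcs) in enumerate(ops):
        b = max((ready_at.get(s, 0) for s in srcs), default=0)
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1                      # bundle is full, try the next one
        if b == len(bundles):
            bundles.append([])
        bundles[b].append(i)
        ready_at[dest] = b + 1
    return bundles


ops = [
    ("a", ["x", "y"]),
    ("b", ["x", "z"]),
    ("c", ["a", "b"]),
    ("d", ["y", "z"]),
    ("e", ["c", "d"]),
]
print(bundle(ops, width=2))
```

Five operations fit into three two-wide words here; the last word is half empty, which hints at why code with little independent work suits VLIW poorly.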
12 . 3
VLIW
Nowadays
Hardly any generic processor implements VLIW, because of
the branchy nature of production code (in contrast to
HPC or scientific code),
the need to keep binary compatibility across
μarchitecture families.
However, the architecture is widely adopted for
programmable co-processors where reducing power
consumption without loss of performance is crucial
(DSP, GPU).
13 . 1
VECTOR PROCESSORS
First introduced in 1976, they dominated HPC in the
1980s because of high instruction throughput.
The goal was to perform operations on vectors of data,
exposing data-level parallelism (DLP) to increase
instruction throughput. Vector pipelining is also called
chaining.
Example is Cray.
Throughput-oriented architectures.
13 . 2
VECTOR PROCESSORS
Heritage
Process the data in vectors; each element in a vector
(lane) is independent of any other.
Deep pipelines, wide execution units, not necessarily the
same width (batch length) as the size of a vector in elements
(vector length).
Most efficient for simple memory patterns, but
gather/scatter is usually possible too.
Wide memory interfaces to saturate execution units.
Large vector register file; cache is not a strict
requirement and is absent in classical vector processors.
13 . 3
VECTOR PROCESSORS
Nowadays
They aren't used in generic processor designs, but are used
as co-processors for specific workloads: HPC,
multimedia.
Precursors of most designs of modern GPUs.
Vector pipes with short vector length (8-16 bytes) called
SIMD units are widely integrated in modern general
purpose processors to accelerate the most demanding
loops.
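How a loop maps onto a short-vector unit can be sketched as strip-mining: the array is processed in chunks of one vector length, with a shorter tail chunk at the end. The Python model below only counts "vector instructions"; the function name and the default vector length of 4 elements are assumptions for illustration.

```python
def vector_add(a, b, vector_length=4):
    """Add two equally long lists chunk by chunk.

    Each loop iteration models one vector instruction whose lanes
    are independent of each other. Returns (result, vector_ops).
    """
    out = []
    vector_ops = 0
    for start in range(0, len(a), vector_length):
        lanes_a = a[start:start + vector_length]
        lanes_b = b[start:start + vector_length]  # the last chunk may be shorter
        out.extend(x + y for x, y in zip(lanes_a, lanes_b))
        vector_ops += 1
    return out, vector_ops


result, ops_issued = vector_add(list(range(10)), [10] * 10)
print(result, ops_issued)
```

Ten elements need three vector operations here instead of ten scalar additions; real SIMD code additionally has to deal with alignment and with how the tail is handled.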
14 . 1
WHY IS IT GOING TO BE RISC LOAD/STORE?
1. Simple fixed-width instructions & few addressing modes
Cache-efficient instruction fetch, branches are aligned.
Simple hardware logic → power-efficient chips.
Drive a higher clock rate.
2. Concise ISA with orthogonal functionality
Complex instructions are ignored by compilers due to
semantic gap → simple instructions simplify scheduling.
Complex addressing leads either to variable-length
instructions or big instruction size → inefficient
decoding and scheduling as well as alignment issues.
3. Large register set
Expose possible instruction parallelism to the compiler.
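The decoding argument in point 1 can be made concrete by comparing how instruction boundaries are found. The sketch assumes a hypothetical variable-length encoding in which the first byte of every instruction stores its total length; no real ISA is modeled.

```python
def fixed_boundaries(code: bytes, width: int = 4):
    """Fixed-width ISA: every start offset is known without reading the bytes."""
    return list(range(0, len(code), width))


def variable_boundaries(code: bytes):
    """Hypothetical variable-length ISA: byte 0 of each instruction is its length.

    The start of instruction N is only known after instructions 0..N-1
    have been (at least partially) decoded, which serializes the work.
    """
    offsets = []
    pc = 0
    while pc < len(code):
        length = code[pc]
        if length == 0:
            raise ValueError(f"invalid zero-length instruction at offset {pc}")
        offsets.append(pc)
        pc += length
    return offsets


print(fixed_boundaries(bytes(8)))
print(variable_boundaries(bytes([1, 3, 0xAA, 0xBB, 2, 0xCC, 1])))
```

In the fixed-width case several decoders can start at once on independent offsets, while the variable-length loop carries a dependency from one instruction to the next.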
15
LATEST TRENDS
Architecture is seen as Load-store RISC-inherited
Internally instructions are broken down into single-pipe
μops
μops are reordered and optionally organized into words
μops or words are scheduled for execution, caching in the
highest level is usually performed on this preprocessed
view.
Latest generations of Intel processors, NVIDIA Denver
architecture and 64-bit ARM Cortex-A processors already
employ this approach.
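Breaking instructions down into μops can be sketched with a toy "cracker" that turns register-memory instructions into load-store style μops. The instruction syntax, the temporary names t0/t1 and the one-μop-per-step split are assumptions made for illustration; real processors use their own, often undocumented, decompositions.

```python
def crack(instr):
    """Split an (op, dst, src) instruction into load/compute/store μops.

    Memory operands are written as "[name]" in this toy syntax.
    """
    op, dst, src = instr
    uops = []
    if src.startswith("["):             # memory source: load it into a temporary
        uops.append(("load", "t0", src))
        src = "t0"
    if dst.startswith("["):             # memory destination: read, modify, write back
        uops.append(("load", "t1", dst))
        uops.append((op, "t1", src))
        uops.append(("store", dst, "t1"))
    else:
        uops.append((op, dst, src))
    return uops


print(crack(("add", "rax", "rcx")))     # already register-register
print(crack(("add", "rax", "[rbx]")))   # load + add
print(crack(("add", "[rbx]", "rax")))   # load + add + store
```

Each resulting μop touches either memory or the ALU but not both, which is what lets a load-store style back end schedule them on single-purpose pipes.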
16 . 1
SUMMARY
There are three key aspects of computer architecture:
Instruction Set Architecture, μarchitecture and design.
Some architectures aim to hide latency while others aim to
maximize instruction throughput.
CISC was created for compact code size and exact instruction
encoding and is used only at the ISA level nowadays.
RISC leads to less complicated decoding, and its pipeline stages
allow boosting the clock within an affordable power budget.
VLIW targets power-efficient high-performance devices for
specific tasks or is used internally at the μarchitecture level.
Vector processors transformed into SIMD extensions and
SIMT-like GPU designs.
16 . 2
SUMMARY
Load-store architecture, with its simple fixed-width
instructions, few addressing modes, concise ISA
and optimally sized register set, is the winning solution.
An architecture can expose different properties
at its different levels (ISA, μarchitecture).
17
THE END
MARINA KOLPAKOVA / 2015-2016

