Pragmatic optimization in modern programming - modern computer architecture concepts

1
PRAGMATIC
OPTIMIZATION
IN MODERN PROGRAMMING
MODERN COMPUTER ARCHITECTURE CONCEPTS
Created by Marina (geek) Kolpakova for UNN / 2015-2016

2
COURSE TOPICS
Ordering optimization approaches
Demystifying a compiler
Mastering compiler optimizations
Modern computer architecture concepts

3
OUTLINE
Three aspects of the computer architecture
Latency vs Throughput architectures
Architecture families
CISC
RISC
VLIW
Vector
Why is it going to be load/store?
Latest trends
Summary

4 . 1
1-ST ASPECT OF COMPUTER ARCHITECTURE
Instruction Set Architecture or ISA (interface)
is a contract between HW and SW,
which specifies rights, possibilities & limitations.
Class of ISA (load-store, register-memory)
Memory addressing modes & rules (base-immediate,
alignment requirements)
Types & sizes of operands (size of byte, short)
Operations (general arithmetic, control, logical)
Control flow instructions (branches, jumps, calls, returns)
Encoding an ISA (fixed or variable length)
All the conceptual aspects of the architecture

4 . 2
2-ND ASPECT OF COMPUTER ARCHITECTURE
Microarchitecture (organization) is a concrete
implementation of the ISA, the high-level aspects of a
processor design (memory system, memory interconnect,
design of the processor internals).
Pipeline width
Instruction latencies
Issue width and scheduling
Speculation capabilities
All the concrete aspects of the architecture

4 . 3
3-RD ASPECT OF COMPUTER ARCHITECTURE
Hardware or chip (design) is the specifics of a computer,
including the logic design and packaging. This is a concrete
implementation of the microarchitecture.
Tech-process
Clock rates
On-die placement
All the properties of the chip

5 . 1
ARCHITECTURE: ARMV8-A
μarch | IP | Hardware
Cortex-A53 | ARM | Octa Exynos 7 (7580), 1.6GHz, 28nm HKMG
Cortex-A57 | ARM | Octa Exynos 7 (7420), big.LITTLE 2.1/1.5GHz, 14FF (LPE) (Samsung)
Cortex-A72 | ARM | Deca MediaTek Helio X20, big.LITTLE 2.5/2.0/1.4GHz, 20nm HKMG (TSMC)
Cortex-A35 | ARM | -

5 . 2
μarch | IP | Hardware
Denver | NVIDIA | Dual Tegra K1, 2.3GHz, 28nm HPM
Kryo | Qualcomm | Tetra S820, big.LITTLE 2.2/1.6GHz, 14FF (LPP) (Samsung)
Exynos M1 | Samsung | Quad Exynos 8890, big.LITTLE 2.6/2.29GHz, 14FF (LPP) (Samsung)

5 . 3
μarch | IP | Hardware
Cyclone | Apple | Dual A7 (APL0698), 1.4GHz, 28nm HKMG (Samsung)
Typhoon | Apple | Dual A8 (APL1012), 1.3GHz, 20nm HKMG (TSMC)
Twister | Apple | Dual A9 (APL1022), 1.85GHz, 16nm FF+ (TSMC)
Twister | Apple | Dual A9 (APL0898), 1.85GHz, 14nm FF (LPP) (Samsung)

6 . 1
LATENCY VS THROUGHPUT ARCHITECTURES
Latency oriented architecture
addresses latency hiding issues;
features sophisticated pipelining;
executes instructions out of order;
employs advanced cache hierarchies;
widely uses speculation.
Compute cores occupy only a small part of a die.

6 . 2
LATENCY VS THROUGHPUT ARCHITECTURES
Throughput oriented architecture
keeps a bunch of operations in flight;
features many simple compute units/cores;
employs simple pipelines and large register files to
provide low-cost thread scheduling;
uses wide buses, tiling, programmable local memory.
Compute cores occupy most of the die.

7
KEY ARCHITECTURE FAMILIES
RISC
Reduced Instruction Set Computer
CISC
Complex Instruction Set Computer
VLIW
Very Long Instruction Word
Vector architecture

8 . 1
CISC
Complex Instruction Set Computer
Designed in the 1970s, a time when transistors were
expensive and compilers were naive. Additionally,
instruction packing was the main concern due to the
shortage of memory. Memory latency was only slightly
higher than register latency.
The goal was to define an instruction set that allows
high-level language constructs to be translated into as few
assembly language instructions as possible, improving
performance as well as code density.
Examples are VAX, x86, AMD64.
Latency-oriented architecture.

8 . 2
CISC
Heritage
Instructions access memory directly, with plenty of addressing modes,
many instruction families and a very rich variable-length
ISA (alignment counts!);
consequently, the instruction decoding logic is complicated.
Moreover, only a few registers are available to programmers.
Nowadays
1. Instructions are broken down into micro-operations (μops), which are
much easier to pipeline and to process power-efficiently.
2. Transistors are spent on cache hierarchies, out-of-order
execution, large reorder buffers (RB) and speculation to eliminate stalls.
3. Symmetric multi-processing.
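As a rough illustration of breaking a register-memory instruction into simpler internal operations, here is a minimal Python sketch. The `MicroOp` type, the `crack_add` helper, the `tmp0` temporary and the "[addr]" operand notation are all hypothetical; real decoders differ per μarchitecture and may fuse or split operations differently.

```python
from typing import List, NamedTuple


class MicroOp(NamedTuple):
    kind: str  # "load", "alu" or "store"
    dst: str   # destination register, or memory operand for a store
    src: str   # source register or memory operand


def is_mem(operand: str) -> bool:
    """Operands written as "[addr]" are treated as memory operands."""
    return operand.startswith("[") and operand.endswith("]")


def crack_add(dst: str, src: str) -> List[MicroOp]:
    """Split a CISC-style `add dst, src` into load/ALU/store micro-ops.

    `tmp0` stands for an internal temporary register (an assumption of
    this sketch, not a real architectural name).
    """
    if is_mem(dst) and is_mem(src):
        raise ValueError("at most one memory operand is modelled here")

    if is_mem(src):
        # add reg, [mem] -> load tmp0, [mem]; alu reg, tmp0
        return [MicroOp("load", "tmp0", src), MicroOp("alu", dst, "tmp0")]
    if is_mem(dst):
        # add [mem], reg -> load; alu; store back to the same address
        return [
            MicroOp("load", "tmp0", dst),
            MicroOp("alu", "tmp0", src),
            MicroOp("store", dst, "tmp0"),
        ]
    # add reg, reg needs no memory access at all
    return [MicroOp("alu", dst, src)]
```

Each resulting operation touches either memory or the ALU, never both, which is what makes the internal stream look load-store even when the ISA is register-memory.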

9 . 1
RISC
Reduced Instruction Set Computer
Designed in the 1980s, a time when ILP (instruction-level
parallelism) was the main concern. The memory-processor gap
had already begun to emerge.
The goal was to decrease the number of clocks per
instruction (CPI) while pipelining instructions as much as
possible, employing hardware to help with it. A uniform ISA,
pipelining and a large register file are must-haves.
Examples are MIPS, ARM, PowerPC.
Latency-oriented architecture.
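The role of CPI can be made concrete with the classic performance equation, time = instruction count × CPI / clock frequency. The sketch below is only an illustration: the function name and all numbers in the example are made up, not measurements of any real processor.

```python
def execution_time(instruction_count: int, cpi: float, clock_hz: float) -> float:
    """Return run time in seconds: instructions * CPI / clock frequency."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return instruction_count * cpi / clock_hz


# Illustrative, made-up numbers: the same program compiled to fewer but
# slower instructions versus more but faster ones, at the same clock.
few_complex = execution_time(1_000_000, cpi=4.0, clock_hz=1e9)
many_simple = execution_time(1_500_000, cpi=1.2, clock_hz=1e9)
```

With these assumed inputs the variant with more instructions still finishes sooner, because the drop in CPI outweighs the growth in instruction count.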

9 . 2
RISC
Heritage
Relatively few instructions, all of the same length.
Only load and store instructions access memory.
Larger register file than typical CISC processors have.
No μcode.
Nowadays
Most architectures that come from RISC are
called load-store architectures, and they may employ μops.
They combine the concepts of classic RISC with the use of
modern hardware enhancements:
1. deep pipelines, multi-cycle instructions,
2. out-of-order execution,
3. speculation.

10
THE HARDWARE/SOFTWARE GAP
Compiler
analyzes control flow, analyzes dependencies
schedules instructions
maps variables to a limited register set
Hardware
analyzes control flow, analyzes dependencies
schedules instructions
remaps ISA registers to a large internal register set

11
A WORD TOWARDS REGISTERS
Indeed, registers are temporary storage locations inside the
processor that hold data and addresses.
Local variables are not the same as ISA registers, since the
compiler uses an IR internally and does register allocation
close to the end of the optimization process.
Registers provided by the ISA are not the same as the actual
registers on the processor. Internal reorder buffers, which
hold decoded instruction parameters and intermediate
results, are closer to the classic definition of a register file.
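The difference between ISA registers and internal registers can be sketched with a toy register renamer. This is a simplified model under stated assumptions: the `rename` function, the `(dst, src1, src2)` instruction tuples and the `p0, p1, ...` physical names are hypothetical, and freeing of physical registers is not modelled.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str]  # (dst, src1, src2) in architectural registers


def rename(program: List[Instr], num_physical: int) -> List[Instr]:
    """Map architectural registers to fresh physical registers.

    Every write gets a new physical register, which removes
    write-after-write and write-after-read hazards between instructions.
    """
    mapping: Dict[str, str] = {}
    next_free = 0

    def alloc() -> str:
        nonlocal next_free
        if next_free >= num_physical:
            raise RuntimeError("out of physical registers")
        name = f"p{next_free}"
        next_free += 1
        return name

    def lookup(arch: str) -> str:
        # A register read before any write still needs a physical home.
        if arch not in mapping:
            mapping[arch] = alloc()
        return mapping[arch]

    renamed: List[Instr] = []
    for dst, src1, src2 in program:
        s1 = lookup(src1)      # read the sources first...
        s2 = lookup(src2)
        mapping[dst] = alloc()  # ...then give the destination a fresh register
        renamed.append((mapping[dst], s1, s2))
    return renamed
```

In a program that writes `r1` twice, the two writes land in different physical registers, so the second write no longer has to wait for earlier readers of the first value.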

12 . 1
VLIW
Very Long Instruction Word
Designed in the 1980s, a time when ILP (instruction-level
parallelism) was the main concern.
The goal was to pipeline instructions as much as possible,
employing software to help with it, thereby reducing the complexity of
the hardware and mitigating the hardware/software
gap, and to boost the processor clock by simplifying the work per cycle.
Example is Intel/HP Itanium.
Throughput-oriented architecture.

12 . 2
VLIW
Heritage
Compiler determines which instructions can be
performed in parallel,
bundles this information and the instructions,
and passes the bundle(word) to the hardware.
No data dependencies between instructions in a word.
Each operation in a word is assigned to a specific issue slot
(dedicated FU).
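The compiler's bundling job can be sketched as a greedy, in-order packer. This is a toy model, not how a production VLIW compiler schedules: the `Op` type, the unit names and the conservative treatment of write-after-read as a conflict are assumptions of this sketch.

```python
from typing import Dict, List, NamedTuple, Tuple


class Op(NamedTuple):
    unit: str               # issue slot / functional unit, e.g. "alu", "mem"
    dst: str                # register written
    srcs: Tuple[str, ...]   # registers read


def depends(later: Op, earlier: Op) -> bool:
    """True if `later` may not share a word with `earlier`."""
    raw = earlier.dst in later.srcs   # read after write
    waw = earlier.dst == later.dst    # write after write
    war = later.dst in earlier.srcs   # write after read (conservative)
    return raw or waw or war


def bundle(ops: List[Op]) -> List[Dict[str, Op]]:
    """Pack ops, in program order, into words with one op per unit."""
    words: List[Dict[str, Op]] = []
    current: Dict[str, Op] = {}
    for op in ops:
        slot_taken = op.unit in current
        dependent = any(depends(op, placed) for placed in current.values())
        if slot_taken or dependent:
            # Close the current word and start a new one.
            words.append(current)
            current = {}
        current[op.unit] = op
    if current:
        words.append(current)
    return words
```

Because a word is closed as soon as a slot conflict or a dependency appears, every emitted word holds only mutually independent operations, each in its own issue slot.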

12 . 3
VLIW
Nowadays
Hardly any general-purpose processor implements VLIW, because of
the branchy nature of production code (in contrast to
HPC or scientific code),
the need to maintain binary compatibility across
μarchitecture families.
However, the architecture is widely adopted for
programmable co-processors, where reducing power
consumption without loss of performance is crucial
(DSP, GPU).

13 . 1
VECTOR PROCESSORS
First introduced in 1976; dominated HPC in the
1980s because of high instruction throughput.
The goal was to perform operations on vectors of data,
exposing data-level parallelism (DLP) to increase
instruction throughput. Pipelining one vector operation's
results directly into the next is called chaining.
Example is Cray.
Throughput-oriented architecture.

13 . 2
VECTOR PROCESSORS
Heritage
Process the data in vectors; each element in a vector
(lane) is independent of any other.
Deep pipelines and wide execution units, not necessarily of the
same width (batch length) as the size of the vector in elements
(vector length).
Most efficient for simple memory patterns, but
gather/scatter is usually possible too.
Wide memory interfaces to saturate the execution units.
Large vector register file; a cache is not a strict
requirement and is absent in classical vector processors.
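The distinction between vector length and batch length can be sketched with a strip-mined vector add. This is a toy functional model: the `vector_add` name, the `vector_length` and `lanes` parameters and the notion of counting "beats" are assumptions chosen for illustration.

```python
from typing import List, Tuple


def vector_add(
    a: List[float], b: List[float], vector_length: int, lanes: int
) -> Tuple[List[float], int]:
    """Add two arrays the way a strip-mined vector loop would.

    Returns the result and the number of execution batches ("beats") used.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if vector_length <= 0 or lanes <= 0:
        raise ValueError("vector_length and lanes must be positive")

    out = [0.0] * len(a)
    beats = 0
    # Strip-mining: one vector instruction per chunk of at most vector_length.
    for start in range(0, len(a), vector_length):
        stop = min(start + vector_length, len(a))
        # The execution unit processes up to `lanes` elements per beat.
        for lane_start in range(start, stop, lanes):
            lane_stop = min(lane_start + lanes, stop)
            for i in range(lane_start, lane_stop):  # lanes are independent
                out[i] = a[i] + b[i]
            beats += 1
    return out, beats
```

With a vector length of 8 and 4 lanes, a full vector instruction takes two beats, so the execution unit can be narrower than the vector register without changing the program.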

13 . 3
VECTOR PROCESSORS
Nowadays
They aren't used in general-purpose processor design, but are used
as co-processors for specific workloads: HPC,
multimedia.
Precursors of most modern GPU designs.
Vector pipes with short vector length (8-16 bytes), called
SIMD units, are widely integrated in modern general-purpose
processors to accelerate the most demanding
loops.

14 . 1
WHY IS IT GOING TO BE RISC LOAD/STORE?
1. Simple fixed-width instructions & few addressing modes
Cache-efficient instruction fetch, branch targets are aligned.
Simple hardware logic → power-efficient chips.
Drive a higher clock rate.
2. Concise ISA with orthogonal functionality
Complex instructions are ignored by compilers due to the
semantic gap → simple instructions simplify scheduling.
Complex addressing leads either to variable-length
instructions or to a large instruction size → inefficient
decoding and scheduling as well as alignment issues.
3. Large register set
Exposes possible instruction parallelism to the compiler.
15
LATEST TRENDS
The architecture is seen as load-store, RISC-inherited.
Internally, instructions are broken down into single-pipe
μops.
μops are reordered and optionally organized into words.
μops or words are scheduled for execution; caching at the
highest level is usually performed on this preprocessed
view.
Latest generations of Intel processors, the NVIDIA Denver
architecture and 64-bit ARM Cortex-A processors already
employ this approach.

16 . 1
SUMMARY
There are three key aspects of computer architecture:
Instruction Set Architecture, μarchitecture and design.
Some architectures aim to hide latency while others aim to
maximize instruction throughput.
CISC was created for compact code size and exact instruction
encoding, and is used only at the ISA level nowadays.
RISC leads to less complicated decoding, and its pipeline stages
allow boosting the clock within an affordable power budget.
VLIW targets power-efficient high-performance devices for
specific tasks or is used internally at the μarchitecture level.
Vector processors were transformed into SIMD extensions and
SIMT-like GPU designs.

16 . 2
SUMMARY
Load-store architecture, with its simple fixed-width
instructions, few addressing modes, concise ISA
and optimally sized register set, is the winning solution.
An architecture can expose different properties
at its different levels (ISA, μarchitecture).

17
THE END
MARINA KOLPAKOVA / 2015-2016
