ECE 4100/6100
Advanced Computer Architecture
Lecture 3 Performance
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Performance
• Execution/Response time (Latency)
– Elapsed time between the start and completion of an event
– How long does my job take?
• Throughput (Bandwidth)
– Total amount of work done within a given period of time
– How many jobs are done per unit time on a system?
CPU Performance
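• Execution Time = Seconds / Program
$$ \frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}} $$
• Instructions / program is influenced by
– Programmer
– Algorithms
– ISA
– Compilers
• Cycles / instruction is influenced by
– Microarchitecture
– System architecture
• Seconds / cycle is influenced by
– Microarchitecture, pipeline depth
– Circuit design
– Technology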
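The three-factor product above can be sketched in a few lines. The function name and the workload numbers below are hypothetical, chosen only to show how the factors combine.

```python
def execution_time(instructions, cpi, clock_hz):
    """Seconds per program = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle

# Hypothetical workload: 1 billion instructions, CPI of 1.5, 2 GHz clock.
baseline = execution_time(1e9, 1.5, 2e9)

# Halving CPI or doubling the clock frequency shortens the run by the same factor.
half_cpi = execution_time(1e9, 0.75, 2e9)
double_clock = execution_time(1e9, 1.5, 4e9)

print(f"baseline:     {baseline:.3f} s")
print(f"half CPI:     {half_cpi:.3f} s")
print(f"double clock: {double_clock:.3f} s")
```

Because the terms multiply, an improvement in one factor only helps if it does not degrade another, for example a deeper pipeline that raises frequency but also raises CPI.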
Pipeline Stage
[Figure: a pipeline stage with combinational logic between two flip-flops (F/F); a "1 FO4" delay is labeled]
• Optimal FO4 per pipe
– 6 to 8 [UT/Compaq, ISCA-29]
– 18 (15+3 latch) [IBM, MICRO-35]
• P4 pipe stage ≈ 16 FO4
Slide from Lecture 1 Pipelining
Architecture Comparison
• Much architecture research simply makes the following assumptions
• Instructions / program is fixed
– Same binary
– Same compiler
– Same benchmark
• Seconds per cycle is constant
– Same frequency
– Same pipeline depth
– Typically a bad assumption today
• Focus on IPC or CPI
• It is more complicated for today’s architects!
Example: Calculating CPI
Typical mix of instruction types in a program
Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total                    1.5
Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
Run benchmark and collect workload characterization
(simulate, machine counters, or sampling)
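The CPI table can be reproduced with a short script. This is a minimal sketch using the frequencies and cycle counts from the slide; the function and variable names are illustrative.

```python
# Instruction mix from the slide: op -> (frequency, cycles per instruction).
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

def cpi_breakdown(mix):
    """Return total CPI, each class's CPI contribution, and its share of time."""
    contrib = {op: freq * cycles for op, (freq, cycles) in mix.items()}
    total = sum(contrib.values())
    share = {op: c / total for op, c in contrib.items()}
    return total, contrib, share

total_cpi, contrib, share = cpi_breakdown(mix)
for op in mix:
    print(f"{op:<7} CPI(i)={contrib[op]:.2f}  time={share[op]:.0%}")
print(f"Total CPI = {total_cpi:.2f}")
```

The share-of-time column is what identifies the common case: ALU ops are half of the instructions but only about a third of the cycles.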
Performance Comparison
• For some program running on machine X,
Performance_X = 1 / Execution time_X
• "X is n times faster than Y"
Performance_X / Performance_Y = n
= speedup of X over Y
• Problem:
– machine A runs a program in 20 seconds
– machine B runs the same program in 25 seconds
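The slide leaves the question implicit. Assuming it asks how much faster machine A is than machine B, the definition above gives the following (the function name is illustrative):

```python
def speedup(time_x, time_y):
    """Speedup of X over Y: Performance_X / Performance_Y = time_Y / time_X."""
    return time_y / time_x

a_over_b = speedup(time_x=20.0, time_y=25.0)
print(f"A is {a_over_b:.2f} times as fast as B")
```

Note that the ratio is taken over execution times in the opposite order from the performance ratio, because performance is the reciprocal of time.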
Performance Evaluation: Benchmark
• (Real) Programs
– In the form of a collection of programs
– E.g., SPEC, Winstone, SYSMARK, 3D Winbench, EEMBC
• Kernels:
– Small key pieces of real programs
– E.g., Livermore Fortran Loops Kernels (LFK), Linpack
• Modified (or scripted)
– To focus on some particular aspects (e.g. remove I/O, focus on CPU)
• (Toy) Benchmarks
– Produce expected results
• Synthetic Benchmarks:
– Representative instruction mix
– E.g., Dhrystone, Whetstone
• Important for
– Architectural and microarchitectural design trade-off
– Competitive analysis of real products
Performance Summary Measurement
• Average of total execution time
• This is the Arithmetic Mean (or Weighted Arithmetic Mean, with the weights summing to 1)
$$ \frac{1}{n}\sum_{i=1}^{n} Time_i \quad \text{or} \quad \sum_{i=1}^{n} Weight_i \times Time_i $$
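A minimal sketch of both means follows. The execution times and weights are hypothetical, and the weighted version assumes the weights sum to 1.

```python
def arithmetic_mean(times):
    """Plain average of execution times."""
    return sum(times) / len(times)

def weighted_arithmetic_mean(times, weights):
    """Weighted average; assumes the weights sum to 1."""
    return sum(w * t for w, t in zip(weights, times))

# Hypothetical execution times (seconds) of three benchmark programs.
times = [2.0, 4.0, 12.0]
# Hypothetical weights: the first program is run twice as often as the others.
weights = [0.5, 0.25, 0.25]

am = arithmetic_mean(times)
wam = weighted_arithmetic_mean(times, weights)
print(f"arithmetic mean = {am}, weighted arithmetic mean = {wam}")
```

The weighted form lets a summary reflect how often each program actually runs, rather than treating every benchmark as equally important.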
Performance Summary Measurement
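• Rate_i is a function of 1/Time_i
• Used to represent the average “rate”, such as instructions per cycle (IPC)
• This is the Harmonic Mean (or Weighted Harmonic Mean, with the weights summing to 1)
$$ \frac{n}{\sum_{i=1}^{n} \frac{1}{Rate_i}} \quad \text{or} \quad \frac{1}{\sum_{i=1}^{n} \frac{Weight_i}{Rate_i}} $$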
Why Harmonic Mean?
• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30+90)/2 = 60 mph??
• Wrong!
• Average speed = total distance / total time
• (10+10)/(10/30 + 10/90) = 45 mph
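The driving example can be checked numerically. In this sketch the function names are illustrative, and the weighted version assumes the weights sum to 1.

```python
def harmonic_mean(rates):
    """n divided by the sum of reciprocal rates."""
    return len(rates) / sum(1.0 / r for r in rates)

def weighted_harmonic_mean(rates, weights):
    """Weighted form; assumes the weights sum to 1."""
    return 1.0 / sum(w / r for w, r in zip(weights, rates))

speeds_mph = [30.0, 90.0]
legs_miles = [10.0, 10.0]

# Direct computation: total distance / total time.
total_time_h = sum(d / s for d, s in zip(legs_miles, speeds_mph))
average_mph = sum(legs_miles) / total_time_h

print(f"total distance / total time = {average_mph:.1f} mph")
print(f"harmonic mean of speeds     = {harmonic_mean(speeds_mph):.1f} mph")
```

The two results agree because both legs cover the same distance. With unequal legs, the weighted harmonic mean with weights proportional to distance would be the matching summary.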
New Breed of Metrics
• Performance / Watt
– Performance achievable at the same cooling capacity
• Performance / Joule (Energy)
– Achievable performance over the lifetime of the same energy source (i.e., battery = energy)
– Equivalent to the reciprocal of the energy-delay product (ED product)
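The equivalence in the last bullet follows in one line if performance is taken as the reciprocal of execution time (delay). The symbols $E$ (energy in joules) and $T$ (delay in seconds) are introduced here only for illustration:

$$ \frac{\text{Performance}}{\text{Joule}} = \frac{1/T}{E} = \frac{1}{E \times T} $$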
Amdahl’s Law (Law of Diminishing Returns)
• Make the common case faster
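• Speedup
$$ \text{Speedup} = \frac{Perf_{new}}{Perf_{old}} = \frac{T_{old}}{T_{new}} = \frac{1}{(1 - f) + \frac{f}{P}} $$
• Performance improvement from using the faster mode is limited by the fraction of time the faster mode can be applied.
[Figure: T_old divided into (1 - f) and f; T_new divided into (1 - f) and f / P]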
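A short derivation of the formula, reading $f$ as the fraction of the original time that can use the faster mode and $P$ as the speedup of that mode (this reading is inferred from the figure labels):

$$ T_{new} = (1 - f)\,T_{old} + \frac{f}{P}\,T_{old} \quad\Longrightarrow\quad \frac{T_{old}}{T_{new}} = \frac{1}{(1 - f) + \frac{f}{P}} $$

Even as $P \to \infty$, the speedup is bounded by the part that is not accelerated:

$$ \lim_{P \to \infty} \frac{1}{(1 - f) + \frac{f}{P}} = \frac{1}{1 - f} $$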
Amdahl’s Law Analogy
• Driving from Orlando to Atlanta
– 60 miles/hr from Orlando to Macon
– 120 miles/hr from Macon to Atlanta
– How much time can you save compared with driving all the way at 60 miles/hr from Orlando to Atlanta?
• 6 hr 45 min vs. 7 hr 30 min = ~11% speedup
• The key is to speed up the biggest portion, i.e., speed up frequently executed blocks
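The analogy can be mapped onto Amdahl's Law using only the two trip times and the two speeds from the slide. The variable names are illustrative, and the accelerated fraction f is derived from those numbers rather than stated on the slide.

```python
t_old_h = 7.5    # 7 hr 30 min, whole trip at 60 miles/hr
t_new_h = 6.75   # 6 hr 45 min, last leg at 120 miles/hr
p = 120 / 60     # speedup of the faster mode

saved_h = t_old_h - t_new_h
measured_speedup = t_old_h / t_new_h

# The accelerated leg originally took saved / (1 - 1/P) hours.
leg_old_h = saved_h / (1 - 1 / p)
f = leg_old_h / t_old_h

amdahl = 1.0 / ((1.0 - f) + f / p)

print(f"time saved          = {saved_h * 60:.0f} min")
print(f"accelerated portion = {f:.0%} of the original trip")
print(f"speedup             = {measured_speedup:.3f} (Amdahl: {amdahl:.3f})")
```

Doubling the speed on a leg that is only a fifth of the trip yields about an 11% overall speedup, which is the point of the last bullet.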
Parallelism vs. Speedup
[Figure: Amdahl's Law speedup as a function of parallelism. X-axis: code portion in faster mode (f), from 0 to 1. Y-axis: speedup, log scale from 1 to 100. One curve each for P = 1, 2, 4, 8, 16, 32, 64. Annotated points: 1.11x, 1.33x, 1.97x]
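The curves in the figure can be regenerated as a table. This is a sketch with a hypothetical function name, and the sampled values of f are chosen here for illustration. Some entries round to the annotated values: f = 0.5 gives 1.33x at P = 2 and 1.97x at P = 64, and f = 0.1 gives 1.11x at P = 64, though the slide text does not say which points the annotations mark.

```python
def amdahl_speedup(f, p):
    """Speedup when a fraction f of the original time runs p times faster."""
    return 1.0 / ((1.0 - f) + f / p)

factors = [1, 2, 4, 8, 16, 32, 64]
fractions = [0.1, 0.5, 0.9, 0.99]

# table[f] holds the speedup for each P in `factors`, in order.
table = {f: [amdahl_speedup(f, p) for p in factors] for f in fractions}

print("f     " + "".join(f"P={p:<6}" for p in factors))
for f, row in table.items():
    print(f"{f:<6}" + "".join(f"{s:<8.2f}" for s in row))
```

Reading across a row shows the diminishing returns: for small f, adding parallelism barely moves the speedup.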
Gustafson’s Law
• Amdahl’s Law killed massively parallel processing (MPP)
• Gustafson came to the rescue
[Figure: T_new consists of Seq followed by Parallel; T_old consists of Seq followed by P × Parallel]
Assume: Seq + Parallel = 1 (T_new)
∴ Speedup = Seq + p × (1 − Seq), where p = parallel factor
If Seq diminishes with increased problem size, Speedup → p
Amdahl versus Gustafson
Who is right?
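One way to explore the question is to evaluate both laws with the same inputs. In this sketch the function names and the 10% sequential share are hypothetical. The two laws measure that share against different baselines: Amdahl against the original single-processor run, and Gustafson against the parallel run (T_new = 1).

```python
def amdahl_speedup(seq, p):
    """Fixed problem size: seq is the sequential fraction of the original run."""
    return 1.0 / (seq + (1.0 - seq) / p)

def gustafson_speedup(seq, p):
    """Scaled problem size: seq is the sequential fraction of the parallel run."""
    return seq + p * (1.0 - seq)

seq = 0.1  # hypothetical sequential share
for p in (4, 16, 64):
    print(f"p={p:<3} Amdahl={amdahl_speedup(seq, p):6.2f}  "
          f"Gustafson={gustafson_speedup(seq, p):6.2f}")
```

The two formulas answer different questions: Amdahl asks how much faster a fixed problem finishes, while Gustafson asks how much more work fits in the same time when the problem grows with the machine.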
The Principle of Locality
• Knuth made the original observation about program locality
in 1971.
– … less than 4 percent of a program generally accounts for
more than half of its running time.
• 90/10 rule: a program spends 90% of its execution time in
only 10% of the code
• Two types of locality
– Temporal locality (locality in time)
– Spatial locality (locality in space)
• Memory subsystem design heavily leverages the locality
concept for better performance
Example of Performance Evaluation (I)
Operation            Frequency   Clock cycle count
ALU ops (reg-reg)    43%         1
Loads                21%         2
Stores               12%         2
Branches             24%         2
Assume 25% of the ALU ops directly use a loaded operand that is not used again.
We propose adding ALU instructions that have one source operand in memory.
These new reg-mem instructions take 2 clock cycles. Also assume that the
extended instruction set increases the branch clock cycle count by 1, with no impact on cycle time.
Would this change improve performance?
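$$ Cycles_{new} = 0.25 \times 0.43 \times 2 + (0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3 = 1.703 $$
$$ Cycles_{old} = 0.43 \times 1 + 0.21 \times 2 + 0.12 \times 2 + 0.24 \times 2 = 1.57 $$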
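The two cycle counts can be recomputed as follows. This is a sketch with illustrative variable names, and both totals are normalized to the original instruction count, as in the equations above.

```python
# Instruction frequencies from the table.
alu, load, store, branch = 0.43, 0.21, 0.12, 0.24

# ALU ops that absorb their load and become 2-cycle reg-mem instructions.
fused = 0.25 * alu

cycles_old = alu * 1 + load * 2 + store * 2 + branch * 2

cycles_new = (
    fused * 2             # new reg-mem ALU instructions
    + (alu - fused) * 1   # remaining reg-reg ALU ops
    + (load - fused) * 2  # loads that were not folded away
    + store * 2
    + branch * 3          # branches now cost one extra cycle
)

print(f"old cycles = {cycles_old:.3f}")
print(f"new cycles = {cycles_new:.4f}")
print("improves performance" if cycles_new < cycles_old else "does not improve performance")
```

With the cycle time unchanged, total cycles determine execution time, so under these assumptions the extra branch cycle outweighs the loads that are eliminated.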
Example of Performance Evaluation (II)
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2
• Design Option 2: decrease the average CPI of all FP instructions to 2.5
Original CPI = 0.25*4 + 1.33*(1-0.25) ≈ 2.0
Option 1 CPI = 2.0 - 0.02*(20-2) = 1.64
Option 2 CPI = 0.25*2.5 + 1.33*(1-0.25) ≈ 1.625
Speedup of Option 1 = 2/1.64 = 1.2195
Speedup of Option 2 = 2/1.625 = 1.2308
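The comparison can be sketched in code. One assumption is made loudly here: the slide's 1.33 is treated as 4/3, which reproduces the rounded figures above. With 1.33 taken literally, the CPIs come out slightly lower (for example 1.9975 instead of 2.0).

```python
FP_FRAC, FP_CPI = 0.25, 4.0
OTHER_CPI = 4.0 / 3.0          # assumption: the slide's 1.33 read as 4/3
SQRT_FRAC, SQRT_CPI = 0.02, 20.0

cpi_original = FP_FRAC * FP_CPI + (1 - FP_FRAC) * OTHER_CPI

# Option 1: FPSQRT CPI drops from 20 to 2; only that 2% of instructions changes.
cpi_option1 = cpi_original - SQRT_FRAC * (SQRT_CPI - 2.0)

# Option 2: average CPI of all FP instructions drops from 4.0 to 2.5.
cpi_option2 = FP_FRAC * 2.5 + (1 - FP_FRAC) * OTHER_CPI

speedup1 = cpi_original / cpi_option1
speedup2 = cpi_original / cpi_option2

print(f"CPI: original={cpi_original:.3f} option1={cpi_option1:.3f} option2={cpi_option2:.3f}")
print(f"speedup: option1={speedup1:.4f} option2={speedup2:.4f}")
```

At equal clock frequency and instruction count, speedup is just the CPI ratio, so Option 2 comes out slightly ahead in this example.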
Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
• Design Option 2: decrease the average CPI of all FP instructions to 2.5, clock freq = 1.1 GHz
Original CPI = 2.0, IPC = 1/2, Inst/Sec = (1/2)*1.4G = 0.7G inst/s
Option 1 CPI = 1.64, IPC = 1/1.64, Inst/Sec = (1/1.64)*1.2G = 0.73G inst/s
Option 2 CPI = 1.625, IPC = 1/1.625, Inst/Sec = (1/1.625)*1.1G = 0.68G inst/s
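Once the clock frequency differs between designs, CPI alone no longer ranks them. A minimal sketch using the CPI values and frequencies from the slide (the names are illustrative):

```python
# design -> (CPI, clock frequency in Hz), values taken from the slide.
designs = {
    "original": (2.0, 1.4e9),
    "option 1": (1.64, 1.2e9),
    "option 2": (1.625, 1.1e9),
}

def instructions_per_second(cpi, clock_hz):
    """IPC x frequency, i.e. frequency divided by CPI."""
    return clock_hz / cpi

rates = {name: instructions_per_second(cpi, hz) for name, (cpi, hz) in designs.items()}
best = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate / 1e9:.2f} G inst/s")
print(f"best: {best}")
```

Option 2 has the lowest CPI but ends up slower than the original design, because its frequency loss is larger than its CPI gain.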

