Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance

ECE 4100/6100
Advanced Computer Architecture
Lecture 3 Performance
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Performance
• Execution/Response time (Latency)
– Elapsed time between the start and completion of an event
– How long does my job take?
• Throughput (Bandwidth)
– Total amount of work done within a given period of time
– How many jobs are completed per unit time on a system?

CPU Performance
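• Execution Time = Seconds / Program
$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$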
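• Instructions / Program is influenced by
– Programmer
– Algorithms
– ISA
– Compilers
• Cycles / Instruction is influenced by
– Microarchitecture
– System architecture
• Seconds / Cycle is influenced by
– Microarchitecture, pipeline depth
– Circuit design
– Technology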
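The three-factor breakdown above can be checked with a few lines of Python. This is a minimal sketch; the function name `execution_time` and the sample workload (1e9 instructions, CPI of 1.5, 1 GHz clock) are illustrative assumptions, not values from the lecture.

```python
def execution_time(instructions, cpi, clock_hz):
    """Seconds/program = instructions/program * cycles/instruction * seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle


# Hypothetical workload (assumed numbers, for illustration only).
t = execution_time(instructions=1_000_000_000, cpi=1.5, clock_hz=1_000_000_000)
print(f"Execution time: {t:.2f} s")

# Halving CPI at the same instruction count and clock halves the run time.
t_half_cpi = execution_time(1_000_000_000, 0.75, 1_000_000_000)
print(f"With half the CPI: {t_half_cpi:.2f} s")
```

Any one factor can be traded against another, which is why a change that lowers CPI but also lowers the clock rate has to be evaluated on the full product.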

Pipeline Stage
[Figure: a pipeline stage drawn as combinational logic between two flip-flops (F/F), with a 1 FO4 delay marked]
• Optimal FO4 per pipe stage
– 6 to 8 [UT/Compaq, ISCA-29]
– 18 (15 + 3 latch) [IBM, MICRO-35]
• P4 pipe stage ~ 16 FO4
(Slide from Lecture 1, Pipelining)

Architecture Comparison
• Many architecture studies simply make the following assumptions
• Instructions / program is fixed
– Same binary
– Same compiler
– Same benchmark
• Seconds per cycle is constant
– Same frequency
– Same pipeline depth
– Typically a bad assumption today
• Focus on IPC or CPI
• It is more complicated for today’s architects!

Example: Calculating CPI
Base Machine (Reg / Reg), typical mix of instruction types in a program
Op      Freq  Cycles  CPI(i)  (% Time)
ALU     50%   1       0.5     (33%)
Load    20%   2       0.4     (27%)
Store   10%   2       0.2     (13%)
Branch  20%   2       0.4     (27%)
Total                 1.5
Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
Run a benchmark and collect workload characterization (simulate, machine counters, or sampling)
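The table's CPI(i) and % Time columns follow directly from the frequency and cycle columns. A minimal Python sketch, using the instruction mix from the table (the names `mix`, `average_cpi`, and `time_share` are illustrative):

```python
# Instruction mix from the table: op -> (frequency, cycles per instruction).
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """Average CPI is the frequency-weighted sum of per-class cycle counts."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """Fraction of total cycles spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


cpi = average_cpi(mix)
shares = time_share(mix)
print(f"Average CPI = {cpi:.2f}")
for op, share in shares.items():
    print(f"{op:6s} {share:.0%}")
```

Note that ALU operations are half of all instructions but only a third of the time, so the frequency column alone does not identify the common case in terms of time.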

Performance Comparison
• For some program running on machine X,
Performance_X = 1 / Execution time_X
• "X is n times faster than Y"
Performance_X / Performance_Y = n
= speedup of X over Y
• Problem:
– machine A runs a program in 20 seconds
– machine B runs the same program in 25 seconds
– how much faster is A than B?
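Applying the definition to the two machines in the problem gives the ratio directly. A small sketch (the function name `speedup` is an illustrative choice):

```python
def speedup(time_x, time_y):
    """Performance_X / Performance_Y, where performance = 1 / execution time."""
    perf_x = 1.0 / time_x
    perf_y = 1.0 / time_y
    return perf_x / perf_y


# Machine A: 20 seconds, machine B: 25 seconds (from the problem statement).
n = speedup(20.0, 25.0)
print(f"A is {n:.2f} times faster than B")
```

Because performance is the reciprocal of time, the ratio reduces to time_B / time_A = 25 / 20.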

Performance Evaluation: Benchmark
• (Real) Programs
– In the form of a collection of programs
– E.g., SPEC, Winstone, SYSMARK, 3D Winbench, EEMBC
• Kernels:
– Small key pieces of real programs
– E.g., Livermore Fortran Loops Kernels (LFK), Linpack
• Modified (or scripted)
– To focus on some particular aspects (e.g. remove I/O, focus on CPU)
• (Toy) Benchmarks
– Produce expected results
• Synthetic Benchmarks:
– Representative instruction mix
– E.g., Dhrystone, Whetstone
• Important for
– Architectural and microarchitectural design trade-offs
– Competitive analysis of real products

Performance Summary Measurement
• Average of total execution time
• This is the Arithmetic Mean (or Weighted Arithmetic Mean)
$$\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i \quad \text{or} \quad \frac{1}{n}\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$$

Performance Summary Measurement
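• Rate_i is a function of 1/Time_i
• Used to represent the average “rate”, such as instructions per cycle (IPC)
• This is the Harmonic Mean (or Weighted Harmonic Mean)
$$\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}} \quad \text{or} \quad \frac{n}{\sum_{i=1}^{n} \frac{\text{Weight}_i}{\text{Rate}_i}}$$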

Why Harmonic Mean?
• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30+90)/2 = 60 mph??
• Wrong!
• Average speed = total distance / total time
• (10+10)/(10/30 + 10/90) = 45 mph
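The driving example can be reproduced numerically. The sketch below is a minimal illustration with hand-written helpers (`arithmetic_mean`, `harmonic_mean`, and `average_speed` are illustrative names); it relies on the two legs covering equal distances, which is the case in which the unweighted harmonic mean is the right average.

```python
def arithmetic_mean(values):
    return sum(values) / len(values)


def harmonic_mean(rates):
    """n divided by the sum of reciprocals of the rates."""
    return len(rates) / sum(1.0 / r for r in rates)


def average_speed(legs):
    """Total distance / total time for a list of (miles, mph) legs."""
    total_distance = sum(miles for miles, _ in legs)
    total_time = sum(miles / mph for miles, mph in legs)
    return total_distance / total_time


legs = [(10.0, 30.0), (10.0, 90.0)]       # 10 miles at 30 mph, then 10 miles at 90 mph
speeds = [mph for _, mph in legs]

print(f"Arithmetic mean of speeds: {arithmetic_mean(speeds):.1f} mph")
print(f"Harmonic mean of speeds:   {harmonic_mean(speeds):.1f} mph")
print(f"Total distance / time:     {average_speed(legs):.1f} mph")
```

The harmonic mean matches total distance over total time; the arithmetic mean overstates the average because more time is spent on the slow leg.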

New Breed of Metrics
• Performance / Watt
– Performance achievable at the same cooling capacity
• Performance / Joule (Energy)
– Performance achievable over the lifetime of the same energy source (i.e., battery = energy)
– Equivalent to the reciprocal of the energy-delay product (ED product), since Performance = 1 / Delay

Amdahl’s Law (Law of Diminishing Returns)
• Make the common case faster
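• Speedup = Perf_new / Perf_old = T_old / T_new
$$\text{Speedup} = \frac{T_{old}}{T_{new}} = \frac{1}{(1 - f) + \frac{f}{P}}$$
• Performance improvement from using faster mode is limited by the fraction the faster mode can be applied.
[Figure: T_old is split into f and (1 - f); T_new is split into f / P and (1 - f)]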
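Amdahl's formula is easy to evaluate directly. The following is a minimal sketch; `amdahl_speedup` and `speedup_limit` are illustrative names, and f = 0.5 is an assumed sample fraction rather than a value from the slides.

```python
def amdahl_speedup(f, p):
    """Speedup = 1 / ((1 - f) + f / p) for fraction f sped up by factor p."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if p <= 0:
        raise ValueError("p must be positive")
    return 1.0 / ((1.0 - f) + f / p)


def speedup_limit(f):
    """Upper bound on speedup as p grows without limit: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


f = 0.5  # assumed fraction of time in the faster mode
for p in (2, 4, 8, 16, 32, 64):
    print(f"f={f}, P={p:2d}: speedup = {amdahl_speedup(f, p):.2f}")
print(f"Limit for f={f}: {speedup_limit(f):.2f}")
```

With half of the time in the faster mode, no value of P pushes the overall speedup past 2, which is the diminishing-returns behavior the law's name refers to.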

Amdahl’s Law Analogy
• Driving from Orlando to Atlanta
– 60 miles/hr from Orlando to Macon
– 120 miles/hr from Macon to Atlanta
– How much time can you save compared with driving all the way at 60 miles/hr from Orlando to Atlanta?
• 6hr 45min vs. 7hr 30min = ~11% speedup
• The key is to speed up the biggest portion, i.e., speed up frequently executed blocks

Parallelism vs. Speedup
[Figure: Amdahl's Law speedup as a function of parallelism. x-axis: code portion in faster mode (f), from 0 to 1; y-axis: speedup, log scale from 1 to 100; one curve each for P = 1, 2, 4, 8, 16, 32, 64; annotated points at 1.11x, 1.33x, and 1.97x]

Gustafson’s Law
• Amdahl’s Law killed massively parallel processing (MPP)
• Gustafson came to the rescue
[Figure: T_new is split into Seq and Parallel; T_old is split into Seq and P * Parallel, along a time axis]
Assume: Seq + Parallel = 1 (T_new)
∴ Speedup = Seq + p * (1 - Seq), where p = parallel factor
If Seq diminishes with increased problem size, Speedup → p

Amdahl versus Gustafson
Who is right?
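One way to explore the question is to evaluate both formulas side by side. The sketch below is illustrative only: the function names and the sample values (Seq = 0.1, p = 64) are assumptions. The two fractions are also measured against different baselines (Amdahl's f on the old run, Gustafson's Seq on the new, scaled run), so the numbers answer different questions rather than contradict each other.

```python
def amdahl_speedup(f, p):
    """Fixed problem size: 1 / ((1 - f) + f / p)."""
    return 1.0 / ((1.0 - f) + f / p)


def gustafson_speedup(seq, p):
    """Problem size scaled with p: Seq + p * (1 - Seq)."""
    return seq + p * (1.0 - seq)


p = 64     # assumed parallel factor
seq = 0.1  # assumed sequential fraction

fixed = amdahl_speedup(1.0 - seq, p)
scaled = gustafson_speedup(seq, p)
print(f"Amdahl (fixed size):     {fixed:.2f}x")
print(f"Gustafson (scaled size): {scaled:.2f}x")

# As Seq shrinks with larger problems, the scaled speedup approaches p.
for s in (0.1, 0.01, 0.001):
    print(f"Seq={s}: scaled speedup = {gustafson_speedup(s, p):.2f}")
```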

The Principle of Locality
• Knuth made the original observation about program locality
in 1971.
– … less than 4 percent of a program generally accounts for
more than half of its running time.
• 90/10 rule: a program spends 90% of its execution time in
only 10% of the code
• Two types of locality
– Temporal locality (locality in time)
– Spatial locality (locality in space)
• Memory subsystem design heavily leverages the locality
concept for better performance

Example of Performance Evaluation (I)
Operation          Frequency  Clock cycle count
ALU Ops (reg-reg)  43%        1
Loads              21%        2
Stores             12%        2
Branches           24%        2
Assume 25% of the ALU ops directly use a loaded operand that is not used again.
We propose adding ALU instructions that have one source operand in memory.
These new reg-mem instructions take 2 clock cycles. Also assume that the
extended instruction set increases the branch cycle count by 1 but has no impact on cycle time.
Would this change improve performance?

Example of Performance Evaluation (I)
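$$\text{Cycles}_{new} = 0.25 \times 0.43 \times 2 + (0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3 = 1.703$$
$$\text{Cycles}_{old} = 0.43 \times 1 + 0.21 \times 2 + 0.12 \times 2 + 0.24 \times 2 = 1.57$$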
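The two cycle totals can be recomputed from the instruction mix. This is a minimal sketch with illustrative variable names; both totals are cycles per original instruction, so they can be compared directly because the cycle time is stated to be unchanged.

```python
# Instruction mix from the problem statement (fractions of the original program).
ALU, LOAD, STORE, BRANCH = 0.43, 0.21, 0.12, 0.24

# Original machine: ALU ops take 1 cycle, everything else takes 2.
old_cycles = ALU * 1 + LOAD * 2 + STORE * 2 + BRANCH * 2

# 25% of ALU ops absorb a load and become 2-cycle reg-mem instructions.
fused = 0.25 * ALU
new_cycles = (
    fused * 2              # new reg-mem ALU instructions
    + (ALU - fused) * 1    # remaining reg-reg ALU instructions
    + (LOAD - fused) * 2   # loads that were not absorbed
    + STORE * 2
    + BRANCH * 3           # branches now take one extra cycle
)

print(f"old cycles = {old_cycles:.3f}")
print(f"new cycles = {new_cycles:.3f}")
print(f"old / new  = {old_cycles / new_cycles:.3f}")
```

Under these numbers the new total is higher than the old one, so the proposed change would appear to slow the machine down: the extra branch cycle costs more than the eliminated loads save.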

Example of Performance Evaluation (II)
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2
• Design Option 2: decrease the average CPI of all FP instructions to 2.5

Example of Performance Evaluation (II)
Original CPI = 0.25*4 + 1.33*(1-0.25) = 2.0
Option 1 CPI = 2.0 – 2%*(20-2) = 1.64
Option 2 CPI = 0.25*2.5 + 1.33*(1-0.25) = 1.625
Speedup of Option 1 = 2/1.64 = 1.2195
Speedup of Option 2 = 2/1.625 = 1.2308
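The same comparison can be scripted. The sketch below uses the stated parameters with illustrative names. It does not round the original CPI to 2.0 as the slide does, so the printed values differ from the slide in the third decimal place, while the ranking of the two options stays the same.

```python
FP_FRAC, FP_CPI = 0.25, 4.0        # FP instructions and their average CPI
OTHER_CPI = 1.33                   # average CPI of all other instructions
SQRT_FRAC, SQRT_CPI = 0.02, 20.0   # FPSQRT share and CPI

original_cpi = FP_FRAC * FP_CPI + (1.0 - FP_FRAC) * OTHER_CPI

# Option 1: FPSQRT CPI drops from 20 to 2; only that 2% of instructions changes.
option1_cpi = original_cpi - SQRT_FRAC * (SQRT_CPI - 2.0)

# Option 2: average CPI of all FP instructions drops to 2.5.
option2_cpi = FP_FRAC * 2.5 + (1.0 - FP_FRAC) * OTHER_CPI

speedup1 = original_cpi / option1_cpi
speedup2 = original_cpi / option2_cpi
print(f"Original CPI = {original_cpi:.4f}")
print(f"Option 1: CPI = {option1_cpi:.4f}, speedup = {speedup1:.4f}")
print(f"Option 2: CPI = {option2_cpi:.4f}, speedup = {speedup2:.4f}")
```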

Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
• Design Option 2: decrease the average CPI of all FP instructions to 2.5,
clock freq = 1.1 GHz

Example of Performance Evaluation (III)
Original CPI = 2.0, IPC = 1/2, Inst/Sec = ½*1.4G = 0.7G inst/s
Option 1 CPI = 1.64, IPC = 1/1.64, Inst/Sec = 1/1.64*1.2G = 0.73G inst/s
Option 2 CPI = 1.625, IPC = 1/1.625, Inst/Sec = 1/1.625*1.1G = 0.68G inst/s
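Once clock frequency differs between designs, CPI alone no longer ranks them; instructions per second does. A minimal sketch using the CPI and frequency values from the slide (the names `designs` and `instructions_per_second` are illustrative):

```python
def instructions_per_second(cpi, clock_hz):
    """IPC * clock frequency, with IPC = 1 / CPI."""
    return clock_hz / cpi


# (CPI, clock frequency in Hz) for each design, as given on the slide.
designs = {
    "Original": (2.0, 1.4e9),
    "Option 1": (1.64, 1.2e9),
    "Option 2": (1.625, 1.1e9),
}

rates = {name: instructions_per_second(cpi, hz) for name, (cpi, hz) in designs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate / 1e9:.2f} G inst/s")

best = max(rates, key=rates.get)
print(f"Highest throughput: {best}")
```

Option 2 has the lowest CPI but also the lowest clock rate, and it ends up slower than the original design, which illustrates why the constant-cycle-time assumption from the Architecture Comparison slide can mislead.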
