Should I care about CPU cache?
Kamil Witecki
Disclaimers

Key notes
- Relative latency
- Organization
- Profits?
What is average DRAM latency?
[Plot: latency (ns) vs. speed (MT/s), speeds from 0 to 2500 MT/s]
And end to end?
[Diagram: DRAM array (row 0 ... row N) with data, address, and control lines.
Activating a row (inactive -> active) adds latency; deactivating the previous
row (active -> inactive) adds more latency. The path runs through the DRAM
controller to the CPU core.]
End to end latency: 50-100ns
Is 50ns a lot?

What                 Duration
Reference            50ns
1 clock cycle@2MHz   50ns
1 clock cycle@1GHz   1ns
1 clock cycle@5GHz   0.2ns

Fun fact: 2MHz was the clock rate of the 8080 CPU.
Fun fact: 1 heartbeat vs. boiling 1 liter of water.
And how to counter that?
[Diagram: the same DRAM / controller / CPU core path as before, with a Cache
inserted between the DRAM controller and the CPU core.]
End to end latency: 50-100ns
DRAM <-> Cache latency: 50-100ns
Cache latency: 1 clock cycle
Memory hierarchy
L1   64KiB
L2   512KiB
L3   8MiB
RAM  64MiB
Associativity or how to map memory
Must be fast, die-size and power efficient.
[Diagram: eight lines (0-7) in level Ln+1 mapping onto four lines (0-3) in level Ln.]

Example: Direct-mapping
Selection: address mod 4
Consequences:
- simple - fast, die-size and power efficient
- good best case - optimal sequential traversal
- bad worst case - jumping every nth line thrashes a single slot

Example: 2-way set associative
Selection: address mod 2; then Least Recently Used picks the way
[Diagram: the four lines of Ln grouped into two sets (0, 1) of two ways each.]
Consequences:
- complexity grows with the number of ways
- 15% fewer cache misses
- avoids N-parallel stalls
Cache line
[Diagram: L1 divided into cache lines.]
Cache line: 32B
Cache line - alignment consequences
Cache line: 32B
Object size: 4B
Alignment: 2B  -> alignas(2) std::byte x[4];
Alignment: 4B  -> int32_t;
Alignment: 32B -> alignas(32) int32_t x[4];
Cache line - write consequences
Cache line: 32B
Object size: 4B
Alignment: 4B
[Diagram: a write to one 4B object invalidates the whole 32B cache line.]
Keeping caches hot
// data: vector of objects composed of:
// int32_t type, string name, data* parent,
// map<string, string> params
// task: find by type
Naive C-style!
// data: vector of objects composed of:
// int32_t type, string name, data* parent,
// map<string, string> params
struct data {
    int32_t type;
    string name;
    observer_ptr<data> parent;
    map<string, string> params;
};

// task: find by type (vector is sorted by type)
pair<size_t, size_t>
find_by_type(vector<data> const& x, key type) {
    auto r = equal_range(begin(x), end(x), type);
    return {r.first - begin(x), r.second - begin(x)};
}
Layout
sizeof(data);     // 64B
offsetof(type);   // 0B
offsetof(name);   // 8B
offsetof(parent); // 40B
offsetof(params); // 48B
alignof(data);    // 64B
// Cache line: 64B
[Diagram: each 64B cache line holds one whole object: Type #1, padding, Name #1, Parent #1, Params #1; the next line holds Type #2, padding, Name #2, Parent #2, Params #2.]
C++ style!
// data: vector of objects composed of:
// int32_t type, string name, data* parent,
// map<string, string> params
struct data {
    string name;
    observer_ptr<data> parent;
    map<string, string> params;
};

boost::flat_map<int32_t, data>::equal_range;
std::map<int32_t, data>::equal_range;
std::unordered_map<int32_t, data>::equal_range;
AMD Athlon(TM) II X3 440, 12GiB RAM DDR3-1333, clang 8.0.1-3
[Plot: time (ns) vs. number of objects (1e+04 to 1e+08) for flat-map-clang, map-clang, naive-clang, unordered-map-clang]
AMD Ryzen(TM) 1600X, 16GiB RAM DDR4-3200, clang 8.0.1-3
[Plot: time (ns) vs. number of objects (1e+04 to 1e+08) for flat-map-clang, map-clang, naive-clang, unordered-map-clang]
Will separate array do better?
struct arr {
    vector<int32_t> types;
    vector<entry> entries;
};

pair<size_t, size_t>
find_by_type(arr const& d, int32_t type) {
    auto b = begin(d.types);
    auto r = equal_range(b, end(d.types), type);
    return {r.first - b, r.second - b};
}
Layout
sizeof(type);  // 4B
alignof(data); // 4B
// Cache line: 64B
[Diagram: one 64B cache line holds sixteen 4B keys, Type #1 through Type #16.]
AMD Athlon(TM) II X3 440, 12GiB RAM DDR3-1333, clang 8.0.1-3
[Plot: time (ns) vs. number of objects (1e+04 to 1e+08) for flat-map-clang, map-clang, naive-clang, optimized-clang, unordered-map-clang]
AMD Ryzen(TM) 1600X, 16GiB RAM DDR4-3200, clang 8.0.1-3
[Plot: time (ns) vs. number of objects (1e+04 to 1e+08) for flat-map-clang, map-clang, naive-clang, optimized-clang, unordered-map-clang]
NOT BAD
Instructions (at 1.0e+07, 2.5e+07, 5.0e+07 objects)
[Bar chart: instruction counts per variant (flat-map, map, naive, optimized, unordered-map), axis 0-1500]
L1 uses & misses
[Bar chart: L1 uses and L1 misses per variant (flat-map, map, naive, optimized, unordered-map), axis 0-300]
LL miss rate
[Bar chart: L3 miss rate per variant (flat-map, map, naive, optimized, unordered-map); values 7.7, 8.6, 9.2, 15.3, 9.8 percent, axis 0-15]
Questions and Answers
Kamil.Witecki@nokia.com
Bibliography I
Micron Technology, Inc. Speed vs. Latency. White paper, Micron Technology, Inc.
David A. Patterson and John L. Hennessy. Computer Organization and Design, Fifth Edition: The Hardware/Software Interface. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edition, 2013.

Presented at Code Dive 2019.
