Optimizing shared caches in chip multiprocessors

Core 2 Duo die
“Just a few years ago, the idea of putting multiple processors
on a chip was farfetched. Now it is accepted and
commonplace, and virtually every new high performance
processor is a chip multiprocessor of some sort…”
Center for Electronic System Design
Univ. of California Berkeley
Chip Multiprocessors??
“Mowry is working on the development
of single-chip multiprocessors: one large
chip capable of performing multiple
operations at once, using similar
techniques to maximize performance”
-- Technology Review, 1999
Sony's Playstation 3, 2006
CMP Caches: Design Space
• Architecture
– Placement of Cache/Processors
– Interconnects/Routing
• Cache Organization & Management
– Private/Shared/Hybrid
– Fully Hardware/OS Interface
“L2 is the last line of defense before hitting the
memory wall, and is the focus of our talk”
Private L2 Cache

[Diagram: each core (Proc) has L1 I$/D$ caches backed by its own private L2 bank; the banks communicate over an interconnect running a coherence protocol, with off-chip memory beyond it]

+ Less interconnect traffic
+ Insulates L2 units
+ Lower hit latency
– Duplication of blocks
– Load imbalance
– Complexity of coherence
– Higher miss rate
Shared-Interleaved L2 Cache

– More interconnect traffic
– Interference between cores
– Higher hit latency
+ No duplication
+ Balances the load
+ Lower miss rate
+ Simplicity of coherence

[Diagram: cores with L1 I$/D$ caches share an address-interleaved L2 across the interconnect; the coherence protocol is needed only at the L1 level]
Take Home Messages
• Leverage on-chip access time
• Better sharing of cache resources
• Isolating performance of processors
• Place data on the chip close to where it is used
• Minimize inter-processor misses (in shared cache)
• Fairness towards processors
On to some solutions…
Jichuan Chang and Gurindar S. Sohi
Cooperative Caching for Chip Multiprocessors
International Symposium on Computer Architecture, 2006.
Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki
Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches
International Symposium on Computer Architecture, 2009.
Shekhar Srikantaiah, Mahmut Kandemir, and Mary Jane Irwin
Adaptive Set-Pinning: Managing Shared Caches in Chip Multiprocessors
Architectural Support for Programming Languages and Operating Systems, 2008.
Each paper handles this problem in a different way.
Co-operative Caching
(Chang & Sohi)
• Private L2 caches
• Attract data locally to reduce remote on-chip accesses.
Lowers average on-chip access latency.
• Co-operation among the private caches for efficient
use of resources on the chip.
• Controlling the extent of co-operation to suit the
dynamic workload behavior.
CC Techniques
• Cache-to-cache transfer of clean data
– On a miss, transfer “clean” blocks from another L2 cache.
– Useful for “read-only” data such as instructions.
• Replication-aware data replacement
– Classify each block as a singlet (only on-chip copy) or a replicate.
– Evict a singlet only when no replicates exist to evict instead.
– Singlets can be “spilled” to other cache banks.
• Global replacement of inactive data
– Global management is needed to control spilling.
– N-Chance Forwarding: set a block's recirculation count to N when first spilled.
– Decrement the count each time it is spilled again; once it reaches 0, the block leaves the chip.
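The N-Chance Forwarding policy above can be sketched in a few lines. This is an illustrative model, not the paper's implementation: the names (`CacheBlock`, `on_eviction`, `N_MAX`) and the random choice of host bank are assumptions made for exposition.

```python
# Hedged sketch of N-chance forwarding for cooperative caching.
# A singlet gets N recirculation chances; each re-spill uses one up.
import random

N_MAX = 2  # chances given to a singlet on its first spill (illustrative)

class CacheBlock:
    def __init__(self, tag, is_singlet=True):
        self.tag = tag
        self.is_singlet = is_singlet   # is this the only on-chip copy?
        self.recirculation = None      # None until the block is first spilled

def on_eviction(block, peer_caches):
    """Decide what happens to a block evicted from a private L2 bank."""
    if not block.is_singlet:
        return "drop"                   # a replicate exists elsewhere on chip
    if block.recirculation is None:
        block.recirculation = N_MAX     # first spill: grant N chances
    else:
        block.recirculation -= 1        # spilled again: one chance used
        if block.recirculation <= 0:
            return "drop"               # out of chances; evict off chip
    target = random.choice(peer_caches) # pick a peer bank to host the block
    target.append(block)
    return "spilled to peer"
```

With N_MAX = 2, a singlet survives two spills before finally being evicted off chip, which bounds how long dead blocks can circulate.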
Set “Pinning” -- Setup

[Diagram: four processors P1–P4, each with an L1 cache, connected through an interconnect to a shared L2 cache of sets 0 through S-1, backed by main memory]
Set “Pinning” -- Problem

[Diagram: the same setup; multiple processors map blocks into the same L2 sets, causing inter-processor evictions before blocks can be reused]
Set “Pinning” -- Types of Cache Misses

Classic classification:
• Compulsory (aka Cold)
• Capacity
• Conflict
• Coherence

versus, for a shared cache:
• Compulsory
• Inter-processor
• Intra-processor
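The inter- versus intra-processor distinction can be made concrete by tracking which core last evicted each block. The sketch below is a hypothetical classifier for exposition; the names (`classify_miss`, `record_eviction`) are invented here.

```python
# Illustrative classifier for shared-cache misses: a miss is inter-processor
# if a different core evicted the block, intra-processor if the same core did,
# and compulsory on the first-ever reference to the block.

last_evictor = {}   # block address -> core id that last evicted it
seen = set()        # block addresses referenced at least once

def classify_miss(addr, core):
    if addr not in seen:
        seen.add(addr)
        return "compulsory"
    evictor = last_evictor.get(addr)
    return "intra-processor" if evictor == core else "inter-processor"

def record_eviction(addr, core):
    last_evictor[addr] = core
```

Inter-processor misses are the ones set pinning targets: one core's working set bouncing another core's blocks out of a shared set.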
[Diagram: processors P1–P4, each with a small processor-owned private (POP) cache region, alongside the shared L2 sets and main memory; each L2 set's tag entry is extended with an Owner field next to the other bits and the data]
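A minimal sketch of the set-pinning idea, under the assumption that the Owner field gates evictions: only the set's owner may evict from it, while other cores redirect the incoming block to their POP region. The structures (`SetState`, `place_block`, FIFO replacement, a capacity of 4) are simplifications invented for this example, not the paper's mechanism verbatim.

```python
# Hedged sketch of set pinning in a shared L2: the first core to miss on a set
# becomes its owner; non-owners never cause inter-processor evictions there.

CAPACITY = 4  # ways per set (illustrative)

class SetState:
    def __init__(self):
        self.owner = None     # core id recorded in the set's Owner field
        self.blocks = []      # resident block addresses

def place_block(set_state, addr, core, pop_caches):
    if set_state.owner is None:
        set_state.owner = core              # first miss pins the set
    if len(set_state.blocks) < CAPACITY:
        set_state.blocks.append(addr)
        return "placed in shared set"
    if core == set_state.owner:
        set_state.blocks.pop(0)             # owner may evict (FIFO for brevity)
        set_state.blocks.append(addr)
        return "owner evicted and placed"
    pop_caches[core].append(addr)           # non-owner: use own POP region
    return "placed in POP cache"
```

The payoff is isolation: a core streaming through a hot set cannot evict the owner's blocks, which converts would-be inter-processor misses into POP-cache traffic.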
R-NUCA: Use Class-Based Strategies
Solve for the common case!
Most current (and future) programs exhibit three types of accesses:
1. Instruction Access – Shared, but Read-Only
2. Private Data Access – Read-Write, but not Shared
3. Shared Data Access – Read-Write or Read-Only, but Shared
R-NUCA: Can do this online!
• We have information from the OS and TLB
• For each memory block, classify it as
– Instruction
– Private Data
– Shared Data
• Handle them differently
– Replicate instructions
– Keep private data locally
– Keep shared data globally
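The classification above can be sketched as a small function, assuming (as the slide says) that the OS/TLB can flag instruction fetches and track which cores have touched a page. The per-page sharer set and the function names are hypothetical simplifications of R-NUCA's page-granularity bookkeeping.

```python
# Sketch of R-NUCA-style online classification at page granularity.
# page_sharers maps a page address to the set of cores that accessed it.

def classify(page, is_instruction, core, page_sharers):
    """Classify an access's page as instruction, private data, or shared data."""
    if is_instruction:
        return "instruction"        # replicate near each requesting core
    page_sharers.setdefault(page, set()).add(core)
    if len(page_sharers[page]) == 1:
        return "private"            # keep in the requester's local slice
    return "shared"                 # interleave across slices chip-wide
```

Note the one-way transition: once a second core touches a data page, it is reclassified from private to shared and stays that way.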
R-NUCA: Reactive Clustering
• Assign clusters based on level of sharing
– Private data is given level-1 clusters (the local cache slice)
– Shared data is given level-16 clusters (16 neighboring cache slices), etc.
Clusters ≈ overlapping sets in set-associative mapping
• Within a cluster, “Rotational Interleaving”
– Load balancing to minimize contention on the bus and controller
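One way to see how cluster size unifies the placement policies is the toy mapping below. It is a simplification of R-NUCA's rotational interleaving, not the actual scheme: the hash on block-address bits, the contiguous cluster layout, and the function name are all assumptions for illustration.

```python
# Toy cluster-based placement: a block's home slice is chosen within the
# requesting core's cluster by interleaving on the block address, so load
# spreads evenly across the cluster's slices.

def home_slice(addr, core, cluster_size, num_slices):
    """Pick the home cache slice for a block within core's cluster."""
    base = (core // cluster_size) * cluster_size   # first slice of the cluster
    offset = (addr >> 6) % cluster_size            # interleave on 64B block address
    return (base + offset) % num_slices

# cluster_size = 1          -> private data: always the local slice
# cluster_size = num_slices -> shared data: interleaved across the whole chip
```

The appeal of this framing is that "keep private data locally" and "keep shared data globally" become the same mechanism with different cluster sizes.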
Future Directions
Area has been closed.
Just Kidding…
• Optimize for Power Consumption
• Assess trade-offs between more caches and more cores
• Minimize usage of OS, but still retain flexibility
• Application adaptation to allocated cache quotas
• Adding hardware-directed thread-level speculation
Questions?
THANK YOU!
Backup
• Commercial and research prototypes
– Sun MAJC
– Piranha
– IBM Power 4/5
– Stanford Hydra