Techniques to Improve Scalability of Transactional Memory Systems Salil Pant Advisor: Dr. G. Byrd
Introduction Shared-memory parallel programs need synchronization. Lock-based synchronization uses atomic read-modify-write primitives, but locks bring problems. Solution: transactional memory. Speculative and optimistic; relieves the programmer of managing synchronization.
Issues with TM: scalability. Contributions: analysis of TM for scalability; value predictor; results; proposed work.
Conventional Synchronization Conservative, blocking, lock-based. Atomic read-modify-write primitives provide atomicity only for a single address. Sync variables are exposed to the programmer, who orchestrates synchronization. Granularity = (no. of shared R/W variables covered) / (no. of lock variables). High (>> 1) = coarse, low (~1) = fine. Fine granularity => more concurrency => better performance, as long as the program runs correctly.
Problems Fine granularity == lots of locks == hard to program/debug. Software: the mapping from locks to the shared variables they protect is by convention only; programmers opt for coarse-grain locks; deadlocks, livelocks, starvation, and other issues must be managed by the programmer; blocking sync is not good for fault tolerance. Hardware: basic test-and-set does not scale; software queue-based locks are too heavy for the common case.
Transactional Memory Proposed by Herlihy and Moss. The "transactional abstraction": critical sections become "transactions" with ACI properties. Optimistic, speculative execution of critical sections. Conflicting accesses (read-write, write-write, write-read) are detected and execution is rolled back. Can be implemented in hardware or software. Fine-grained locking vs. transactions:
Lock (X); Update(A); Unlock(X);  =>  Begin_transaction; Update(A); End_transaction;
Lock (Y); Update(B); Unlock(Y);  =>  Begin_transaction; Update(B); End_transaction;
Coarse-grained locking vs. transactions:
Lock (X); Lock (Y); Update(A,B); Unlock(Y); Unlock(X);  =>  Begin_transaction; Update(A,B); End_transaction;
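To make the comparison concrete, here is a minimal C sketch of the slide's pseudocode. begin_transaction()/end_transaction() are hypothetical HTM primitives (not a specific vendor API), and account_t is an illustrative type:

```c
#include <pthread.h>

typedef struct { long balance; } account_t;

void begin_transaction(void);   /* hypothetical HTM primitives */
void end_transaction(void);

pthread_mutex_t lock_x = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_y = PTHREAD_MUTEX_INITIALIZER;

/* Lock-based: the programmer must choose locks and acquire them in a
 * consistent order to avoid deadlock. */
void update_both_locked(account_t *a, account_t *b) {
    pthread_mutex_lock(&lock_x);
    pthread_mutex_lock(&lock_y);
    a->balance--;
    b->balance++;
    pthread_mutex_unlock(&lock_y);
    pthread_mutex_unlock(&lock_x);
}

/* Transactional: the critical section simply becomes a transaction;
 * the TM system detects conflicts and rolls back the losers. */
void update_both_transactional(account_t *a, account_t *b) {
    begin_transaction();
    a->balance--;
    b->balance++;
    end_transaction();
}
```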
Hardware-supported TM Special instructions indicate transactional accesses. A buffer is initialized to save transactional data, and state is checkpointed at the beginning of the transaction. Versions of transactional data are logged in a special write buffer or an in-memory log. Conflicts are detected/resolved via the coherence protocol, using "timestamps" (a local logical clock plus the cpu_id). A rollback mechanism restores state: hardware checkpoints of processor state (ROB-based).
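A minimal sketch of the timestamp described above, assuming a logical clock concatenated with the cpu_id as a tie-breaker; the field widths are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t logical_clock;  /* incremented locally when a transaction begins */
    uint16_t cpu_id;         /* tie-breaker: makes timestamps globally unique */
} timestamp_t;

/* True if a is older than b, i.e., a should win the conflict. */
bool ts_older(timestamp_t a, timestamp_t b) {
    if (a.logical_clock != b.logical_clock)
        return a.logical_clock < b.logical_clock;
    return a.cpu_id < b.cpu_id;
}
```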
Hardware TM: additions to the chip (TLR proposal)
Advantages Transfers the burden to the designer: deadlock, livelock, and starvation freedom, etc. Ease of programming: more transactions do not make programs harder. Performs better than locks in the common case: more concurrency, less overhead. Concurrency now depends on the size of transactions. Non-blocking advantages. Can be implemented in software or hardware; we focus mainly on hardware.
Issues with TM TM is an optimistic, speculative sync scheme; it works best under mild/medium contention. How does HTM deal with: large transaction sizes; system calls or I/O inside transactions; processes/threads getting de-scheduled; thread migration?
Scalability Issue Scalability of TM with an increasing number of processors: is optimistic execution still beneficial at 32 processors? Conflicts/aborts carry greater overhead than lock-based sync: memory + processor rollback, network overhead, and the serialized commit/abort needed to maintain atomicity. Transaction sizes are predicted to increase: support for I/O and system calls within transactions, and integration of TM with higher-level programming models.
Measuring scalability What are we looking for? Application vs. system scalability; TM overhead == conflicts. Measure speedup for up to 32-processor systems. "Tourmaline" simulator for TM: a simple TM system with a timing model for memory accesses; provides version management and conflict detection; timestamps for conflict resolution; conflicts always abort the "younger" transaction; no network model; a simple backoff was added (sketched below). Two SPLASH benchmarks were "transactified": Cholesky and Raytrace.
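A minimal sketch of the kind of "simple backoff" added after aborts, assuming a capped, randomized exponential delay; the constants are illustrative, not the simulator's actual parameters:

```c
#include <stdlib.h>

/* Randomized exponential backoff: the wait window doubles with each
 * consecutive abort, up to a cap. */
void backoff_after_abort(int consecutive_aborts) {
    int shift = consecutive_aborts < 10 ? consecutive_aborts : 10; /* cap the window */
    volatile int spin = rand() % (1 << shift);                     /* random delay */
    while (spin-- > 0)
        ;  /* busy-wait before retrying the transaction */
}
```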
Queue Micro-benchmark Queue micro-benchmark for TM: 2^10 insert/delete operations. A queue is an important structure used in the SPLASH benchmarks (see the sketch below).
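A minimal sketch of such a queue micro-benchmark: 2^10 alternating insert/delete operations, each wrapped in a transaction. The queue layout and the static node pool are illustrative; begin_transaction()/end_transaction() are the same hypothetical primitives as before:

```c
#include <stddef.h>

void begin_transaction(void);   /* hypothetical HTM primitives */
void end_transaction(void);

typedef struct node { struct node *next; int val; } node_t;
typedef struct { node_t *head, *tail; } queue_t;

void run_queue_benchmark(queue_t *q, node_t pool[1 << 10]) {
    for (int i = 0; i < (1 << 10); i++) {
        begin_transaction();
        if (i % 2 == 0) {                       /* insert at the tail */
            node_t *n = &pool[i];
            n->next = NULL;
            if (q->tail) q->tail->next = n; else q->head = n;
            q->tail = n;
        } else if (q->head) {                   /* delete from the head */
            q->head = q->head->next;
            if (!q->head) q->tail = NULL;
        }
        end_transaction();
    }
}
```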
Micro-benchmark Results
Benchmark Results
Observations Conflicts increase with increasing CPUs; TM overhead can lead to slowdown. The situation gets worse with increased transaction sizes. The effect on speedup might be worse with a network model in place. How can we make TM resilient to conflicts?
Value Predictor Idea TM performance degrades with conflicts. Certain data structures are hard to parallelize; for them, TM shows no performance difference over locks.
Serializing data/operations are predictable: pointers (head, tail, etc.) and sizes (constant increments/decrements). HTM already includes speculative hardware for buffering, plus checkpoint and rollback capability. We can still reap the benefits of TM: value prediction allows transactions to run in parallel with predicted values (see the sketch below). Such queues are used mainly for task/memory management purposes (Cholesky, Raytrace, Radiosity).
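A minimal sketch of why serializing values are stride-predictable: a queue's head pointer advances by a fixed stride (e.g., a fixed node size), so the next value can be guessed from the last two committed values:

```c
#include <stdint.h>

/* Stride prediction for a serializing address (e.g., a queue head pointer).
 * The stride is recovered from the two most recent committed values. */
uint64_t predict_next(uint64_t last_value, uint64_t prev_value) {
    int64_t stride = (int64_t)(last_value - prev_value);  /* e.g., the node size */
    return (uint64_t)(last_value + stride);  /* value handed to a waiting transaction */
}
```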
Implementation Stride-based, memory-level. Base LogTM model: in-memory logging of old values during Xn stores (sketched below); eager conflict detection, with timestamps for conflict resolution; a Nack-based coherence protocol with a deadlock detection mechanism. Commits are easy; aborts need memory + processor rollback. Nacks are used to trigger the value predictor.
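A minimal sketch of LogTM-style eager version management, written as a software undo log for illustration (the real mechanism lives in hardware and memory): before each transactional store the old value is appended to the log; commit discards the log, abort replays it in reverse.

```c
#include <stdint.h>

#define LOG_MAX 1024
typedef struct { uint64_t *addr; uint64_t old_value; } log_entry_t;
typedef struct { log_entry_t entries[LOG_MAX]; int top; } undo_log_t;

/* Call before a transactional store overwrites *addr. */
void log_old_value(undo_log_t *log, uint64_t *addr) {
    log->entries[log->top].addr = addr;
    log->entries[log->top].old_value = *addr;  /* save the pre-store value */
    log->top++;
}

/* Abort path: walk the log in reverse, restoring old values (memory rollback). */
void rollback(undo_log_t *log) {
    while (log->top > 0) {
        log_entry_t *e = &log->entries[--log->top];
        *e->addr = e->old_value;
    }
}

/* Commit path: nothing to copy back, just discard the log. */
void commit(undo_log_t *log) { log->top = 0; }
```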
Implementation Addresses are identified as predictable by the programmer/compiler, and the value predictor initializes an entry with the address. One VP entry per VP address holds an ordered list of real values (2 in our design), an ordered list of predicted values, and an ordered list of predicted CPUs (see the sketch below). Fortunately, at most 3 or 4 VP entries have been needed so far.
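A minimal sketch of one VP entry as described above; VP_REAL matches the slide (2 real values, enough to compute a stride), while the depth of the prediction lists is an assumption:

```c
#include <stdint.h>

#define VP_REAL 2   /* committed values kept, per the slide */
#define VP_PRED 4   /* outstanding predictions (assumed depth) */

typedef struct {
    uintptr_t addr;                  /* the predictable address */
    uint64_t  real_values[VP_REAL];  /* ordered list of committed (real) values */
    uint64_t  pred_values[VP_PRED];  /* ordered list of outstanding predictions */
    uint16_t  pred_cpus[VP_PRED];    /* ordered list of CPUs given each prediction */
    int       num_preds;             /* number of outstanding predictions */
} vp_entry_t;
```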
An extra buffer is needed to hold predicted data (only with LogTM: a predicted load value cannot be logged in memory). Predictions are checked at commit time; execution does not advance beyond commit until they are verified (sketched below). This needs changes in the coherence protocol and introduces more deadlock scenarios. Simplifications: address/VP entries, timing of the VP, always generate exclusive requests.
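A minimal sketch of the commit-time check, assuming the predicted value is simply compared against the real one once the producer's commit makes it visible; on a mismatch the consumer aborts and rolls back:

```c
#include <stdbool.h>
#include <stdint.h>

/* A transaction that consumed a prediction may not advance past commit
 * until this check passes; false means abort + rollback. */
bool verify_prediction(uint64_t predicted, const volatile uint64_t *addr) {
    return predicted == *addr;  /* real value, visible after the producer commits */
}
```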
Implementation [Protocol diagram, base LogTM model: directory state M-1; CPUs 1-3; messages: Data, GetX, FGetX, Nack.]
Implementation: generating predictions [Protocol diagram: directory state M-1 S-2; value predictor; CPUs 1-3; messages: GetX, FGetX, Pred, Retry, Nack.]
State after predictions [Protocol diagram: directory state M-1 S-2-3; value predictor; CPUs 1-3; messages: FGetX, Retry, Nack.]
Successful predictions [Protocol diagram: directory state M-1 S-2-3, result M-2 S-3; value predictor; CPUs 1-3; messages: FGetX, Retry, Unblock, Nack.]
Failed predictions [Protocol diagram: directory state M-1 S-2-3, result NP S-3; value predictor; CPUs 1-3; messages: FGetX, Retry, Unblock, Result.]
Evaluation Microbenchmarks: loop-based, 2^10 Xn. Shared counter: a simple counter incremented by a fixed value (sketched below). Queue-based: insert-only, and random inserts and deletes. Simulation platform: SIMICS in-order processors (1, 2, 4, 8, 16) with the GEMS (Ruby) memory system; a highly optimized LogTM model for experiments. Cholesky and Raytrace benchmarks: both contain a linked list for memory management; Cholesky could not be completely transactified.
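A minimal sketch of the shared-counter micro-benchmark, using the same hypothetical transaction primitives as before; the increment value of 4 is an illustrative choice:

```c
void begin_transaction(void);   /* hypothetical HTM primitives */
void end_transaction(void);

static long shared_counter = 0;

/* 2^10 loop iterations, each transaction incrementing the shared counter
 * by a fixed value. */
void run_counter_benchmark(void) {
    for (int i = 0; i < (1 << 10); i++) {
        begin_transaction();
        shared_counter += 4;  /* fixed increment */
        end_transaction();
    }
}
```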
Results
SPLASH Benchmarks Cholesky, Raytrace: adding directives to support value prediction
SPLASH benchmarks: TM parameters for 16 processors

Benchmark | Xn size | No. of Xn | %Stalls | Writes | %Aborts (LogTM / VP-TM)
Cholesky  | 13572   | 24466     | 40.2    | 2.2    | 30 / 18.8
Raytrace  | 7100    | 46958     | 32.3    | 3.9    | 20 / 13.4
Observations The value predictor can improve speedup without much overhead, with performance gains growing as the number of processors increases. Aborts increase as the number of processors increases. Is TM scalable? More benchmarks are needed.
Extending the value predictor: improving the simulation model; exploring other types of value predictors; expanding application scope; controlling aggressiveness; adding confidence mechanisms; reducing hardware complexity of the value predictor entry.
Proposed Ideas The value predictor is not general enough! We need to reduce conflicts: better backoff schemes; a centralized transaction scheduler with "intelligent" backoff times. Expose transactions to the directory: begin_Xn and end_Xn messages to the directory? Count the number of memory accesses in transactions and generate a backoff time based on the count (sketched below).
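A minimal sketch of the proposed count-based backoff, assuming the directory scales the restart delay by the aborted transaction's memory-access count; the linear scaling factor is an assumption:

```c
#include <stdint.h>

/* begin_Xn/end_Xn messages let the directory count the memory accesses of
 * the aborted transaction; longer transactions get a longer backoff, so the
 * restarted transaction is less likely to collide again with its aborter. */
uint32_t compute_backoff_cycles(uint32_t accesses_in_xn, uint32_t cycles_per_access) {
    return accesses_in_xn * cycles_per_access;
}
```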
Proposed Ideas contd. Why is this different from other scalability research? Recent work by Bobba shows that HTM design choices impact performance by almost 80%. Are different data/conflict management schemes needed for different applications? STM can help, but performance suffers. Can we have both lazy and eager version management? Is HTM on large systems a good idea?
Proposed Ideas contd. The effectiveness of Nacks/stalls decreases as the number of processors increases; we need a stalling mechanism without the overhead of deadlocks: stall transactions after restart, using timestamps to avoid starvation. We also need to understand hardware requirements via a Verilog model, since the proposals need hardware evaluation (value predictor, speculative buffer).
Experiments/Analysis Need better benchmarks: synchronization-intensive (SPECjbb, STAMP, Java Grande) with larger transactions. Test up to 64 processors. Simulations with SIMICS + GEMS.
Contribution Identified the scalability bottleneck of TM; a value predictor for certain applications. Proposal: extending the value predictor work; improved backoff schemes; transaction queuing/stalling; hardware evaluation*. END Questions?
Overall, TCC's FPGA implementation adds 14% overhead in the control logic and 29% in on-chip memory, as compared to a non-speculative incarnation of our cache.