Techniques to Improve Scalability of Transactional Memory Systems Salil Pant Advisor: Dr. G. Byrd
Introduction Shared-memory parallel programs need synchronization. Lock-based synchronization uses atomic read-modify-write primitives, but locks bring problems. Solution: transactional memory. Speculative and optimistic; relieves the programmer of managing synchronization.
Issues with TM: scalability. Contributions: analysis of TM for scalability; value predictor; results; proposed work.
Conventional Synchronization Conservative, blocking, lock-based. Atomic read-modify-write primitives provide atomicity only for a single address. Sync variables are exposed to the programmer, who orchestrates synchronization. Granularity = (no. of shared R/W variables covered) / (no. of lock variables). High (>> 1) = coarse, low (~1) = fine. Fine granularity => more concurrency => better performance, as long as the program runs correctly.
Problems Fine granularity == lots of locks == hard to program/debug. Software: the mapping from locks to the shared variables they protect is by convention only; programmers opt for coarse-grain locks; deadlocks, livelocks, starvation, and other issues must be managed by the programmer; blocking sync is not good for fault tolerance. Hardware: basic test-and-set does not scale; software queue-based locks are too heavy for the common case.
Transactional Memory Proposed by Herlihy and Moss. The "transactional abstraction": critical sections become "transactions" with ACI properties. Optimistic, speculative execution of critical sections. Conflicting accesses (read-write, write-write, write-read) are detected and execution is rolled back. Can be implemented in hardware or software. Fine-grained locking vs. transactions:
Lock (X); Update(A); Unlock(X);  =>  Begin_transaction; Update(A); End_transaction;
Lock (Y); Update(B); Unlock(Y);  =>  Begin_transaction; Update(B); End_transaction;
Coarse-grained locking vs. transactions:
Lock (X); Lock (Y); Update(A,B); Unlock(Y); Unlock(X);  =>  Begin_transaction; Update(A,B); End_transaction;
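To make the comparison concrete, here is a minimal C sketch of the slide's pseudocode. begin_transaction()/end_transaction() are hypothetical HTM primitives (not a specific vendor API), and account_t is an illustrative type:

```c
#include <pthread.h>

typedef struct { long balance; } account_t;

void begin_transaction(void);   /* hypothetical HTM primitives */
void end_transaction(void);

pthread_mutex_t lock_x = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_y = PTHREAD_MUTEX_INITIALIZER;

/* Lock-based: the programmer must choose locks and acquire them in a
 * consistent order to avoid deadlock. */
void update_both_locked(account_t *a, account_t *b) {
    pthread_mutex_lock(&lock_x);
    pthread_mutex_lock(&lock_y);
    a->balance--;
    b->balance++;
    pthread_mutex_unlock(&lock_y);
    pthread_mutex_unlock(&lock_x);
}

/* Transactional: the critical section simply becomes a transaction;
 * the TM system detects conflicts and rolls back the losers. */
void update_both_transactional(account_t *a, account_t *b) {
    begin_transaction();
    a->balance--;
    b->balance++;
    end_transaction();
}
```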
Hardware-supported TM Special instructions indicate transactional accesses. A buffer is initialized to save transactional data, and state is checkpointed at the beginning of the transaction. Versions of transactional data are logged in a special write buffer or an in-memory log. Conflicts are detected/resolved via the coherence protocol, using "timestamps" (a local logical clock plus the cpu_id). A rollback mechanism restores state: hardware checkpoints of processor state (ROB-based).
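A minimal sketch of the timestamp described above, assuming a logical clock concatenated with the cpu_id as a tie-breaker; the field widths are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t logical_clock;  /* incremented locally when a transaction begins */
    uint16_t cpu_id;         /* tie-breaker: makes timestamps globally unique */
} timestamp_t;

/* True if a is older than b, i.e., a should win the conflict. */
bool ts_older(timestamp_t a, timestamp_t b) {
    if (a.logical_clock != b.logical_clock)
        return a.logical_clock < b.logical_clock;
    return a.cpu_id < b.cpu_id;
}
```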
Hardware TM: additions to the chip (TLR proposal)
Advantages Transfers the burden to the designer: deadlock, livelock, and starvation freedom, etc. Ease of programming: more transactions do not make programs harder. Performs better than locks in the common case: more concurrency, less overhead. Concurrency now depends on the size of transactions. Non-blocking advantages. Can be implemented in software or hardware; we focus mainly on hardware.
Issues with TM TM is an optimistic, speculative sync scheme; it works best under mild/medium contention. How does HTM deal with: large transaction sizes; system calls or I/O inside transactions; processes/threads getting de-scheduled; thread migration?
Scalability Issue Scalability of TM with an increasing number of processors: is optimistic execution still beneficial at 32 processors? Conflicts/aborts carry greater overhead than lock-based sync: memory + processor rollback, network overhead, and the serialized commit/abort needed to maintain atomicity. Transaction sizes are predicted to increase: support for I/O and system calls within transactions, and integration of TM with higher-level programming models.
Measuring scalability What are we looking for? Application vs. system scalability; TM overhead == conflicts. Measure speedup for up to 32-processor systems. "Tourmaline" simulator for TM: a simple TM system with a timing model for memory accesses; provides version management and conflict detection; timestamps for conflict resolution; conflicts always abort the "younger" transaction; no network model; a simple backoff was added (sketched below). Two SPLASH benchmarks were "transactified": Cholesky and Raytrace.
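A minimal sketch of the kind of "simple backoff" added after aborts, assuming a capped, randomized exponential delay; the constants are illustrative, not the simulator's actual parameters:

```c
#include <stdlib.h>

/* Randomized exponential backoff: the wait window doubles with each
 * consecutive abort, up to a cap. */
void backoff_after_abort(int consecutive_aborts) {
    int shift = consecutive_aborts < 10 ? consecutive_aborts : 10; /* cap the window */
    volatile int spin = rand() % (1 << shift);                     /* random delay */
    while (spin-- > 0)
        ;  /* busy-wait before retrying the transaction */
}
```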
Queue Micro-benchmark Queue micro-benchmark for TM: 2^10 insert/delete operations. A queue is an important structure used in the SPLASH benchmarks (see the sketch below).
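A minimal sketch of such a queue micro-benchmark: 2^10 alternating insert/delete operations, each wrapped in a transaction. The queue layout and the static node pool are illustrative; begin_transaction()/end_transaction() are the same hypothetical primitives as before:

```c
#include <stddef.h>

void begin_transaction(void);   /* hypothetical HTM primitives */
void end_transaction(void);

typedef struct node { struct node *next; int val; } node_t;
typedef struct { node_t *head, *tail; } queue_t;

void run_queue_benchmark(queue_t *q, node_t pool[1 << 10]) {
    for (int i = 0; i < (1 << 10); i++) {
        begin_transaction();
        if (i % 2 == 0) {                       /* insert at the tail */
            node_t *n = &pool[i];
            n->next = NULL;
            if (q->tail) q->tail->next = n; else q->head = n;
            q->tail = n;
        } else if (q->head) {                   /* delete from the head */
            q->head = q->head->next;
            if (!q->head) q->tail = NULL;
        }
        end_transaction();
    }
}
```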
Micro-benchmark Results
Benchmark Results
Observations Conflicts increase with increasing CPUs; TM overhead can lead to slowdown. The situation gets worse with increased transaction sizes. The effect on speedup might be worse with a network model in place. How can we make TM resilient to conflicts?
Value Predictor Idea TM performance degrades with conflicts. Certain data structures are hard to parallelize; for them, TM shows no performance difference over locks.
Serializing data/operations are predictable: pointers (head, tail, etc.) and sizes (constant increments/decrements). HTM already includes speculative hardware for buffering, plus checkpoint and rollback capability. We can still reap the benefits of TM: value prediction allows transactions to run in parallel with predicted values (see the sketch below). Such queues are used mainly for task/memory management purposes (Cholesky, Raytrace, Radiosity).
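A minimal sketch of why serializing values are stride-predictable: a queue's head pointer advances by a fixed stride (e.g., a fixed node size), so the next value can be guessed from the last two committed values:

```c
#include <stdint.h>

/* Stride prediction for a serializing address (e.g., a queue head pointer).
 * The stride is recovered from the two most recent committed values. */
uint64_t predict_next(uint64_t last_value, uint64_t prev_value) {
    int64_t stride = (int64_t)(last_value - prev_value);  /* e.g., the node size */
    return (uint64_t)(last_value + stride);  /* value handed to a waiting transaction */
}
```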
Implementation Stride-based, memory-level. Base LogTM model: in-memory logging of old values during Xn stores (sketched below); eager conflict detection, with timestamps for conflict resolution; a Nack-based coherence protocol with a deadlock detection mechanism. Commits are easy; aborts need memory + processor rollback. Nacks are used to trigger the value predictor.
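A minimal sketch of LogTM-style eager version management, written as a software undo log for illustration (the real mechanism lives in hardware and memory): before each transactional store the old value is appended to the log; commit discards the log, abort replays it in reverse.

```c
#include <stdint.h>

#define LOG_MAX 1024
typedef struct { uint64_t *addr; uint64_t old_value; } log_entry_t;
typedef struct { log_entry_t entries[LOG_MAX]; int top; } undo_log_t;

/* Call before a transactional store overwrites *addr. */
void log_old_value(undo_log_t *log, uint64_t *addr) {
    log->entries[log->top].addr = addr;
    log->entries[log->top].old_value = *addr;  /* save the pre-store value */
    log->top++;
}

/* Abort path: walk the log in reverse, restoring old values (memory rollback). */
void rollback(undo_log_t *log) {
    while (log->top > 0) {
        log_entry_t *e = &log->entries[--log->top];
        *e->addr = e->old_value;
    }
}

/* Commit path: nothing to copy back, just discard the log. */
void commit(undo_log_t *log) { log->top = 0; }
```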
Implementation Addresses are identified as predictable by the programmer/compiler, and the value predictor initializes an entry with the address. One VP entry per VP address holds an ordered list of real values (2 in our design), an ordered list of predicted values, and an ordered list of predicted CPUs (see the sketch below). Fortunately, at most 3 or 4 VP entries have been needed so far.
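A minimal sketch of one VP entry as described above; VP_REAL matches the slide (2 real values, enough to compute a stride), while the depth of the prediction lists is an assumption:

```c
#include <stdint.h>

#define VP_REAL 2   /* committed values kept, per the slide */
#define VP_PRED 4   /* outstanding predictions (assumed depth) */

typedef struct {
    uintptr_t addr;                  /* the predictable address */
    uint64_t  real_values[VP_REAL];  /* ordered list of committed (real) values */
    uint64_t  pred_values[VP_PRED];  /* ordered list of outstanding predictions */
    uint16_t  pred_cpus[VP_PRED];    /* ordered list of CPUs given each prediction */
    int       num_preds;             /* number of outstanding predictions */
} vp_entry_t;
```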
An extra buffer is needed to hold predicted data (only with LogTM: a predicted load value cannot be logged in memory). Predictions are checked at commit time; execution does not advance beyond commit until they are verified (sketched below). This needs changes in the coherence protocol and introduces more deadlock scenarios. Simplifications: address/VP entries, timing of the VP, always generate exclusive requests.
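A minimal sketch of the commit-time check, assuming the predicted value is simply compared against the real one once the producer's commit makes it visible; on a mismatch the consumer aborts and rolls back:

```c
#include <stdbool.h>
#include <stdint.h>

/* A transaction that consumed a prediction may not advance past commit
 * until this check passes; false means abort + rollback. */
bool verify_prediction(uint64_t predicted, const volatile uint64_t *addr) {
    return predicted == *addr;  /* real value, visible after the producer commits */
}
```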
Implementation [Protocol diagram, base LogTM model: directory state M-1; CPUs 1-3; messages: Data, GetX, FGetX, Nack.]
Implementation: generating predictions [Protocol diagram: directory state M-1 S-2; value predictor; CPUs 1-3; messages: GetX, FGetX, Pred, Retry, Nack.]
State after predictions [Protocol diagram: directory state M-1 S-2-3; value predictor; CPUs 1-3; messages: FGetX, Retry, Nack.]
Successful predictions [Protocol diagram: directory state M-1 S-2-3, result M-2 S-3; value predictor; CPUs 1-3; messages: FGetX, Retry, Unblock, Nack.]
Failed predictions [Protocol diagram: directory state M-1 S-2-3, result NP S-3; value predictor; CPUs 1-3; messages: FGetX, Retry, Unblock, Result.]
Evaluation Microbenchmarks: loop-based, 2^10 Xn. Shared counter: a simple counter incremented by a fixed value (sketched below). Queue-based: insert-only, and random inserts and deletes. Simulation platform: SIMICS in-order processors (1, 2, 4, 8, 16) with the GEMS (Ruby) memory system; a highly optimized LogTM model for experiments. Cholesky and Raytrace benchmarks: both contain a linked list for memory management; Cholesky could not be completely transactified.
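A minimal sketch of the shared-counter micro-benchmark, using the same hypothetical transaction primitives as before; the increment value of 4 is an illustrative choice:

```c
void begin_transaction(void);   /* hypothetical HTM primitives */
void end_transaction(void);

static long shared_counter = 0;

/* 2^10 loop iterations, each transaction incrementing the shared counter
 * by a fixed value. */
void run_counter_benchmark(void) {
    for (int i = 0; i < (1 << 10); i++) {
        begin_transaction();
        shared_counter += 4;  /* fixed increment */
        end_transaction();
    }
}
```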
Results
SPLASH Benchmarks Cholesky, Raytrace: adding directives to support value prediction
SPLASH benchmarks: TM parameters for 16 processors

Benchmark | Xn size | No. of Xn | %Stalls | Writes | %Aborts (LogTM / VP-TM)
Cholesky  | 13572   | 24466     | 40.2    | 2.2    | 30 / 18.8
Raytrace  | 7100    | 46958     | 32.3    | 3.9    | 20 / 13.4
Observations The value predictor can improve speedup without much overhead, with performance gains growing as the number of processors increases. Aborts increase as the number of processors increases. Is TM scalable? More benchmarks are needed.
Extending the value predictor: improving the simulation model; exploring other types of value predictors; expanding application scope; controlling aggressiveness; adding confidence mechanisms; reducing hardware complexity of the value predictor entry.
Proposed Ideas The value predictor is not general enough! We need to reduce conflicts: better backoff schemes; a centralized transaction scheduler with "intelligent" backoff times. Expose transactions to the directory: begin_Xn and end_Xn messages to the directory? Count the number of memory accesses in transactions and generate a backoff time based on the count (sketched below).
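A minimal sketch of the proposed count-based backoff, assuming the directory scales the restart delay by the aborted transaction's memory-access count; the linear scaling factor is an assumption:

```c
#include <stdint.h>

/* begin_Xn/end_Xn messages let the directory count the memory accesses of
 * the aborted transaction; longer transactions get a longer backoff, so the
 * restarted transaction is less likely to collide again with its aborter. */
uint32_t compute_backoff_cycles(uint32_t accesses_in_xn, uint32_t cycles_per_access) {
    return accesses_in_xn * cycles_per_access;
}
```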
Proposed Ideas contd. Why is this different from other scalability research? Recent work by Bobba shows that HTM design choices impact performance by almost 80%. Are different data/conflict management schemes needed for different applications? STM can help, but performance suffers. Can we have both lazy and eager version management? Is HTM on large systems a good idea?
Proposed Ideas contd. The effectiveness of Nacks/stalls decreases as the number of processors increases; we need a stalling mechanism without the overhead of deadlocks: stall transactions after restart, using timestamps to avoid starvation. We also need to understand hardware requirements via a Verilog model, since the proposals need hardware evaluation (value predictor, speculative buffer).
Experiments/Analysis Need better benchmarks: synchronization-intensive (SPECjbb, STAMP, Java Grande) with larger transactions. Test up to 64 processors. Simulations with SIMICS + GEMS.
Contribution Identified the scalability bottleneck of TM; a value predictor for certain applications. Proposal: extending the value predictor work; improved backoff schemes; transaction queuing/stalling; hardware evaluation*. END Questions?
Overall, TCC's FPGA implementation adds 14% overhead in the control logic and 29% in on-chip memory, as compared to a non-speculative incarnation of our cache.