Smpant Transact09

A Case for Using Value Prediction to Improve Performance of Transactional Memory
Salil Pant
Advisor: Gregory Byrd

Transactional Memory
- “Transaction” abstraction for programmers
- Optimistic speculative execution of critical sections
- TM can make parallel programming easier
- Problem: serializing structures in programs
  - Queues/lists for memory and task management
  - Linked-list traversals for search
- TM fails to expose concurrency

- Conflicts in TM lead to overhead
  - Stalls, restarts
  - Longer transactions suffer most
  - Different conflict management schemes perform differently
- Alternative approaches need programmer intervention
  - Distributed, hierarchical queue approaches
- Scalability bottleneck
  - Amdahl’s law
- Our solution: data speculation on conflicting accesses
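To make the Amdahl's law bullet concrete, here is a minimal sketch in C of the standard speedup bound. The function name `amdahl_speedup` and the 0.9 parallel fraction used below are illustrative assumptions, not figures from the talk.

```c
#include <assert.h>

/* Amdahl's law: upper bound on speedup with n processors when a
 * fraction parallel_fraction of the execution can run in parallel.
 * The serial remainder (for example, a contended queue) is unaffected by n. */
double amdahl_speedup(double parallel_fraction, unsigned n) {
    assert(n >= 1);
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / (double)n);
}
```

With a hypothetical 90% parallel fraction, `amdahl_speedup(0.9, 16)` gives about 6.4, so even a small serialized portion caps the benefit of 16 processors.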

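Motivation for prediction
- Serializing data is updated in a predictable manner
  - Pointers: head, tail, etc.
  - Sizes: constant increments/decrements
- Most conflicts come from a few data structures in the program
- Unlock parallelism with value prediction
  - No change in the program
  - Like a new conflict management system

    #include <stdlib.h>

    /* Declarations assumed by the slide's snippet. */
    typedef struct elem { struct elem *next; } elem;
    static elem *head, *tail;
    static int queue_size;

    void enqueue(elem *newE) {
        newE->next = NULL;        /* new element becomes the tail */
        if (tail != NULL)
            tail->next = newE;
        else
            head = newE;          /* queue was empty */
        tail = newE;              /* tail pointer: a conflicting, predictable update */
        queue_size++;             /* constant increment */
    }

    void dequeue(void) {
        if (!head)
            return;
        elem *temp = head;
        head = head->next;        /* head pointer advances by one element */
        free(temp);
        queue_size--;             /* constant decrement */
        if (queue_size == 0) {
            head = NULL;
            tail = NULL;
        }
    }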

Design
- Base LogTM model
  - In-memory logging of old values during Xn stores
  - Eager conflict detection; timestamps for conflict resolution and deadlock detection
  - Commits are easy; aborts need memory + processor rollback
- Identification of predictable addresses
  - Add address to predictor on a Nack
  - Trap stores for added addresses and create stride
  - Predict on future conflicts
- Stride value predictor
  - Predictor indexed by load address
  - Predictor gets values from multiple processors
  - Predicts 32-bit loads
  - Memory-level and global
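The stride scheme on this slide can be sketched roughly as follows. This is a minimal software model in C, not the hardware design from the talk: the `stride_entry` type, the function names, and the rule that the k-th outstanding prediction is `last_value + k * stride` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predictor state for one predictable address. */
typedef struct {
    uint32_t addr;        /* address added to the predictor after a Nack */
    uint32_t last_value;  /* most recent trapped store value */
    uint32_t stride;      /* modular difference of the last two stores */
    unsigned num_stores;  /* trapped stores seen so far, saturating at 2 */
} stride_entry;

void stride_init(stride_entry *e, uint32_t addr) {
    e->addr = addr;
    e->last_value = 0;
    e->stride = 0;
    e->num_stores = 0;
}

/* Record a trapped transactional store to the tracked address. */
void stride_observe_store(stride_entry *e, uint32_t value) {
    if (e->num_stores > 0)
        e->stride = value - e->last_value;  /* wraps, so decrements work too */
    e->last_value = value;
    if (e->num_stores < 2)
        e->num_stores++;
}

/* Predict the value for the k-th conflicting load (k >= 1).
 * Returns false until two stores have been seen, since no stride exists yet. */
bool stride_predict(const stride_entry *e, unsigned k, uint32_t *out) {
    assert(k >= 1);
    if (e->num_stores < 2)
        return false;
    *out = e->last_value + e->stride * (uint32_t)k;
    return true;
}
```

Storing the stride as an unsigned 32-bit difference keeps the arithmetic well defined in C for both increments (such as `queue_size++`) and decrements.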

Structure of Predictor
- Conceptual design
- VP entry, 1 per VP address
  - List of store values, 2 in our design to create stride (RV1-3)
  - List of predicted values and CPUs (SV1-5, P1-5)
- Fortunately, at most 4 or 5 VP entries needed so far
- Simplifications
  - VP: address and timing
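One possible layout for a VP entry, written as a C struct, is sketched below. The field names, the `pred_limit` cap, and the choice to derive each new prediction from the previous one are assumptions for illustration; the slide specifies only two store values and a list of predicted values with their CPUs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VP_MAX_PREDS 5   /* the slide lists five slots: SV1-5 / P1-5 */

typedef struct {
    uint32_t value;  /* speculative value handed to a requester */
    int      cpu;    /* requester that received it */
} vp_pred;

typedef struct {
    uint32_t addr;                 /* the VP address this entry tracks */
    uint32_t stores[2];            /* last two trapped store values, oldest first */
    unsigned num_stores;
    vp_pred  preds[VP_MAX_PREDS];  /* outstanding predictions, in issue order */
    unsigned num_preds;
    unsigned pred_limit;           /* tunable cap (assumption), at most VP_MAX_PREDS */
} vp_entry;

void vp_entry_init(vp_entry *e, uint32_t addr, unsigned pred_limit) {
    assert(pred_limit <= VP_MAX_PREDS);
    e->addr = addr;
    e->num_stores = 0;
    e->num_preds = 0;
    e->pred_limit = pred_limit;
}

/* Keep only the two most recent store values. */
void vp_record_store(vp_entry *e, uint32_t value) {
    if (e->num_stores == 2) {
        e->stores[0] = e->stores[1];
        e->stores[1] = value;
    } else {
        e->stores[e->num_stores++] = value;
    }
}

/* Issue the next prediction to cpu, or refuse if there is no stride yet
 * or the per-address cap has been reached. */
bool vp_issue(vp_entry *e, int cpu, uint32_t *out) {
    if (e->num_stores < 2 || e->num_preds >= e->pred_limit)
        return false;
    uint32_t stride = e->stores[1] - e->stores[0];
    uint32_t base = e->num_preds ? e->preds[e->num_preds - 1].value
                                 : e->stores[1];
    e->preds[e->num_preds].value = base + stride;
    e->preds[e->num_preds].cpu = cpu;
    e->num_preds++;
    *out = base + stride;
    return true;
}
```

Recording the CPU next to each predicted value is what would let the predictor later tell each dependent transaction whether its value was right.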

Value Prediction
- Allow transactions to run in parallel with predicted values
- Need an extra buffer to hold predicted data
  - Cannot log predicted load value in memory
- Needs changes in the coherence protocol
  - More deadlock scenarios
  - Need messages to indicate prediction success/failure
- Validate predictions when owner commits
  - Check predictions at commit time
  - Do not commit until prediction verified
- Successful predictions increase concurrency/speedup
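The validation rule on this slide (hold the commit until every predicted load is verified against the owner's committed value) might look like the following sketch. The `pred_buffer` type, its capacity, and the function names are hypothetical; in the actual design this is done through coherence messages, not function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PRED_LOADS 4  /* hypothetical capacity of the extra buffer */

typedef enum { PRED_PENDING, PRED_VERIFIED, PRED_FAILED } pred_state;

typedef struct {
    uint32_t   addr;       /* address of the predicted load */
    uint32_t   predicted;  /* value the transaction ran with */
    pred_state state;
} pred_load;

/* Per-transaction buffer for predicted loads (kept out of the memory log). */
typedef struct {
    pred_load loads[MAX_PRED_LOADS];
    unsigned  num_loads;
} pred_buffer;

bool pred_buffer_add(pred_buffer *b, uint32_t addr, uint32_t predicted) {
    assert(b->num_loads <= MAX_PRED_LOADS);
    if (b->num_loads == MAX_PRED_LOADS)
        return false;  /* buffer full: caller must stall instead of predicting */
    b->loads[b->num_loads].addr = addr;
    b->loads[b->num_loads].predicted = predicted;
    b->loads[b->num_loads].state = PRED_PENDING;
    b->num_loads++;
    return true;
}

/* Called when the owner of addr commits and its final value is known. */
void pred_buffer_on_owner_commit(pred_buffer *b, uint32_t addr, uint32_t committed) {
    for (unsigned i = 0; i < b->num_loads; i++) {
        pred_load *p = &b->loads[i];
        if (p->addr == addr && p->state == PRED_PENDING)
            p->state = (p->predicted == committed) ? PRED_VERIFIED : PRED_FAILED;
    }
}

/* A transaction may commit only when no prediction is pending or failed. */
bool pred_buffer_can_commit(const pred_buffer *b) {
    for (unsigned i = 0; i < b->num_loads; i++)
        if (b->loads[i].state != PRED_VERIFIED)
            return false;
    return true;
}

/* Any failed prediction forces an abort and restart. */
bool pred_buffer_must_abort(const pred_buffer *b) {
    for (unsigned i = 0; i < b->num_loads; i++)
        if (b->loads[i].state == PRED_FAILED)
            return true;
    return false;
}
```

A transaction with pending predictions simply waits, which is why dependent transactions can deadlock if prediction chains form a cycle.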

Coherence Protocol Actions: LogTM model
[Figure: directory (M-1) with CPU 1, CPU 2, and CPU 3; messages shown: GetX, FGetX, Data, Nack]

Generating predictions
[Figure: directory (M-1, S-2, then S-2-3), value predictor, CPU 1, CPU 2, and CPU 3; messages shown: GetX, FGetX, Nack, Pred, Retry]

Successful predictions
[Figure: directory (M-1, S-2-3), value predictor, CPU 1, CPU 2, and CPU 3; messages shown: Retry, FGetX, Unblock, Nack, Result; resulting directory state: M-2, S-3]

Failed Predictions
[Figure: directory (M-1, S-2-3), value predictor, CPU 1, CPU 2, and CPU 3; messages shown: Retry, FGetX, Unblock, Result; resulting directory state: NP, S-3]

Concurrency vs. Aborts
- Value prediction creates dependent transactions
- A transaction cannot commit until its prediction is verified
- Multiple predictions per processor can lead to deadlocks

Implementation Issues
- How to know all dependent transactions?
  - Global predictor in our work
  - Pass dependents along with Nacks
- Limiting the number of predictions per address
- Abort flag
- Programmer-controlled value prediction
- VP can predict for all conflicting accesses

Evaluation
- Microbenchmarks
  - Loop-based, 2^10 Xn
  - Queue-based
    - Insert only
    - Random inserts and deletes
- Simulation platform
  - SIMICS in-order processors (1, 2, 4, 8, 16)
  - GEMS (Ruby) memory system
- Controlling number of predictions per address
- Radiosity and Raytrace benchmarks
  - Both contain a linked list for memory management
- STAMP suite of benchmarks
  - Labyrinth and Intruder benchmarks

16-Processor Results for Splash and STAMP benchmarks
[Figure: Splash benchmarks]

Results table: 2 predictions per address for VP-TM

Observations
- Value predictor increases concurrency for all benchmarks
- Factors affecting speedup
  - Nacks/stalls
  - Restarts
- 1 or 2 predictions per address provide the best performance in most cases

Rationale & Complexity
- VP adds complexity
  - Is the speedup enough to justify the cost?
- Does not degrade performance if not used
- Guaranteed speedup for all benchmarks?
- Tuning for performance
  - Controlling predictions, abort flag
- Will help TM adoption for multicore architectures

Conclusion
- Value prediction with TM shown to improve performance
  - Reduced conflicts
  - Increased concurrency
- Performance improvement comes with a modest hardware increase

Questions?

Related work
- TLS
  - Easier to predict values in TLS than in TM
  - Similar idea can be used
- Value forwarding
  - Broadcast system
  - Forward vector of dependents along with value
  - Needs extensive changes in the coherence protocol

Overall, TCC’s FPGA implementation adds 14% overhead in the control logic, and 29% in on chip memory as compared to a non-speculative incarnation of our cache.
