Adaptive Thread Scheduling Techniques for
Improving Scalability of Software
Transactional Memory
Kinson Chan, King Tin Lam, Cho-Li Wang
Presenter: Kinson Chan
Date: 16 February 2011
PDCN 2011, Innsbruck, Austria
DEPARTMENT OF COMPUTER SCIENCE
THE UNIVERSITY OF HONG KONG
Outline
• Motivation –
‣ hardware trend and software transactional memory
• Background –
‣ performance scalability
‣ ratio-based concurrency control and its myth
• Solution –
‣ our rate-based heuristic, Probe.
• Evaluation –
‣ performance comparison
2
Motivation
What is the current computing hardware trend,
and why is software transactional memory relevant?
Hardware trend: multicores
• Multicore processors
‣ a.k.a. chip multiprocessing
‣ multiple cores on a processor die
‣ cores share a common cache
‣ faster data sharing among threads
‣ more running threads per cabinet
• Chip Multithreading
‣ e.g. hyperthreading, coolthreads
‣ more than one thread per core
‣ hides the data load latency
4
[Figure: a typical modern processor with four cores (multiple cores), each having private L1 and L2 caches and two hardware threads (multi-thread per core, contexts 1 to 8), all sharing a common L3 cache]
Now and future multicores
5

Micro-architecture | Clock rate | Cores | Threads per core | Threads per package | Shared cache | Memory arrangement
IBM Power 7 | ~3 GHz | 4 ~ 8 | 4 | 32 max | 4 MB shared L3 | NUMA
Sun Niagara2 | 1.2 ~ 1.6 GHz | 4 ~ 8 | 8 | 64 max | 4 MB shared L2 | NUMA
Intel Westmere | ~2 GHz | 4 ~ 8 | 2 | 16 max | 12 ~ 24 MB shared L3 | NUMA
Intel Harpertown | ~3 GHz | 2 x 2 | 2 | 8 | 2 x 6 MB shared L3 | UMA
AMD Bulldozer | ~2 GHz | 2 x 6 ~ 2 x 8 | 1 | 16 max | 8 MB shared L3 | NUMA
AMD Magny-Cours | ~3 GHz | 8 modules | 2 per module | 16 max | 8 MB shared L3 | NUMA
Intel Terascale | ~4 GHz | 80? | 1? | 80? | 80 x 2 KB dist. cache | NUCA
How can we scale our programs to use this many threads?
Multi-threading and synchronization
6
• Coarse-grain locking
‣ easy / correct (few locks, predictable)
‣ difficult to scale (excessive mutual exclusion)
• Fine-grain locking
‣ error prone (deadlock, forgetting to lock, ...)
‣ scales better (allows more parallelism)
• Do we have anything in between that is easy / correct and scales well? (the two locking extremes are illustrated in the sketch below)
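For concreteness, here is a tiny illustration (our own, not from the slides) of the two locking extremes on a pair of shared counters; the function and lock names are made up for the example.

```c
#include <pthread.h>

static long x = 10, y = 10;

/* Coarse-grain: one lock guards everything. Easy to get right, but it
 * serializes even updates that touch unrelated data. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

void transfer_coarse(long amount) {
    pthread_mutex_lock(&big_lock);
    x += amount;
    y -= amount;
    pthread_mutex_unlock(&big_lock);
}

/* Fine-grain: one lock per variable. Allows more parallelism, but the
 * programmer must order acquisitions consistently to avoid deadlock. */
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t y_lock = PTHREAD_MUTEX_INITIALIZER;

void transfer_fine(long amount) {
    pthread_mutex_lock(&x_lock);   /* always lock x before y */
    pthread_mutex_lock(&y_lock);
    x += amount;
    y -= amount;
    pthread_mutex_unlock(&y_lock);
    pthread_mutex_unlock(&x_lock);
}
```

Transactional memory aims at the coarse-grain level of effort with something closer to fine-grain scalability.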
STM optimistic execution
7
Each thread wraps its updates in a transaction:
begin; x=x+4; y=y-4; commit;
begin; x=x+2; y=y-2; commit;
begin; w=w+5; z=w; commit;
[Timeline diagrams: threads 1 to 3 begin, proceed and commit optimistically, with conflict detection at commit time]
• Case 1: two transactions conflict (e.g. both update x and y): roll back and retry one of them.
• Case 2: two transactions do not conflict (e.g. one updates x and y, the other w and z): they execute together, achieving better parallelism.
A runnable sketch of this programming model follows the slide.
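To make the begin/commit style above concrete, here is a minimal runnable sketch. It uses GCC's transactional memory extension (built with gcc -fgnu-tm -pthread), which is our choice for a self-contained example; the paper itself builds on TinySTM, whose API is not shown here.

```c
/* Minimal sketch of optimistic transactions (assumption: GCC's -fgnu-tm
 * extension, not the TinySTM API used in the paper). */
#include <pthread.h>
#include <stdio.h>

static long x = 10, y = 10, w = 0, z = 0;

static void *thread1(void *arg) {
    (void)arg;
    __transaction_atomic {       /* like: begin; x=x+4; y=y-4; commit; */
        x = x + 4;
        y = y - 4;
    }
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    __transaction_atomic {       /* conflicts with thread1 on x and y:
                                    one of the two is rolled back and retried */
        x = x + 2;
        y = y - 2;
    }
    return NULL;
}

static void *thread3(void *arg) {
    (void)arg;
    __transaction_atomic {       /* touches only w and z: runs in parallel */
        w = w + 5;
        z = w;
    }
    return NULL;
}

int main(void) {
    pthread_t t[3];
    pthread_create(&t[0], NULL, thread1, NULL);
    pthread_create(&t[1], NULL, thread2, NULL);
    pthread_create(&t[2], NULL, thread3, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    printf("x=%ld y=%ld w=%ld z=%ld\n", x, y, w, z);   /* x=16 y=4 w=5 z=5 */
    return 0;
}
```

Whatever interleaving occurs, the runtime detects conflicts and retries, so the final state is as if the three transactions had run one after another.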
STM is easy
• At the University of Texas at Austin, 237 students taking Operating Systems courses were instructed to program the same problem with coarse locks, fine-grained locks, monitors and transactions...
8
[Charts: development time (short to long), code complexity (simple to complex) and errors (less to more), compared across coarse locks, fine-grained locks and TM]
C. J. Rossbach, O. S. Hofmann and E. Witchel. Is transactional programming actually easier? In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 45–56, 2010.
STM is a research toy
9
Research prototypes: STM, SXM, OSTM, DSTM, ASTM, TL2, TinySTM, TLRW, SwissTM, NOrec, TML, RingTM, InvalTM, DeuceTM, D2STM
... not just a research toy:

Company | Products | Research
Sun | Dynamic STM library | DSTM, TL2, TLRW, Rock processor, ...
Intel | Intel C++ STM compiler | McRT-STM, ...
IBM | C/C++ for TM on AIX | STM extension on X10, ...
Microsoft | STM.NET | STM on Haskell, ...
AMD | ASF instruction set extension |
Background
What affects the transactional memory performance?
How can we adjust concurrency for best performance?
Threads and performance
11
• More threads mean more transactional attempts (attempts grow with the thread count)
• More threads also mean a smaller portion of transactions commit (the commit probability drops)
• Performance ∝ attempts × commit probability, giving a concave curve over the thread count with an optimal point in between (a toy model follows)
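One illustrative way to see why the curve is concave (a toy model of our own, not taken from the paper): assume each concurrently running transaction conflicts with a given one independently with probability c.

```latex
% Toy model (assumption): pairwise conflict probability c, n active threads.
p_{\mathrm{commit}}(n) \approx (1-c)^{\,n-1},
\qquad
\mathrm{rate}(n) \;\propto\; n \, (1-c)^{\,n-1}.
% rate(n) rises for small n, peaks near n^{*} \approx 1 / \ln\tfrac{1}{1-c},
% and then falls as aborted work dominates: the concave curve above.
```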
Ratio- vs rate-based concurrency controls
12
• Concurrency control in STM:
‣ achieve optimal performance by scheduling means.
• Different concepts:
‣ commit ratio-based heuristics
✴ ratio = commits / (commits + aborts)
✴ reduce concurrency when ratio gets too low
✴ relax concurrency when ratio gets higher than a threshold
‣ commit rate-based heuristics
✴ rate = commits / time
‣ queuing after winner transactions
✴ kernel-level programming, conditional waiting...
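To make the two metrics concrete, here is a minimal sketch (all names are illustrative) of how a monitor could compute them over a time slice; the sample numbers reuse the deck's later comparison of 1 thread at a 100% ratio versus 4 threads at a 40% ratio.

```c
#include <stdio.h>

/* Per-time-slice counters kept by a concurrency control unit
 * (illustrative names, not the paper's exact ones). */
struct slice_stats {
    unsigned long commits;   /* transactions committed in this slice */
    unsigned long aborts;    /* transactions aborted in this slice   */
    double        seconds;   /* length of the time slice             */
};

/* Ratio-based view: fraction of attempts that commit (0..1). */
static double commit_ratio(const struct slice_stats *s) {
    unsigned long attempts = s->commits + s->aborts;
    return attempts ? (double)s->commits / (double)attempts : 1.0;
}

/* Rate-based view: useful work per unit time (commits per second). */
static double commit_rate(const struct slice_stats *s) {
    return s->seconds > 0.0 ? (double)s->commits / s->seconds : 0.0;
}

int main(void) {
    struct slice_stats one  = { .commits = 100, .aborts = 0,   .seconds = 1.0 };
    struct slice_stats four = { .commits = 160, .aborts = 240, .seconds = 1.0 };
    printf("1 thread : ratio %3.0f%%, rate %4.0f commits/s\n",
           100.0 * commit_ratio(&one),  commit_rate(&one));
    printf("4 threads: ratio %3.0f%%, rate %4.0f commits/s\n",
           100.0 * commit_ratio(&four), commit_rate(&four));
    return 0;   /* the 4-thread slice has the lower ratio but the higher rate */
}
```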
Ratio-based solutions
• Ansari et al.:
‣ introduce the total commit ratio (TCR)
‣ increase / decrease threads by comparing the TCR against a set-point (70%)
• Yoo and Lee:
‣ introduce a per-thread contention intensity (CI)
‣ (the likelihood of a thread encountering contention)
‣ a thread stalls to acquire a mutex when its CI goes above a value (70%)
• Dolev et al.:
‣ activate hotspot detection when the CI goes above a value (40%)
‣ a thread stalls to acquire a mutex when a hotspot is detected
(a sketch of the contention intensity idea follows)
13
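A short sketch of the per-thread contention intensity idea in the style of Yoo and Lee, as we read it; the smoothing constant, the 0.7 threshold and all names are illustrative (the slide only fixes the stall threshold at roughly 70%).

```c
/* Per-thread contention intensity (CI), kept as an exponential moving
 * average of recent aborts. Constants and names are our assumptions. */
typedef struct {
    double ci;              /* 0.0 = no contention, 1.0 = every attempt aborts */
} thread_cc_state;

#define CI_ALPHA     0.3    /* weight given to the newest sample (assumed)   */
#define CI_THRESHOLD 0.7    /* "stalls ... when CI goes above a value (70%)" */

static void ci_init(thread_cc_state *t) {
    t->ci = 0.0;
}

/* Called after every finished transaction attempt. */
static void ci_update(thread_cc_state *t, int aborted) {
    t->ci = CI_ALPHA * (aborted ? 1.0 : 0.0) + (1.0 - CI_ALPHA) * t->ci;
}

/* Before starting a new transaction: should this thread serialize itself
 * (e.g. queue on a global mutex) instead of running optimistically? */
static int ci_should_serialize(const thread_cc_state *t) {
    return t->ci > CI_THRESHOLD;
}
```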
14
Myths of ratio-based heuristics
• We want an application to finish faster
‣ i.e. more transactions committed per unit time
‣ (assumption: a constant number of transactions)
• A high commit ratio ≠ high performance
‣ 1 thread at a 100% ratio vs 4 threads at a 40% commit ratio
‣ engine rotation ≠ vehicle velocity
• Watching the commit ratio is an inexact science
‣ it happens to be a close estimate, though
• Drawbacks
‣ over-serialization when the commit ratio is low
‣ over-relaxation when the commit ratio is high
Challenges with a rate-based solution
15
• At any instant, we can only pick one thread count
‣ "what if" questions are not allowed at run time
• What is high and what is low?
‣ the commit ratio goes between 0% and 100%
‣ the commit rate depends on transaction lengths
• Changing patterns
‣ the nature of transactions may change along the execution:
‣ getting longer / shorter
‣ getting more / fewer contentions
• Pre-defined bounds are not acceptable
‣ the optimal spot changes across the execution timeline
Commit ratio vs Commit rate
16
[Plot: commit ratio (green line, %) and commit rate (red line) over the execution. Annotations from the slide builds: more threads results in better performance; excessive threads kills performance and yields fluctuations; the application's nature changes over time; the ratio can drop while the rate increases and the run time shortens; a low commit ratio can still be scalable]
Solution
Now we know ratio-based solutions are not right.
What shall we do for a rate-based alternative?
Counting variables
18
• commits: number of commits in a time-slice
• aborts: number of aborts in a time-slice
• quota: maximum concurrency
• entered: currently active transactions
• peak: peak concurrency in a time-slice
‣ commits per time-slice gives the commit rate; commits / (commits + aborts) gives the commit ratio
‣ quota ≤ entered: no more new transactions; stall newcomers with pthread_sched();
‣ quota ≥ peak: unused quota; reduce the quota for a tighter limit
(a sketch of this admission logic follows)
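Putting the counters together, here is a minimal sketch of the admission check a transactional thread could perform at begin and commit time. The atomics, the yielding loop and sched_yield() (where the slide writes pthread_sched()) are our own assumptions.

```c
#include <sched.h>
#include <stdatomic.h>

/* Shared counters of the concurrency control unit (names as on the slide;
 * the surrounding code is an illustrative sketch). */
static atomic_int  quota   = 8;   /* maximum concurrency                 */
static atomic_int  entered = 0;   /* currently active transactions       */
static atomic_int  peak    = 0;   /* peak concurrency in this time-slice */
static atomic_long commits = 0;   /* commits in this time-slice          */
static atomic_long aborts  = 0;   /* aborts in this time-slice           */

/* Called when a thread wants to begin a transaction. */
static void txn_admit(void) {
    for (;;) {
        int cur = atomic_load(&entered);
        if (cur < atomic_load(&quota) &&
            atomic_compare_exchange_weak(&entered, &cur, cur + 1))
            break;               /* admitted */
        sched_yield();           /* quota <= entered: stall the newcomer */
    }
    /* remember the peak concurrency seen in this time-slice */
    int now = atomic_load(&entered);
    int p   = atomic_load(&peak);
    while (now > p && !atomic_compare_exchange_weak(&peak, &p, now))
        ;
}

/* Called when a transaction finishes, committed or aborted. */
static void txn_leave(int committed) {
    atomic_fetch_add(committed ? &commits : &aborts, 1);
    atomic_fetch_sub(&entered, 1);
}
```

At the end of each time-slice the controller can read commits, aborts and peak, reset them, and decide on the next quota (quota ≥ peak means part of the quota went unused and it can be tightened).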
System architecture
19
[Architecture diagram: the user code runs transactional threads, each with its transaction (Txn) and an activity logger; the transactional memory system contains the conflict detection engine, the shared memory and the new concurrency control unit; the operating system underneath provides thread creation, thread scheduling, memory management and the input / output system]
• Transactions execute as normal, with conflict detection
• The concurrency control unit is added as a hook, monitoring the performance
• The scheduler is invoked to stall some new transactions, if appropriate
Art of uninformed climbing
20
[Figure: commit rate versus active threads, a concave curve. The controller walks the thread quota in one direction (say, left) while the commit rate keeps improving, reverses direction as soon as the rate decreases, and ends up probing back and forth inside a small region around the peak: the resultant probing region. A sketch of this climbing loop follows]
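At the end of each time-slice the controller compares the measured commit rate with the previous slice and keeps moving the quota in the same direction while the rate improves, reversing otherwise. Below is a simple sketch of that loop; the names, bounds and step size are our assumptions, not the paper's exact policy.

```c
/* End-of-time-slice hill climbing on the thread quota, in the spirit of
 * the Probe heuristic. All names and constants here are illustrative. */
struct probe_state {
    int    quota;       /* current maximum concurrency                */
    int    direction;   /* +1 = admit more threads, -1 = admit fewer  */
    double last_rate;   /* commit rate measured in the previous slice */
};

static void probe_init(struct probe_state *p, int max_threads) {
    p->quota     = max_threads;
    p->direction = -1;            /* start by probing downwards */
    p->last_rate = 0.0;
}

/* Called once per time-slice with the commit rate of that slice. */
static void probe_adjust(struct probe_state *p, double rate, int max_threads) {
    if (rate < p->last_rate)
        p->direction = -p->direction;    /* decreasing rate: change direction */

    p->quota += p->direction;
    if (p->quota < 1)           { p->quota = 1;           p->direction = +1; }
    if (p->quota > max_threads) { p->quota = max_threads; p->direction = -1; }

    p->last_rate = rate;
}
```

Because the curve is concave, this uninformed walk settles into a small probing region around the optimal thread count even though the controller never sees the whole curve.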
Performance Evaluation
How does our rate-based solution compare with the other, ratio-based ones?
Evaluation Platform
• Dell PowerEdge M610 Blade Server
‣ 2x Intel “Nehalem” Xeon E5540 2.53 GHz (8 cores, 16 threads)
‣ ECC DDR3-1066 36 GB main memory
• STAMP Benchmark
‣ original from Stanford University – http://stamp.stanford.edu/
‣ modified version for TinySTM: http://www.tinystm.org/
• TinySTM 0.9.5
‣ open-source version – http://www.tinystm.org/
• Yoo’s and Shrink concurrency control
‣ from EPFL Distributed Programming Laboratory –
http://lpd.epfl.ch/site/research/tmeval/
22
Probing in effect vs other heuristics
23
[Figure 3: Commit ratio, commit rate and number of stalled threads of some TM applications. Commit ratio is the green dotted line (%), commit rate the red solid line (transactions per second), stalled threads the blue dashed line; shown for kmeans-2 with 16 threads and yada with 8 threads under throttle2, probe2, basic (original TinySTM), yoo and shrink (dolev)]
• Original TinySTM: low commit ratio
• Probe (rate-based concurrency control): mild adjustments and a higher commit rate
• Yoo's and Dolev's (ratio-based concurrency control): aim for a high ratio, make large adjustments, yet end up with an even lower rate
Yoo’s Dolev’s Throttle Probe Probe2
yada
vacation-1
kmeans-2
vacation-2
kmeans-1
intruder
genome
ssca
labyrinth
average
+70.5% +61.02% +81.33% +101.79% +81.59%
+38.14% +18.39% +35.68% +54.90% +66.26%
-6.50% -19.56% -11.38% +2.52% +31.59%
+3.19% -19.04% +6.63% +2.42% +8.81%
-1.45% -28.96% -24.49% -1.38% +8.19%
-12.95% -28.03% -35.13% -14.65% +5.26%
+0.46% -11.80% -4.73% -7.55% -1.01%
+0.58% +0.29% -25.08% -26.51% -4.62%
-2.37% -1.34% -27.66% -12.22% -6.73%
+9.96% -3.23% -0.54% +11.03% +21.04%
Performance comparison
24
Yoo’s Dolev’s Throttle Probe Probe2
yada
vacation-1
kmeans-2
vacation-2
kmeans-1
intruder
genome
ssca
labyrinth
average
+70.5% +61.02% +81.33% +101.79% +81.59%
+38.14% +18.39% +35.68% +54.90% +66.26%
-6.50% -19.56% -11.38% +2.52% +31.59%
+3.19% -19.04% +6.63% +2.42% +8.81%
-1.45% -28.96% -24.49% -1.38% +8.19%
-12.95% -28.03% -35.13% -14.65% +5.26%
+0.46% -11.80% -4.73% -7.55% -1.01%
+0.58% +0.29% -25.08% -26.51% -4.62%
-2.37% -1.34% -27.66% -12.22% -6.73%
+9.96% -3.23% -0.54% +11.03% +21.04%
Performance comparison
24
Ratio-based Heuristics
Yoo’s Dolev’s Throttle Probe Probe2
yada
vacation-1
kmeans-2
vacation-2
kmeans-1
intruder
genome
ssca
labyrinth
average
+70.5% +61.02% +81.33% +101.79% +81.59%
+38.14% +18.39% +35.68% +54.90% +66.26%
-6.50% -19.56% -11.38% +2.52% +31.59%
+3.19% -19.04% +6.63% +2.42% +8.81%
-1.45% -28.96% -24.49% -1.38% +8.19%
-12.95% -28.03% -35.13% -14.65% +5.26%
+0.46% -11.80% -4.73% -7.55% -1.01%
+0.58% +0.29% -25.08% -26.51% -4.62%
-2.37% -1.34% -27.66% -12.22% -6.73%
+9.96% -3.23% -0.54% +11.03% +21.04%
Performance comparison
24
Ratio-based Heuristics
Naive: scales
for 25% ~ 75%
of ratio
Yoo’s Dolev’s Throttle Probe Probe2
yada
vacation-1
kmeans-2
vacation-2
kmeans-1
intruder
genome
ssca
labyrinth
average
+70.5% +61.02% +81.33% +101.79% +81.59%
+38.14% +18.39% +35.68% +54.90% +66.26%
-6.50% -19.56% -11.38% +2.52% +31.59%
+3.19% -19.04% +6.63% +2.42% +8.81%
-1.45% -28.96% -24.49% -1.38% +8.19%
-12.95% -28.03% -35.13% -14.65% +5.26%
+0.46% -11.80% -4.73% -7.55% -1.01%
+0.58% +0.29% -25.08% -26.51% -4.62%
-2.37% -1.34% -27.66% -12.22% -6.73%
+9.96% -3.23% -0.54% +11.03% +21.04%
Performance comparison
24
Rate-based Heuristics
Yoo’s Dolev’s Throttle Probe Probe2
yada
vacation-1
kmeans-2
vacation-2
kmeans-1
intruder
genome
ssca
labyrinth
average
+70.5% +61.02% +81.33% +101.79% +81.59%
+38.14% +18.39% +35.68% +54.90% +66.26%
-6.50% -19.56% -11.38% +2.52% +31.59%
+3.19% -19.04% +6.63% +2.42% +8.81%
-1.45% -28.96% -24.49% -1.38% +8.19%
-12.95% -28.03% -35.13% -14.65% +5.26%
+0.46% -11.80% -4.73% -7.55% -1.01%
+0.58% +0.29% -25.08% -26.51% -4.62%
-2.37% -1.34% -27.66% -12.22% -6.73%
+9.96% -3.23% -0.54% +11.03% +21.04%
Performance comparison
24
Rate-based Heuristics
Better
number counting
strategy
Yoo’s Dolev’s Throttle Probe Probe2
yada
vacation-1
kmeans-2
vacation-2
kmeans-1
intruder
genome
ssca
labyrinth
average
+70.5% +61.02% +81.33% +101.79% +81.59%
+38.14% +18.39% +35.68% +54.90% +66.26%
-6.50% -19.56% -11.38% +2.52% +31.59%
+3.19% -19.04% +6.63% +2.42% +8.81%
-1.45% -28.96% -24.49% -1.38% +8.19%
-12.95% -28.03% -35.13% -14.65% +5.26%
+0.46% -11.80% -4.73% -7.55% -1.01%
+0.58% +0.29% -25.08% -26.51% -4.62%
-2.37% -1.34% -27.66% -12.22% -6.73%
+9.96% -3.23% -0.54% +11.03% +21.04%
Performance comparison
24
Yoo’s Dolev’s Throttle Probe Probe2
yada
vacation-1
kmeans-2
vacation-2
kmeans-1
intruder
genome
ssca
labyrinth
average
+70.5% +61.02% +81.33% +101.79% +81.59%
+38.14% +18.39% +35.68% +54.90% +66.26%
-6.50% -19.56% -11.38% +2.52% +31.59%
+3.19% -19.04% +6.63% +2.42% +8.81%
-1.45% -28.96% -24.49% -1.38% +8.19%
-12.95% -28.03% -35.13% -14.65% +5.26%
+0.46% -11.80% -4.73% -7.55% -1.01%
+0.58% +0.29% -25.08% -26.51% -4.62%
-2.37% -1.34% -27.66% -12.22% -6.73%
+9.96% -3.23% -0.54% +11.03% +21.04%
Performance comparison
24
over-
relaxation
Yoo’s Dolev’s Throttle Probe Probe2
yada
vacation-1
kmeans-2
vacation-2
kmeans-1
intruder
genome
ssca
labyrinth
average
+70.5% +61.02% +81.33% +101.79% +81.59%
+38.14% +18.39% +35.68% +54.90% +66.26%
-6.50% -19.56% -11.38% +2.52% +31.59%
+3.19% -19.04% +6.63% +2.42% +8.81%
-1.45% -28.96% -24.49% -1.38% +8.19%
-12.95% -28.03% -35.13% -14.65% +5.26%
+0.46% -11.80% -4.73% -7.55% -1.01%
+0.58% +0.29% -25.08% -26.51% -4.62%
-2.37% -1.34% -27.66% -12.22% -6.73%
+9.96% -3.23% -0.54% +11.03% +21.04%
Performance comparison
24
over-
reaction
Yoo’s Dolev’s Throttle Probe Probe2
yada
vacation-1
kmeans-2
vacation-2
kmeans-1
intruder
genome
ssca
labyrinth
average
+70.5% +61.02% +81.33% +101.79% +81.59%
+38.14% +18.39% +35.68% +54.90% +66.26%
-6.50% -19.56% -11.38% +2.52% +31.59%
+3.19% -19.04% +6.63% +2.42% +8.81%
-1.45% -28.96% -24.49% -1.38% +8.19%
-12.95% -28.03% -35.13% -14.65% +5.26%
+0.46% -11.80% -4.73% -7.55% -1.01%
+0.58% +0.29% -25.08% -26.51% -4.62%
-2.37% -1.34% -27.66% -12.22% -6.73%
+9.96% -3.23% -0.54% +11.03% +21.04%
Performance comparison
24
Conclusions
• The multicore trend urges us to write parallel programs
• Software transactional memory is part of future computation
‣ easier to program, fewer errors, neater code
‣ but it needs concurrency control for the best performance
• Ratio-based vs rate-based concurrency heuristics
‣ ratio-based heuristics are inexact approximations
‣ watching the ratio alone causes over-reaction / over-relaxation
• Our rate-based concurrency heuristic, Probe, outperforms them
25
Adaptive Thread Scheduling Techniques for Improving Scalability of Software Transactional Memory
Thank You!
Contact me:
kchan@cs.hku.hk
HKU CS Systems Research Group:
http://www.srg.cs.hku.hk/
Dr. Cho-Li Wang’s Webpage:
http://www.cs.hku.hk/~clwang/
More Related Content

What's hot (6)

PPT
Hs java open_party
Open Party
 
PDF
Java threading
Chinh Ngo Nguyen
 
PPTX
Internet Programming with Java
kavitha muneeshwaran
 
PPTX
Multi threading
PavanAnudeepMotiki
 
PDF
Programming with Threads in Java
koji lin
 
PDF
Если нашлась одна ошибка — есть и другие. Один способ выявить «наследуемые» у...
Positive Hack Days
 
Hs java open_party
Open Party
 
Java threading
Chinh Ngo Nguyen
 
Internet Programming with Java
kavitha muneeshwaran
 
Multi threading
PavanAnudeepMotiki
 
Programming with Threads in Java
koji lin
 
Если нашлась одна ошибка — есть и другие. Один способ выявить «наследуемые» у...
Positive Hack Days
 

Similar to Adaptive Thread Scheduling Techniques for Improving Scalability of Software Transactional Memory (20)

PPT
Prelim Slides
smpant
 
PPTX
Transactional Memory
Smruti Sarangi
 
PPTX
[COSCUP 2022] 腳踏多條船-利用 Coroutine在 Software Transactional Memory上進行動態排程
littleuniverse24
 
PDF
High Performance Computer Architecture
Subhasis Dash
 
PPTX
CPU Caches
shinolajla
 
PPTX
Threads and multi threading
Antonio Cesarano
 
PDF
Designing for Concurrency
Susan Potter
 
PPTX
opt-mem-trx
Miguel Gamboa
 
ODP
Concept of thread
Munmun Das Bhowmik
 
PPTX
Computer system Architecture. This PPT is based on computer system
mohantysikun0
 
PPT
Introduction to symmetric multiprocessor
myjuni04
 
PPT
Parallel architecture
Mr SMAK
 
PDF
Synchronizing Parallel Tasks Using STM
IJERA Editor
 
PPT
Executing Multiple Thread on Modern Processor
NurHadisukmana3
 
PPTX
multithread in multiprocessor architecture
myjuni04
 
PPT
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Hsien-Hsin Sean Lee, Ph.D.
 
PDF
CPU Caches - Jamie Allen
jaxconf
 
PDF
Cpu Caches
shinolajla
 
DOC
Introduction to multi core
mukul bhardwaj
 
Prelim Slides
smpant
 
Transactional Memory
Smruti Sarangi
 
[COSCUP 2022] 腳踏多條船-利用 Coroutine在 Software Transactional Memory上進行動態排程
littleuniverse24
 
High Performance Computer Architecture
Subhasis Dash
 
CPU Caches
shinolajla
 
Threads and multi threading
Antonio Cesarano
 
Designing for Concurrency
Susan Potter
 
opt-mem-trx
Miguel Gamboa
 
Concept of thread
Munmun Das Bhowmik
 
Computer system Architecture. This PPT is based on computer system
mohantysikun0
 
Introduction to symmetric multiprocessor
myjuni04
 
Parallel architecture
Mr SMAK
 
Synchronizing Parallel Tasks Using STM
IJERA Editor
 
Executing Multiple Thread on Modern Processor
NurHadisukmana3
 
multithread in multiprocessor architecture
myjuni04
 
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Hsien-Hsin Sean Lee, Ph.D.
 
CPU Caches - Jamie Allen
jaxconf
 
Cpu Caches
shinolajla
 
Introduction to multi core
mukul bhardwaj
 
Ad

Recently uploaded (20)

PDF
intro_to_cpp_namespace_robotics_corner.pdf
MohamedSaied877003
 
PDF
Latest Capcut Pro 5.9.0 Crack Version For PC {Fully 2025
utfefguu
 
PPTX
Function & Procedure: Function Vs Procedure in PL/SQL
Shani Tiwari
 
PPTX
BB FlashBack Pro 5.61.0.4843 With Crack Free Download
cracked shares
 
PDF
Code and No-Code Journeys: The Maintenance Shortcut
Applitools
 
PDF
Optimizing Tiered Storage for Low-Latency Real-Time Analytics at AI Scale
Alluxio, Inc.
 
PDF
Dipole Tech Innovations – Global IT Solutions for Business Growth
dipoletechi3
 
PDF
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
PDF
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
PPTX
MiniTool Partition Wizard Crack 12.8 + Serial Key Download Latest [2025]
filmoracrack9001
 
PDF
Technical-Careers-Roadmap-in-Software-Market.pdf
Hussein Ali
 
PDF
IDM Crack with Internet Download Manager 6.42 Build 43 with Patch Latest 2025
bashirkhan333g
 
PPTX
Transforming Insights: How Generative AI is Revolutionizing Data Analytics
LetsAI Solutions
 
PDF
UITP Summit Meep Pitch may 2025 MaaS Rebooted
campoamor1
 
PDF
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
PDF
Simplify React app login with asgardeo-sdk
vaibhav289687
 
PPTX
From spreadsheets and delays to real-time control
SatishKumar2651
 
PPTX
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
PDF
MiniTool Power Data Recovery 8.8 With Crack New Latest 2025
bashirkhan333g
 
PDF
Everything you need to know about pricing & licensing Microsoft 365 Copilot f...
Q-Advise
 
intro_to_cpp_namespace_robotics_corner.pdf
MohamedSaied877003
 
Latest Capcut Pro 5.9.0 Crack Version For PC {Fully 2025
utfefguu
 
Function & Procedure: Function Vs Procedure in PL/SQL
Shani Tiwari
 
BB FlashBack Pro 5.61.0.4843 With Crack Free Download
cracked shares
 
Code and No-Code Journeys: The Maintenance Shortcut
Applitools
 
Optimizing Tiered Storage for Low-Latency Real-Time Analytics at AI Scale
Alluxio, Inc.
 
Dipole Tech Innovations – Global IT Solutions for Business Growth
dipoletechi3
 
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
MiniTool Partition Wizard Crack 12.8 + Serial Key Download Latest [2025]
filmoracrack9001
 
Technical-Careers-Roadmap-in-Software-Market.pdf
Hussein Ali
 
IDM Crack with Internet Download Manager 6.42 Build 43 with Patch Latest 2025
bashirkhan333g
 
Transforming Insights: How Generative AI is Revolutionizing Data Analytics
LetsAI Solutions
 
UITP Summit Meep Pitch may 2025 MaaS Rebooted
campoamor1
 
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
Simplify React app login with asgardeo-sdk
vaibhav289687
 
From spreadsheets and delays to real-time control
SatishKumar2651
 
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
MiniTool Power Data Recovery 8.8 With Crack New Latest 2025
bashirkhan333g
 
Everything you need to know about pricing & licensing Microsoft 365 Copilot f...
Q-Advise
 
Ad

Adaptive Thread Scheduling Techniques for Improving Scalability of Software Transactional Memory

  • 1. Adaptive Thread Scheduling Techniques for Improving Scalability of Software Transactional Memory Kinson Chan, King Tin Lam, Cho-Li Wang Presenter: Kinson Chan Date: 16 February 2010 PDCN 2011, Innsbruck, Austria DEPARTMENT OF COMPUTER SCIENCE THE UNIVERSITY OF HONG KONG
  • 2. Outline • Motivation – ‣ hardware trend and software transactional memory • Background – ‣ performance scalability ‣ ratio-based concurrency control and its myth • Solution – ‣ our rate-based heuristic, Probe. • Evaluation – ‣ performance comparison 2
  • 3. Motivation What is the current computing hardware trend, and why is software transactional memory relevant?
  • 4. Hardware trend: multicores • Multicore processors ‣ a.k.a. chip multiprocessing ‣ multiple cores on a processor die ‣ cores share a common cache ‣ faster data sharing among threads ‣ more running threads per cabinet • Chip Multithreading ‣ e.g. hyperthreading, coolthreads ‣ more than one threads per core ‣ hide the data load latency 4 L1! L1! L1! L1! L2! L2! L2! L2! L3! 1! 2! 3! 4! 5! 6! 7! 8! a typical modern processor
  • 5. Hardware trend: multicores • Multicore processors ‣ a.k.a. chip multiprocessing ‣ multiple cores on a processor die ‣ cores share a common cache ‣ faster data sharing among threads ‣ more running threads per cabinet • Chip Multithreading ‣ e.g. hyperthreading, coolthreads ‣ more than one threads per core ‣ hide the data load latency 4 L1! L1! L1! L1! L2! L2! L2! L2! L3! 1! 2! 3! 4! 5! 6! 7! 8! Multiple cores a typical modern processor
  • 6. Hardware trend: multicores • Multicore processors ‣ a.k.a. chip multiprocessing ‣ multiple cores on a processor die ‣ cores share a common cache ‣ faster data sharing among threads ‣ more running threads per cabinet • Chip Multithreading ‣ e.g. hyperthreading, coolthreads ‣ more than one threads per core ‣ hide the data load latency 4 L1! L1! L1! L1! L2! L2! L2! L2! L3! 1! 2! 3! 4! 5! 6! 7! 8! Mutli-thread per core a typical modern processor
  • 7. Now and future multicores 5 Micro- architecture Clock rate Cores Threads per core Threads per package Shared cache Memory arrangement IBM Power 7 ~ 3 GHz 4 ~ 8 4 32 Max 4 MB shared L3 NUMA Sun Niagara2 1.2 ~ 1.6 GHz 4 ~ 8 8 64 Max 4 MB shared L2 NUMA Intel Westmere ~ 2 GHz 4 ~ 8 2 16 Max 12 ~ 24 MB shared L3 NUMA Intel Harpertown ~ 3 GHz 2 x 2 2 8 2 x 6 MB shared L3 UMA AMD Bulldozer ~ 2 GHz 2 x 6 ~ 2 x 8 1 16 Max 8 MB shared L3 NUMA AMD Magny- Cours ~ 3 GHz 8 modules 2 per module 16 Max 8 MB shared L3 NUMA Intel Terascale ~ 4 GHz 80? 1? 80? 80 x 2 KB dist. cache NUCA
  • 8. Now and future multicores 5 Micro- architecture Clock rate Cores Threads per core Threads per package Shared cache Memory arrangement IBM Power 7 ~ 3 GHz 4 ~ 8 4 32 Max 4 MB shared L3 NUMA Sun Niagara2 1.2 ~ 1.6 GHz 4 ~ 8 8 64 Max 4 MB shared L2 NUMA Intel Westmere ~ 2 GHz 4 ~ 8 2 16 Max 12 ~ 24 MB shared L3 NUMA Intel Harpertown ~ 3 GHz 2 x 2 2 8 2 x 6 MB shared L3 UMA AMD Bulldozer ~ 2 GHz 2 x 6 ~ 2 x 8 1 16 Max 8 MB shared L3 NUMA AMD Magny- Cours ~ 3 GHz 8 modules 2 per module 16 Max 8 MB shared L3 NUMA Intel Terascale ~ 4 GHz 80? 1? 80? 80 x 2 KB dist. cache NUCA How can we scale our program to have these many threads?
  • 10. Multi-threading and synchronization 6 Coarse grain locking Easy / Correct (few locks, predictable) Difficult to scale (excessive mutual exclusion)
  • 11. Multi-threading and synchronization 6 Coarse grain locking Easy / Correct (few locks, predictable) Difficult to scale (excessive mutual exclusion) Fine-grain locking Error prone (deadlock, forget to lock, ...) Scales better (allows more parallelism)
  • 12. Multi-threading and synchronization 6 Coarse grain locking Easy / Correct (few locks, predictable) Difficult to scale (excessive mutual exclusion) Fine-grain locking Error prone (deadlock, forget to lock, ...) Scales better (allows more parallelism) Do we have anything in between? Easy / Correct Scales good
  • 13. STM optimistic execution 7 Begin Begin Proceed Proceed Commit Commit Retry Commit x=x+4 y=y-4 x=x+2 y=y-2 Begin Proceed Begin Proceed Commit Commit x=x+4 y=y-4 w=w+5 z=w Thread 1 Thread 2 Thread 3 Thread 1 Thread 2 Thread 3 x=x+2 y=y-2 Success Success Success Success conflict detection conflict detection begin; x=x+4; y=y-4; commit; begin; x=x+2; y=y-2; commit; begin; x=x+4; y=y-4; commit; begin; w=w+5; z=w; commit; begin; x = x + 4; y = y - 4; commit;
  • 14. STM optimistic execution 7 Begin Begin Proceed Proceed Commit Commit Retry Commit x=x+4 y=y-4 x=x+2 y=y-2 Begin Proceed Begin Proceed Commit Commit x=x+4 y=y-4 w=w+5 z=w Thread 1 Thread 2 Thread 3 Thread 1 Thread 2 Thread 3 x=x+2 y=y-2 Success Success Success Success conflict detection conflict detection begin; x=x+4; y=y-4; commit; begin; x=x+2; y=y-2; commit; begin; x=x+4; y=y-4; commit; begin; w=w+5; z=w; commit;
  • 15. STM optimistic execution 7 Begin Begin Proceed Proceed Commit Commit Retry Commit x=x+4 y=y-4 x=x+2 y=y-2 Begin Proceed Begin Proceed Commit Commit x=x+4 y=y-4 w=w+5 z=w Thread 1 Thread 2 Thread 3 Thread 1 Thread 2 Thread 3 x=x+2 y=y-2 Success Success Success Success conflict detection conflict detection begin; x=x+4; y=y-4; commit; begin; x=x+2; y=y-2; commit; begin; x=x+4; y=y-4; commit; begin; w=w+5; z=w; commit; case 1: two transactions conflicts: rollback and retry one of them.
  • 16. STM optimistic execution 7 Begin Begin Proceed Proceed Commit Commit Retry Commit x=x+4 y=y-4 x=x+2 y=y-2 Begin Proceed Begin Proceed Commit Commit x=x+4 y=y-4 w=w+5 z=w Thread 1 Thread 2 Thread 3 Thread 1 Thread 2 Thread 3 x=x+2 y=y-2 Success Success Success Success conflict detection conflict detection begin; x=x+4; y=y-4; commit; begin; x=x+2; y=y-2; commit; begin; x=x+4; y=y-4; commit; begin; w=w+5; z=w; commit; case 1: two transactions conflicts: rollback and retry one of them. case 2: two transactions do not conflict: they execute together, achieving better parallelism.
  • 17–20. STM is easy: At the University of Texas at Austin, 237 students taking Operating Systems courses were asked to solve the same programming problem with coarse locks, fine-grained locks, monitors, and transactions [C. J. Rossbach, O. S. Hofmann and E. Witchel, "Is transactional programming actually easier?", Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 45–56, 2010]. [Slide charts compare coarse locks, fine-grain locks, and TM on three scales: Development Time (short to long), Code Complexity (simple to complex), and Errors (less to more).]
  • 21–23. STM is (not just) a research toy: Research prototypes abound (STM, SXM, OSTM, DSTM, ASTM, TL2, TinySTM, TLRW, SwissTM, NOrec, TML, RingTM, InvalTM, DeuceTM, D2STM), but industry is involved as well:
      Company     Products                        Research
      Sun         Dynamic STM library             DSTM, TL2, TLRW, Rock processor, ...
      Intel       Intel C++ STM compiler          McRT-STM, ...
      IBM         C/C++ for TM on AIX             STM extension on X10, ...
      Microsoft   STM.NET                         STM on Haskell, ...
      AMD         ASF instruction set extension
  • 24. Background: What affects transactional memory performance, and how can we adjust concurrency for the best performance?
  • 26–29. Threads and performance: More threads mean more transactional attempts per unit time, but also a smaller fraction of transactions that commit. Throughput is roughly (attempt rate) x (commit probability), so plotting performance against thread count gives a concave curve with an optimal thread count somewhere in the middle. A toy model of this curve follows below.
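A toy model (my illustration, not from the paper) of why the curve is concave: the attempt rate grows roughly linearly with the thread count while the commit probability decays with it, so their product peaks at some intermediate concurrency.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative model only: attempts scale with n, and each attempt
     * commits with probability p^(n-1) for a per-pair conflict factor p. */
    const double p = 0.85;
    for (int n = 1; n <= 16; n++) {
        double rate = n * pow(p, n - 1);   /* relative commits per unit time */
        printf("%2d threads -> relative commit rate %.2f\n", n, rate);
    }
    return 0;   /* the printed curve rises, peaks, then falls */
}
```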
  • 30–38. Ratio- vs rate-based concurrency controls: Concurrency control in STM tries to reach optimal performance through thread scheduling. Different concepts exist: commit ratio-based heuristics (ratio = commits / (commits + aborts); reduce concurrency when the ratio gets too low, relax it when the ratio rises above a threshold); commit rate-based heuristics (rate = commits / time); and queuing behind winner transactions (kernel-level programming, conditional waiting, ...). A sketch of the two metrics follows below.
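A minimal sketch of the two metrics computed from the same counters over a sampling interval; the struct and function names are illustrative, not the paper's.

```c
struct tm_stats { unsigned long commits, aborts; };

/* ratio = commits / (commits + aborts): fraction of attempts that succeed. */
double commit_ratio(const struct tm_stats *s)
{
    unsigned long attempts = s->commits + s->aborts;
    return attempts ? (double)s->commits / attempts : 1.0;
}

/* rate = commits / time: useful work completed per second. */
double commit_rate(const struct tm_stats *s, double elapsed_seconds)
{
    return elapsed_seconds > 0.0 ? s->commits / elapsed_seconds : 0.0;
}
```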
  • 39. Ratio-based solutions: Ansari et al. introduce the total commit ratio (TCR) and increase or decrease the number of threads by comparing TCR against a set-point (70%). Yoo and Lee introduce a per-thread contention intensity (CI), the likelihood of a thread encountering contention; a thread stalls on a global mutex when its CI rises above a threshold (70%). Dolev et al. activate hotspot detection when CI rises above a threshold (40%), and a thread stalls on the mutex when a hotspot is detected. A sketch of the CI bookkeeping follows below.
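A sketch of the per-thread contention-intensity bookkeeping in the style of Yoo and Lee's scheme; the update weight ALPHA is illustrative and the 70% threshold is the one quoted on the slide, so this is not a faithful reimplementation of their scheduler.

```c
/* Illustrative CI tracking: an exponentially weighted history of aborts. */
#define ALPHA 0.5

struct thread_cc {
    double ci;   /* contention intensity, in [0, 1] */
};

void on_transaction_end(struct thread_cc *t, int aborted)
{
    t->ci = ALPHA * t->ci + (1.0 - ALPHA) * (aborted ? 1.0 : 0.0);
}

int must_serialize(const struct thread_cc *t)
{
    /* Above the threshold the thread queues behind a global mutex
     * instead of starting its next transaction concurrently. */
    return t->ci > 0.70;
}
```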
  • 40–41. Myths of ratio-based heuristics: We want the application to finish faster, i.e. more transactions committed per unit time (assuming a constant total number of transactions). A high commit ratio is not the same as high performance: compare 1 thread at a 100% ratio with 4 threads at a 40% ratio; engine rotation is not vehicle velocity. Watching the commit ratio is an inexact science; it merely happens to be a close estimate. Drawbacks: over-serialization when the commit ratio is low, over-relaxation when it is high.
  • 42. Challenges for a rate-based solution: At any instant we can only pick one thread count; "what if" questions cannot be asked at run time. What counts as high and what counts as low? The commit ratio always lies between 0% and 100%, but the commit rate depends on transaction lengths. The application's behaviour also changes along the execution: transactions may get longer or shorter and see more or less contention. Pre-defined bounds are therefore not acceptable, because the optimal spot moves across the execution timeline.
  • 43–52. Commit ratio vs commit rate: [chart plotting the commit ratio (green line, %) and the commit rate (red line) over the execution of an application]. Observations from the chart: more threads can give better performance, but excessive threads kill it; the application's nature changes over time; excessive threads also make the rate fluctuate; a dropping ratio can coincide with an increasing rate and a shorter run time; and a low commit ratio can still be scalable.
  • 53. Solution: Now that we know ratio-based solutions are not the right signal, what should a rate-based alternative look like?
  • 54–59. Counting variables: commits = number of commits in a time-slice; aborts = number of aborts in a time-slice; quota = maximum concurrency; entered = currently active transactions; peak = peak concurrency in a time-slice. Commits per time-slice give the commit rate; commits / (commits + aborts) gives the commit ratio. When entered reaches quota, no more new transactions are admitted: newcomers are stalled by yielding the CPU (e.g. sched_yield()). When quota exceeds peak there is unused quota, so quota is reduced to keep the limit tight. A sketch of this bookkeeping follows below.
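A minimal sketch of how these counters could gate transaction starts; the use of C11 atomics and sched_yield() for stalling are my assumptions rather than the paper's actual implementation.

```c
#include <sched.h>
#include <stdatomic.h>

static atomic_uint  quota   = 8;    /* maximum concurrency for this slice */
static atomic_uint  entered = 0;    /* currently active transactions */
static atomic_uint  peak    = 0;    /* peak concurrency seen this slice */
static atomic_ulong commits = 0, aborts = 0;

void tx_enter(void)
{
    /* entered >= quota: no more new transactions; stall newcomers.
     * The check is advisory (racy by design in this sketch). */
    while (atomic_load(&entered) >= atomic_load(&quota))
        sched_yield();

    unsigned now = atomic_fetch_add(&entered, 1) + 1;
    unsigned p = atomic_load(&peak);
    while (now > p && !atomic_compare_exchange_weak(&peak, &p, now))
        ;                            /* record the peak concurrency */
}

void tx_exit(int committed)
{
    atomic_fetch_add(committed ? &commits : &aborts, 1);
    atomic_fetch_sub(&entered, 1);
}
```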
  • 60–64. System architecture: User code runs transactional threads; the transactional memory system provides the conflict detection engine, shared memory, a per-thread activity logger, and the new concurrency control unit; the operating system provides thread creation, thread scheduling, memory management, and the input/output system. Transactions execute as normal, with conflict detection. The concurrency control unit is added as a hook that monitors performance, and the scheduler is invoked to stall some new transactions when appropriate. An illustrative hook interface follows below.
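One way to wire the unit in, matching the slide's description of a hook around each transaction, is a small callback table invoked by the TM library; the names below are illustrative and are not TinySTM's real callback API.

```c
/* Illustrative hook interface for the concurrency control unit. */
struct cc_hooks {
    void (*on_tx_start)(void);    /* may stall the caller if over quota */
    void (*on_tx_commit)(void);   /* count commits for the rate metric */
    void (*on_tx_abort)(void);    /* count aborts for the ratio metric */
    void (*on_timeslice)(void);   /* periodically adjust the quota */
};
```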
  • 65–80. Art of uninformed climbing: [animation of hill-climbing on the commit rate vs active threads curve]. The controller does not know the shape of the curve, so it probes it. It keeps a direction (say, towards fewer active threads) and keeps moving the quota that way while the measured commit rate improves; as soon as the rate decreases, it changes direction. Repeating this from both sides leaves the quota oscillating inside a small probing region around the optimum. A sketch of this step appears below.
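A sketch of the probing (uninformed hill-climbing) step, run once per time-slice: keep moving the quota in the current direction while the commit rate improves, and reverse direction when it drops. The one-thread step size and the bounds are illustrative, so this is not a line-for-line version of Probe.

```c
/* State kept by the concurrency control unit across time-slices. */
static int      direction = -1;      /* -1: fewer threads, +1: more threads */
static double   last_rate = 0.0;
static unsigned cur_quota = 16;      /* starting quota (illustrative) */

/* Called at the end of every time-slice with the measured commit rate. */
unsigned probe_adjust_quota(double rate, unsigned min_q, unsigned max_q)
{
    if (rate < last_rate)
        direction = -direction;      /* decreasing rate: change direction */

    if (direction < 0 && cur_quota > min_q)
        cur_quota--;
    else if (direction > 0 && cur_quota < max_q)
        cur_quota++;

    last_rate = rate;
    return cur_quota;                /* quota to enforce in the next slice */
}
```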
  • 81. Performance Evaluation: How does our rate-based solution compare with the ratio-based ones?
  • 82. Evaluation Platform • Dell PowerEdge M610 Blade Server ‣ 2x Intel “Nehalem” Xeon E5540 2.53 GHz (8 cores, 16 threads) ‣ ECC DDR3-1066 36 GB main memory • STAMP Benchmark ‣ original from Stanford University – http://stamp.stanford.edu/ ‣ modified version for TinySTM: http://www.tinystm.org/ • TinySTM 0.9.5 ‣ open-source version – http://www.tinystm.org/ • Yoo’s and Shrink concurrency control ‣ from EPFL Distributed Programming Laboratory – http://lpd.epfl.ch/site/research/tmeval/ 22
  • 83–93. Probing in effect vs other heuristics: [Figure 3 of the paper: commit ratio (green dotted line, %), commit rate (red solid line, transactions per second) and number of stalled threads (blue dashed line) over time, for kmeans-2 with 16 threads and yada with 8 threads, under the basic TinySTM and the throttle2, probe2, yoo, and shrink (Dolev's) policies.] Observations: the original TinySTM suffers a low commit ratio. Probe, the rate-based concurrency control, makes only mild adjustments to the number of stalled threads and reaches a higher commit rate. Yoo's and Dolev's ratio-based concurrency controls aim for a high commit ratio, make large adjustments, and end up with an even lower commit rate.
  • 94. Yoo’s Dolev’s Throttle Probe Probe2 yada vacation-1 kmeans-2 vacation-2 kmeans-1 intruder genome ssca labyrinth average +70.5% +61.02% +81.33% +101.79% +81.59% +38.14% +18.39% +35.68% +54.90% +66.26% -6.50% -19.56% -11.38% +2.52% +31.59% +3.19% -19.04% +6.63% +2.42% +8.81% -1.45% -28.96% -24.49% -1.38% +8.19% -12.95% -28.03% -35.13% -14.65% +5.26% +0.46% -11.80% -4.73% -7.55% -1.01% +0.58% +0.29% -25.08% -26.51% -4.62% -2.37% -1.34% -27.66% -12.22% -6.73% +9.96% -3.23% -0.54% +11.03% +21.04% Performance comparison 24
  • 95. Yoo’s Dolev’s Throttle Probe Probe2 yada vacation-1 kmeans-2 vacation-2 kmeans-1 intruder genome ssca labyrinth average +70.5% +61.02% +81.33% +101.79% +81.59% +38.14% +18.39% +35.68% +54.90% +66.26% -6.50% -19.56% -11.38% +2.52% +31.59% +3.19% -19.04% +6.63% +2.42% +8.81% -1.45% -28.96% -24.49% -1.38% +8.19% -12.95% -28.03% -35.13% -14.65% +5.26% +0.46% -11.80% -4.73% -7.55% -1.01% +0.58% +0.29% -25.08% -26.51% -4.62% -2.37% -1.34% -27.66% -12.22% -6.73% +9.96% -3.23% -0.54% +11.03% +21.04% Performance comparison 24 Ratio-based Heuristics
  • 96. Yoo’s Dolev’s Throttle Probe Probe2 yada vacation-1 kmeans-2 vacation-2 kmeans-1 intruder genome ssca labyrinth average +70.5% +61.02% +81.33% +101.79% +81.59% +38.14% +18.39% +35.68% +54.90% +66.26% -6.50% -19.56% -11.38% +2.52% +31.59% +3.19% -19.04% +6.63% +2.42% +8.81% -1.45% -28.96% -24.49% -1.38% +8.19% -12.95% -28.03% -35.13% -14.65% +5.26% +0.46% -11.80% -4.73% -7.55% -1.01% +0.58% +0.29% -25.08% -26.51% -4.62% -2.37% -1.34% -27.66% -12.22% -6.73% +9.96% -3.23% -0.54% +11.03% +21.04% Performance comparison 24 Ratio-based Heuristics Naive: scales for 25% ~ 75% of ratio
  • 97. Yoo’s Dolev’s Throttle Probe Probe2 yada vacation-1 kmeans-2 vacation-2 kmeans-1 intruder genome ssca labyrinth average +70.5% +61.02% +81.33% +101.79% +81.59% +38.14% +18.39% +35.68% +54.90% +66.26% -6.50% -19.56% -11.38% +2.52% +31.59% +3.19% -19.04% +6.63% +2.42% +8.81% -1.45% -28.96% -24.49% -1.38% +8.19% -12.95% -28.03% -35.13% -14.65% +5.26% +0.46% -11.80% -4.73% -7.55% -1.01% +0.58% +0.29% -25.08% -26.51% -4.62% -2.37% -1.34% -27.66% -12.22% -6.73% +9.96% -3.23% -0.54% +11.03% +21.04% Performance comparison 24 Rate-based Heuristics
  • 98. Yoo’s Dolev’s Throttle Probe Probe2 yada vacation-1 kmeans-2 vacation-2 kmeans-1 intruder genome ssca labyrinth average +70.5% +61.02% +81.33% +101.79% +81.59% +38.14% +18.39% +35.68% +54.90% +66.26% -6.50% -19.56% -11.38% +2.52% +31.59% +3.19% -19.04% +6.63% +2.42% +8.81% -1.45% -28.96% -24.49% -1.38% +8.19% -12.95% -28.03% -35.13% -14.65% +5.26% +0.46% -11.80% -4.73% -7.55% -1.01% +0.58% +0.29% -25.08% -26.51% -4.62% -2.37% -1.34% -27.66% -12.22% -6.73% +9.96% -3.23% -0.54% +11.03% +21.04% Performance comparison 24 Rate-based Heuristics Better number counting strategy
  • 99. Yoo’s Dolev’s Throttle Probe Probe2 yada vacation-1 kmeans-2 vacation-2 kmeans-1 intruder genome ssca labyrinth average +70.5% +61.02% +81.33% +101.79% +81.59% +38.14% +18.39% +35.68% +54.90% +66.26% -6.50% -19.56% -11.38% +2.52% +31.59% +3.19% -19.04% +6.63% +2.42% +8.81% -1.45% -28.96% -24.49% -1.38% +8.19% -12.95% -28.03% -35.13% -14.65% +5.26% +0.46% -11.80% -4.73% -7.55% -1.01% +0.58% +0.29% -25.08% -26.51% -4.62% -2.37% -1.34% -27.66% -12.22% -6.73% +9.96% -3.23% -0.54% +11.03% +21.04% Performance comparison 24
  • 100. Yoo’s Dolev’s Throttle Probe Probe2 yada vacation-1 kmeans-2 vacation-2 kmeans-1 intruder genome ssca labyrinth average +70.5% +61.02% +81.33% +101.79% +81.59% +38.14% +18.39% +35.68% +54.90% +66.26% -6.50% -19.56% -11.38% +2.52% +31.59% +3.19% -19.04% +6.63% +2.42% +8.81% -1.45% -28.96% -24.49% -1.38% +8.19% -12.95% -28.03% -35.13% -14.65% +5.26% +0.46% -11.80% -4.73% -7.55% -1.01% +0.58% +0.29% -25.08% -26.51% -4.62% -2.37% -1.34% -27.66% -12.22% -6.73% +9.96% -3.23% -0.54% +11.03% +21.04% Performance comparison 24 over- relaxation
  • 101. Yoo’s Dolev’s Throttle Probe Probe2 yada vacation-1 kmeans-2 vacation-2 kmeans-1 intruder genome ssca labyrinth average +70.5% +61.02% +81.33% +101.79% +81.59% +38.14% +18.39% +35.68% +54.90% +66.26% -6.50% -19.56% -11.38% +2.52% +31.59% +3.19% -19.04% +6.63% +2.42% +8.81% -1.45% -28.96% -24.49% -1.38% +8.19% -12.95% -28.03% -35.13% -14.65% +5.26% +0.46% -11.80% -4.73% -7.55% -1.01% +0.58% +0.29% -25.08% -26.51% -4.62% -2.37% -1.34% -27.66% -12.22% -6.73% +9.96% -3.23% -0.54% +11.03% +21.04% Performance comparison 24 over- reaction
  • 102. Yoo’s Dolev’s Throttle Probe Probe2 yada vacation-1 kmeans-2 vacation-2 kmeans-1 intruder genome ssca labyrinth average +70.5% +61.02% +81.33% +101.79% +81.59% +38.14% +18.39% +35.68% +54.90% +66.26% -6.50% -19.56% -11.38% +2.52% +31.59% +3.19% -19.04% +6.63% +2.42% +8.81% -1.45% -28.96% -24.49% -1.38% +8.19% -12.95% -28.03% -35.13% -14.65% +5.26% +0.46% -11.80% -4.73% -7.55% -1.01% +0.58% +0.29% -25.08% -26.51% -4.62% -2.37% -1.34% -27.66% -12.22% -6.73% +9.96% -3.23% -0.54% +11.03% +21.04% Performance comparison 24
  • 103. Conclusions: The multicore trend urges us to write parallel programs. Software transactional memory is part of that future: it is easier to program, with fewer errors and neater code, but it needs concurrency control for the best performance. Ratio-based heuristics are inexact approximations; watching the ratio alone causes over-reaction and over-relaxation. Our rate-based concurrency heuristic, Probe, outperforms them.
  • 105. Thank You! Contact me: [email protected] HKU CS Systems Research Group: http://www.srg.cs.hku.hk/ Dr. Cho-Li Wang’s Webpage: http://www.cs.hku.hk/~clwang/