Applying Concurrency Cookbook Recipes to SPEC JBB
Jade Alglave – Architecture and Technology
Monica Beckwith – Infrastructure LOB; Advanced Server Team
2 © 2019 Arm Limited
About Us
Dr. Jade Alglave
• Memory model architect at Arm
• Co-developer and maintainer of the herd+diy toolsuite with Luc Maranget (INRIA, France)
• Co-developer and maintainer of the Linux kernel memory model
Monica Beckwith
• Managed runtime performance architect at Arm
• Experience with OpenJDK HotSpot JIT, GC
• Experience with JMM and with strongly and weakly ordered architectures such as x86-64, SPARC, Arm64 and (very briefly) PPC64
What We Will Cover Today
Introduction to –
Memory Models (Java Relaxed)
Performance Methodology using Litmus Tests and Tools
Performance Analysis and Measurement using Java Micro-Benchmark Harness (JMH)
Performance Study on Scaling CPU Cores and Simultaneous Multithreading (SMT)
What We Will NOT Cover Today
Details of –
Java Relaxed memory model
JMH benchmarking
SPEC JBB 2015 benchmarking
CPU cores and simultaneous multithreading (SMT)
Memory Models
- What value can a load read?
The Ideal Concurrent World of Hardware and Software
Processor Threads and Software Threads …
A multi-threaded, concurrency-aware program
[Figure: Thread 1 and Thread 2 acting on Object 1 and Object 2, with "Lock" and "help" interactions]
* This drawing is heavily inspired by the “timethreads” concept in Doug Lea’s ‘Concurrent Programming in Java: Design Principles and Patterns, Second Edition’
Multi-threaded hardware with shared memory structure
[Figure: Thread 1 … Thread n, each issuing writes (W) and reads (R) to Shared Memory]
* This drawing is heavily inspired by ‘A Tutorial Introduction to the ARM and POWER Relaxed Memory Models’ by Sewell et al.
Sequentially Consistent Shared Memory
Execution Order == Program Order == Sequential Order
[Figure: the program timeline (program order) for Thread 1 and Thread 2 on Object 1 and Object 2, combined with Thread 1 … Thread n writing (W) and reading (R) Shared Memory, gives a timeline with a single global execution order]
A Sequentially Consistent Machine
• No local reordering
• Writes become visible simultaneously to all threads
Sequential Consistency in Practice
Store Buffering Example
Initially, X and Y are 0 in memory; foo and bar are local (register) variables:
p0          | p1
a: X = 1;   | c: Y = 1;
b: foo = Y; | d: bar = X;
What are the permissible values for foo and bar?
Under sequential consistency, they are the values reachable by interleavings that preserve each thread's program order, for example:
{a,b,c,d} {c,d,a,b} {a,c,b,d}
Whichever store runs first is seen by the other thread's later load, so we cannot have foo and bar both equal to 0.
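The claim about interleavings can be checked by brute force. The following is a minimal Java sketch, not part of the original deck: the class and method names are made up for illustration. It enumerates every interleaving of a, b, c, d that keeps a before b and c before d, and collects the resulting (foo, bar) pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: enumerates sequentially consistent executions of the
// store buffering example (p0: a, b and p1: c, d).
final class ScStoreBuffering {
    private static final char[] P0 = {'a', 'b'};
    private static final char[] P1 = {'c', 'd'};

    // Returns every (foo, bar) outcome reachable by a program-order-preserving interleaving.
    static Set<String> outcomes() {
        Set<String> seen = new TreeSet<>();
        explore(new ArrayList<>(), 0, 0, seen);
        return seen;
    }

    private static void explore(List<Character> order, int i0, int i1, Set<String> seen) {
        if (i0 == P0.length && i1 == P1.length) {
            seen.add(run(order));
            return;
        }
        if (i0 < P0.length) {            // p0 takes the next step
            order.add(P0[i0]);
            explore(order, i0 + 1, i1, seen);
            order.remove(order.size() - 1);
        }
        if (i1 < P1.length) {            // p1 takes the next step
            order.add(P1[i1]);
            explore(order, i0, i1 + 1, seen);
            order.remove(order.size() - 1);
        }
    }

    // Executes one interleaving against a single shared memory.
    private static String run(List<Character> order) {
        int x = 0, y = 0, foo = -1, bar = -1;
        for (char op : order) {
            switch (op) {
                case 'a': x = 1; break;
                case 'b': foo = y; break;
                case 'c': y = 1; break;
                case 'd': bar = x; break;
                default: throw new IllegalArgumentException("unknown op " + op);
            }
        }
        return "foo=" + foo + ",bar=" + bar;
    }

    public static void main(String[] args) {
        System.out.println(outcomes());
    }
}
```

Six interleavings exist in total; the three listed on the slide are representatives, and none of the six yields foo=0 with bar=0.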
The Real Concurrent World of Hardware
Multi Processor Threads/Cores with Tiered Memory Structure
[Figure: CPU0, CPU1 … CPUn, each with L1 I$, L1 D$ and L2, connected to the LLC, the Memory Controller (DDR Banks) and IO]
Diagram annotations:
• CPUs: could be multi-threaded (SMT)
• CPUs: can have load and store buffers
• CPUs: can have out-of-order execution
• Per-core caches: usually private
• LLC: usually shared between all cores
Life In The Real World Without Sequential Consistency
Relaxed vs Strong Memory Model
[Figure: CPU0, CPU1 … CPUn with L1 I$, L1 D$ and L2, LLC, Memory Controller and IO]
Strong Memory Models: x86, SPARC
Strong models based on Total Store Ordering (TSO)
• A thread can see its own write before other threads
• All other threads see the write simultaneously: Multiple Copy Atomic Model
Weaker Memory Models: POWER, Arm v7
• Local reordering is allowed
• All threads are not guaranteed to see the write simultaneously: Not Multiple Copy Atomic Model
Life In The Real World Without Sequential Consistency
Relaxed vs Strong Memory Model
[Figure: CPU0, CPU1 … CPUn with L1 I$, L1 D$ and L2, LLC, Memory Controller and IO]
Strong Memory Models: x86, SPARC
Strong models based on Total Store Ordering (TSO)
• A thread can see its own write before other threads
• All other threads see the write simultaneously: Multiple Copy Atomic Model
Weaker Memory Models: Arm v8
• Local reordering is allowed
• All threads are guaranteed to see the write simultaneously: Multiple Copy Atomic Model
Going Back To Our Store Buffer Example
• Can we reason about our concurrent programs following Sequential Consistency?
• Probably, if we had formal, preferably executable, memory models to ensure that we understand the guarantees given by architectures and programming languages.
• Here’s where Jade would come in, talking about her cool tools that allow programmers to explore the consequences of a given memory model or generate vast families of litmus tests to run against hardware.
herd: litmus test + cat model -> Is this behavior allowed by the cat model? Yes/No
litmus: litmus test run on HW -> Is this behavior observed on HW? Yes/No
diy: configuration file (~cat model) -> litmus tests
diy.inria.fr
Store Buffer Litmus Test on a TSO Hardware
X86 SB
{x=0; y=0;}
P0          | P1          ;
MOV [x],$1  | MOV [y],$1  ;
MOV EAX,[y] | MOV EAX,[x] ;
exists (0:EAX=0 /\ 1:EAX=0)
Annotations:
• X86 SB: hardware architecture and test name
• {x=0; y=0;}: initial state (x and y are shared memory locations)
• P0 | P1: thread names
• Instruction rows: sequence of instructions displayed as columns
• exists (…): the question, can we observe this final state given that initially x=0; y=0?
Armed With Knowledge
On TSO hardware, can we observe the final state foo=0 and bar=0, given that initially X=0; Y=0?
...
Yes! All production architectures allow the outcome where both foo and bar equal 0.
So, what do we do? …
Use mfence as needed.
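To see why a store buffer makes the 0/0 outcome reachable and why a fence removes it, here is a toy operational model in Java. It is a sketch under stated assumptions, not the formal x86-TSO model: each thread has a one-entry store buffer, a buffered store can drain to memory at any time, and the `fence` flag stands in for an mfence between the store and the load by blocking the load until the thread's own buffer is empty. All names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model of the SB test: thread 0 runs "X = 1; foo = Y", thread 1 runs "Y = 1; bar = X".
// mem[0] is X, mem[1] is Y; buffered[t] is true while thread t's store sits in its store buffer.
final class TsoStoreBuffer {
    static Set<String> outcomes(boolean fence) {
        Set<String> seen = new TreeSet<>();
        step(new int[2], new boolean[2], new int[2], new int[2], fence, seen);
        return seen;
    }

    private static void step(int[] mem, boolean[] buffered, int[] pc, int[] reg,
                             boolean fence, Set<String> seen) {
        if (pc[0] == 2 && pc[1] == 2) {
            seen.add("foo=" + reg[0] + ",bar=" + reg[1]);
            return;
        }
        for (int t = 0; t < 2; t++) {
            if (buffered[t]) {
                // The buffered store of thread t drains to shared memory.
                int[] m = mem.clone();
                boolean[] b = buffered.clone();
                m[t] = 1;
                b[t] = false;
                step(m, b, pc, reg, fence, seen);
            }
            if (pc[t] == 0) {
                // The store goes into the local buffer, invisible to the other thread.
                boolean[] b = buffered.clone();
                int[] p = pc.clone();
                b[t] = true;
                p[t] = 1;
                step(mem, b, p, reg, fence, seen);
            } else if (pc[t] == 1 && !(fence && buffered[t])) {
                // The load reads the other thread's variable from shared memory.
                // With a fence, it may only run once the own buffer has drained.
                int[] p = pc.clone();
                int[] r = reg.clone();
                r[t] = mem[1 - t];
                p[t] = 2;
                step(mem, buffered, p, r, fence, seen);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("no fence:   " + outcomes(false));
        System.out.println("with fence: " + outcomes(true));
    }
}
```

In this model the unfenced run includes foo=0,bar=0 because both loads can execute while both stores are still buffered; with the fence, each load waits for its own thread's store to reach memory, which rules that outcome out.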
Performance Methodology
- Using Litmus Tests and Tools To Avoid Barriers Wherever Possible
What & The Why Of Barriers / Fences?
Barriers ensure ordering properties
Barriers enforce strong order
Barriers (when inserted correctly) restore sequential consistency
Barriers can be potentially expensive
Data Memory Barriers on Arm:
DMB SY (full system)
DMB ST (wait for store to complete)
DMB LD (wait for only loads to complete)
Normal Load-Stores
No Barriers
Litmus test
AArch64 MP
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
MOV W0,#1 | LDR W0,[X1] ;
STR W0,[X1] | LDR W2,[X3] ;
MOV W2,#1 | ;
STR W2,[X3] | ;
exists
(1:X0=1 /\ 1:X2=0)
Check for any reorder
Check if X0 = 1 and X2 = 0 can exist on P1.
Normal Stores
Load Barrier
Litmus test
AArch64 MP+DMB.LD
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
MOV W0,#1 | LDR W0,[X1] ;
STR W0,[X1] | DMB LD;
MOV W2,#1 | LDR W2,[X3] ;
STR W2,[X3] | ;
exists
(1:X0=1 /\ 1:X2=0)
Check for Store reorder
Load & Store Barriers
Litmus test
AArch64 MP+DMB.LD+DMB.ST
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
MOV W0,#1 | LDR W0,[X1] ;
STR W0,[X1] | DMB LD;
DMB ST | LDR W2, [X3];
MOV W2,#1 | ;
STR W2,[X3] | ;
exists
(1:X0=1 /\ 1:X2=0)
Generated assembler
#START _litmus_P1
ldr w4,[x1]
dmb ld
ldr w5,[x2]
#START _litmus_P0
mov w7,#1
str w7,[x0]
dmb st
mov w6,#1
str w6,[x2]
Test MP+DMB.ST+DMB.LD Allowed
Histogram (3 states)
499999:>1:X0=0; 1:X2=0;
20 :>1:X0=0; 1:X2=1;
499981:>1:X0=1; 1:X2=1;
No
Witnesses
Positive: 0, Negative: 1000000
Condition exists (1:X0=1 /\ 1:X2=0) is NOT validated
Hash=4d15dccdb1da0ce51fac17dea068d047
Observation MP+DMB.ST+DMB.LD Never 0 1000000
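A Java-level analogue of this message passing test can be sketched with the static fence methods on `java.lang.invoke.VarHandle` (Java 9+). This is an illustration only: the class name and the `violations` helper are hypothetical, and how a given JVM lowers these fences to DMB variants on AArch64 is an assumption, not something the code checks.

```java
import java.lang.invoke.VarHandle;

// Hypothetical harness mirroring MP+DMB.ST+DMB.LD: P0 writes x then y, P1 reads y then x.
final class MessagePassingFences {
    static int x; // data, plays the role of location x in the litmus test
    static int y; // flag, plays the role of location y

    // Counts rounds in which P1 saw y == 1 but x == 0 (the "exists" condition).
    static int violations(int rounds) {
        int bad = 0;
        for (int i = 0; i < rounds; i++) {
            x = 0;
            y = 0;
            int[] r = new int[2];
            Thread p0 = new Thread(() -> {
                x = 1;
                VarHandle.storeStoreFence(); // keep the store to x ahead of the store to y
                y = 1;
            });
            Thread p1 = new Thread(() -> {
                r[0] = y;
                VarHandle.loadLoadFence();   // keep the load of y ahead of the load of x
                r[1] = x;
            });
            p0.start();
            p1.start();
            try {
                p0.join();
                p1.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            if (r[0] == 1 && r[1] == 0) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("violations: " + violations(1000));
    }
}
```

As with the litmus run above, a count of zero over many rounds is an observation rather than a proof; the guarantee itself comes from the documented ordering of the two fences.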
But Aren’t Barriers Expensive?
Acquire – Release (implicit barrier) semantics – One-way barriers
LDAR - All loads and stores that are after an LDAR in program order, … must be observed after the LDAR
STLR - All loads and stores preceding an STLR …, must be observed before the STLR
Thinking about lock-free?
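At the Java level, one-way acquire/release ordering is available through `VarHandle.setRelease` and `VarHandle.getAcquire`. The sketch below is a minimal illustration with hypothetical class and field names; whether a particular JVM emits STLR and LDAR for these calls on AArch64 is an assumption here.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical one-shot mailbox: the writer publishes a payload with a release store,
// the reader waits with acquire loads and then reads the payload.
final class AcquireRelease {
    int payload;
    int ready;

    private static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup().findVarHandle(AcquireRelease.class, "ready", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(int value) {
        payload = value;            // plain store, ordered before the release store below
        READY.setRelease(this, 1);  // release: earlier accesses cannot move after it
    }

    int consume() {
        while ((int) READY.getAcquire(this) == 0) { // acquire: later accesses cannot move before it
            Thread.onSpinWait();
        }
        return payload;
    }

    static int demo() {
        AcquireRelease box = new AcquireRelease();
        int[] out = new int[1];
        Thread reader = new Thread(() -> out[0] = box.consume());
        reader.start();
        box.publish(42);
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out[0];
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The reader can only leave its loop after observing the release store, so the plain read of `payload` that follows sees the published value.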
Performance Analysis and Measurement
- Using Java Micro-Benchmarking Harness (JMH)
JMM Rule 1 for Volatile Stores
Any load/store (normal or volatile) followed by a ‘volatile store’ can’t be reordered.
Order 1: Normal/Volatile Load or Normal/Volatile Store
Order 2: Volatile Store
Can’t reorder
Volatile Stores - Barriers With DMBs
JMM rule: Any load/store followed by Volatile Store can’t be re-ordered
Litmus test
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
MOV W0,#1 | LDR W0,[X1] ;
STR W0,[X1] | DMB LD ;
LDR W2,[X1] | LDR W2,[X3] ;
DMB SY | ;
STR W2,[X3] | ;
exists
(1:X0=1 /\ 1:X2=0)
Results
Test MP Allowed
States 3
1:X0=0; 1:X2=0;
1:X0=0; 1:X2=1;
1:X0=1; 1:X2=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (1:X0=1 /\ 1:X2=0)
Observation MP Never 0 3
Volatile Stores - Can Barriers Be Replaced By STLR?
JMM rule: Any load/store followed by Volatile Store can’t be re-ordered
Litmus test
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
MOV W0,#1 | LDR W0,[X1] ;
STR W0,[X1] | DMB LD ;
MOV W2,#1 | LDR W2,[X3] ;
STLR W2,[X3] | ;
exists
(1:X0=1 /\ 1:X2=0)
Results
Test MP Allowed
States 3
1:X0=0; 1:X2=0;
1:X0=0; 1:X2=1;
1:X0=1; 1:X2=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (1:X0=1 /\ 1:X2=0)
Observation MP Never 0 3
JMM Rule 2 for Volatile Stores
• A ‘volatile store’ followed by any normal load/store CAN be reordered.
• A ‘volatile store’ followed by any volatile load/store CANNOT be reordered.
Order 1: Volatile Store
Order 2: Normal Load or Normal Store (can reorder)
Order 2: Volatile Load or Volatile Store (can’t reorder)
Volatile Stores – Can Barriers Be Replaced By STLR?
STLR doesn’t guarantee that a subsequent volatile load/store will not be reordered
JMH Test Code
static class TestNormalLoadPostVolatileStores {
    volatile int intField1;
    int intNorm1;

    public TestNormalLoadPostVolatileStores() {
        intField1 = 32;        // volatile store
        intNorm1 = intField1;  // volatile load of the same field, then a normal store
    }
}
Litmus Test
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
P0           | P1          ;
LDR W0,[X1]  | LDR W0,[X1] ;
MOV W2,#1    | MOV W2,#1   ;
STLR W2,[X3] | STR W2,[X3] ;
exists
(0:X0=1 /\ 1:X0=1)
[Figure: Positive Event Structure]
Volatile Stores – Can Barriers Be Replaced By STLR?
Success! STLR + LDAR of volatiles provide that guarantee
Litmus test
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
LDR W0,[X1] | LDAR W0,[X1] ;
MOV W2,#1 | MOV W2,#1 ;
STLR W2,[X3] | STR W2,[X3] ;
exists
(0:X0=1 /\ 1:X0=1)
Results
Test LB Allowed
States 3
0:X0=0; 1:X0=0;
0:X0=0; 1:X0=1;
0:X0=1; 1:X0=0;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:X0=1 /\ 1:X0=1)
Observation LB Never 0 3
Volatile Stores – JMH Profiles
Success! STLR + LDAR of volatiles provide that guarantee
Load Acquire – Store Release Pair
stlr w11, [x10] ;*putfield intField1 ;
add x10, x2, #0xc
ldar w11, [x10] ;*getfield intField1
Data Memory Barrier (inner shareability domain)
str w10, [x2,#12]
dmb ish ;*putfield intField1
ldr w11, [x2,#12]
dmb ishld ;*getfield intField1
36% faster on max SMT count!!
JMM Rule 1 for Volatile Loads
A ‘volatile load’ followed by any load/store (normal or volatile) can’t be reordered.
Order 1: Volatile Load
Order 2: Normal/Volatile Load or Normal/Volatile Store
Can’t reorder
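The volatile rules stated on these slides can be collapsed into one small lookup. The Java sketch below is illustrative only, with hypothetical names; the final fall-through case (a normal access followed by anything other than a volatile store) is not spelled out on the slides and is taken here as an assumption from Doug Lea's JSR-133 cookbook table listed under Resources.

```java
// Hypothetical encoding of the JMM reordering rules for volatile accesses.
final class JmmReorderRules {
    enum Access { NORMAL_LOAD, NORMAL_STORE, VOLATILE_LOAD, VOLATILE_STORE }

    static boolean isVolatile(Access a) {
        return a == Access.VOLATILE_LOAD || a == Access.VOLATILE_STORE;
    }

    // Returns true when 'first' followed in program order by 'second' may be reordered.
    static boolean canReorder(Access first, Access second) {
        if (second == Access.VOLATILE_STORE) {
            return false;                 // Rule 1 for volatile stores
        }
        if (first == Access.VOLATILE_LOAD) {
            return false;                 // Rule 1 for volatile loads
        }
        if (first == Access.VOLATILE_STORE) {
            return !isVolatile(second);   // Rule 2 for volatile stores
        }
        return true;                      // assumption: unrestricted per the cookbook table
    }

    public static void main(String[] args) {
        for (Access first : Access.values()) {
            for (Access second : Access.values()) {
                System.out.println(first + " then " + second + ": "
                        + (canReorder(first, second) ? "can reorder" : "can't reorder"));
            }
        }
    }
}
```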
Volatile Load - Barriers With DMBs
JMM rule: A Volatile Load followed by any load/store can’t be re-ordered
Litmus test
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
LDR W0,[X1] | LDR W0,[X1] ;
DMB SY | MOV W2,#1 ;
MOV W2,#1 | DMB SY ;
STR W2,[X3] | STR W2,[X3] ;
exists
(0:X0=1 /\ 1:X0=1)
Results
Test LB Allowed
States 3
0:X0=0; 1:X0=0;
0:X0=0; 1:X0=1;
0:X0=1; 1:X0=0;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:X0=1 /\ 1:X0=1)
Observation LB Never 0 3
Volatile Loads – Can Barriers Be Replaced By LDAR?
Success! LDAR provides the right guarantee
Litmus test
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
LDAR W0,[X1] | LDR W0,[X1] ;
| MOV W2,#1 ;
MOV W2,#1 | DMB SY ;
STR W2,[X3] | STR W2,[X3] ;
exists
(0:X0=1 /\ 1:X0=1)
Results
Test LB Allowed
States 3
0:X0=0; 1:X0=0;
0:X0=0; 1:X0=1;
0:X0=1; 1:X0=0;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:X0=1 /\ 1:X0=1)
Observation LB Never 0 3
Performance Study
- Scaling CPU Cores / Simultaneous Multithreading (SMT)
Applying Cook Book Recipes to SPECJBB
Bigger is Better
[Chart: y-axis from 0.9 to 1.1; x-axis core or thread count (1, 4, Max); series lse, wolse, wdmb]
Core or Thread Count | With LSE; With LDAR (baseline) | Without LSE; With LDAR | With LSE; With DMB
1                    | 1.00                           | 0.97                   | 0.92
4                    | 1.00                           | 1.00                   | 0.95
Max                  | 1.00                           | 1.01                   | 1.00
• Fences/Barriers (e.g. DMB ST, DMB LD, DMB SY)
• Atomics/LSE (e.g. LDREX/STREX or CAS)
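For readers less familiar with the CAS pattern mentioned above, here is a minimal Java sketch of a compare-and-set retry loop using `java.util.concurrent.atomic.AtomicInteger`. The class is hypothetical and only illustrates the pattern; whether the JVM implements `compareAndSet` with a single atomic instruction or with an exclusive load/store loop depends on the JVM and the hardware.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical lock-free counter built on a compare-and-set retry loop.
final class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            // Succeeds only if no other thread changed the value in between; otherwise retry.
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    int get() {
        return value.get();
    }

    // Runs 'threads' workers that each increment the counter 'perThread' times.
    static int demo(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 100_000));
    }
}
```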
Single Core Performance
The Quest and Guarantee of Sequential Consistency
Hardware improvements measured on Java micro-benchmarks (OpenJDK JDK11):
• Object/memory allocations up to 2.4x faster
• Object/array initializations up to 5x faster
  – Smart issuing and cost reduction of SW barriers (i.e. DMB) required by Arm’s relaxed memory model
• Copy chars up to 1.6x faster
• New atomic instructions improve locking throughput and contention latency by up to 2x
Profile of org.openjdk.bench.vm.compiler.generated.StoreAfterStore_testAllocAndZeroStore_jmhTest::testAllocAndZeroStore_avgt_jmhStub
Cortex-A72          | Code                                | Neoverse N1
0.21%               | dmb ishst ;*new                     | 0.00% (0 cycles)
7.73% (~3.5 cycles) | ldr x11, [sp,#8]                    |
1.75%               | ldr w17, [x11,#12] ;*getfield       | 0.06%
                    | mov x2, x0                          |
0.51%               | ldp w0, w18, [x11,#16] ;*getfield   | 0.11%
0.42%               | ldp w3, w1, [x11,#24] ;*getfield    |
Ares Single Core Performance
Hardware improvements measured on SPECJBB (OpenJDK JDK11):
• Neoverse N1 CPU improves performance from Cortex-A72 by 1.7x
Software improvements measured on SPECJBB:
• JDK11 improves performance vs JDK8 on Arm by up to 14%
Resources
http://g.oswego.edu/dl/jmm/cookbook.html
https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
http://hg.openjdk.java.net/code-tools/jmh-jdk-microbenchmarks/file/92c55597888e/README.md
http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf
Appendix
The herd+diy Toolsuite
The tool suite supports and provides a formal consistency model for all of the following:
- Arm, IBM Power, Intel x86
- Nvidia GPUs
- C/C++
- Linux C
This means that a user can experiment with the concurrency implemented at all these levels and
generate systematic families of tests to probe implementations.
diy.inria.fr
A Store Barrier Litmus Test
[Figure: event diagrams labelled with the cycles "PodWW Rfe PodRR Fre" and "Fre PodWR Fre PodWR"]
A litmus test source has three main sections:
The initial state defines the initial values of registers and memory locations. Initialisation to zero may be omitted.
The code section defines the code to be run concurrently — above there are two threads.
Yes we know, our X86 assembler syntax is a mistake.
The final condition applies to the final values of registers and memory locations.
Executing the model: herd
litmus test + cat model -> herd -> Is this behavior allowed by the cat model? Yes/No
The herd tool allows a user to execute a formal model, written in the cat language.
Given a litmus test and a cat model, herd runs the litmus test against the cat model:
herd tries to determine whether the model allows the final state given in the test to be reached.
Running tests on hardware: litmus
litmus test -> litmus on HW -> Is this behavior observed on HW? Yes/No
The litmus tool allows a user to run a litmus test against hardware.
The tool gathers all the final states that were observed on hardware during multiple runs of the test.
We can then compare the output of herd and litmus, to check whether they are in accord.
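One simple way to phrase "in accord" is that every final state observed on hardware is also allowed by the model. The Java sketch below takes that reading as an assumption. It does not invoke herd or litmus; it only compares two sets of state strings, and the sample strings in `main` are copied from the output shown on the earlier slides purely as example data.

```java
import java.util.Set;

// Hypothetical comparison of herd (model) states against litmus (hardware) states.
final class CompareHerdLitmus {
    // True when the hardware never showed a final state that the model forbids.
    static boolean inAccord(Set<String> allowedByModel, Set<String> observedOnHw) {
        return allowedByModel.containsAll(observedOnHw);
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("1:X0=0; 1:X2=0;", "1:X0=0; 1:X2=1;", "1:X0=1; 1:X2=1;");
        Set<String> observed = Set.of("1:X0=0; 1:X2=0;", "1:X0=0; 1:X2=1;", "1:X0=1; 1:X2=1;");
        System.out.println("in accord: " + inAccord(allowed, observed));
    }
}
```

Under this reading, hardware that shows fewer states than the model allows is still in accord, while a single observed state outside the allowed set signals a disagreement worth investigating.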
Generating tests: diy
configuration file (~cat model) -> diy -> litmus tests
The diy tool allows a user to generate interesting families of litmus tests.
It takes as input a configuration file, where a user should list the features of interest to them.
We can use families of diy-generated tests to run validation campaigns,
comparing the cat model and prototypes.
The Cloud to Edge Infrastructure Foundation
for a World of 1T Intelligent Devices
Thank You!

Applying Concurrency Cookbook Recipes to SPEC JBB

  • 1. Jade Alglave – Architecture and Technology Monica Beckwith – Infrastructure LOB; Advanced Server Team Applying Concurrency Cookbook Recipes to SPEC JBB
  • 2. 2 © 2019 Arm Limited About Us Dr. Jade Alglave • Memory model architect at Arm • Co-developer and maintainer of the herd+diy toolsuite with Luc Maranget (INRIA, France) • Co-developer and maintainer of the Linux kernel memory model Monica Beckwith Managed runtime performance architect at Arm • Experience with OpenJDK HotSpot JIT, GC • Experience with JMM and with strong and weakly ordered architecture such as x86-64, SPARC, Arm64 and (very briefly) PPC64
  • 3. 3 © 2019 Arm Limited What We Will Cover Today Introduction to – Memory Models (Java Relaxed) Performance Methodology using Litmus Tests and Tools Performance Analysis and Measurement using Java Micro-Benchmark Harness (JMH) Performance Study on Scaling CPU Cores and Simultaneous Multithreading (SMT)
  • 4. 4 © 2019 Arm Limited What We Will NOT Cover Today Details of – Java Relaxed memory model JMH benchmarking SPEC JBB 2015 benchmarking CPU cores and simultaneous multithreading (SMT)
  • 5. 5 © 2019 Arm Limited Memory Models - What value can a load read?
  • 6. 6 © 2019 Arm Limited Multi-threaded hardware with shared memory structure A multi-threaded, concurrency-aware program Processor Threads and Software Threads … The Ideal Concurrent World of Hardware and Software * This drawing is heavily inspired by “timethreads“ concept in Doug Lea’s ‘Concurrent Programming in Java: Design Principles and Patterns, Second Edition ‘ Object 2Object 1 Thread 1 LockThread 2 help Thread1 Threadn Shared Memory W R W R * This drawing is heavily inspired by ‘A Tutorial Introduction to the ARM and POWER Relaxed Memory Models’ by Sewell et. al.
  • 7. 7 © 2019 Arm Limited Sequentially Consistent Shared Memory Execution Order == Program Order == Sequential Order Object 2 Object 1 Thread 1 Lock Thread 2 help =+ Thread 1 Thread n Shared Memory W R W R Timeline (Program Order)Timeline (Single Global Execution Order) A Sequentially Consistent Machine • No local reordering • Writes become visible simultaneously to all threads
  • 8. 8 © 2019 Arm Limited Sequential Consistency in Practice Store Buffering Example Initially, X and Y are 0 in memory; foo and bar are local (register) variables: p0 p1 a: X = 1; c: Y = 1; b: foo = Y; d: bar = X; What are the permissible values for foo and bar? On Sequential Consistency, they are the values reachable by interleavings: {a,b,c,d} {c,d,a,b} {a,c,b,d} Therefore we cannot have foo and bar both equal to 0.
  • 9. 9 © 2019 Arm Limited The Real Concurrent World of Hardware Multi Processor Threads/Cores with Tiered Memory Structure CPUn CPU0 L1D$ L2 LLC L1 I$ CPU1 L2 L2 Usually shared between all cores Could be multi-threaded (SMT) Usually private Memory Controller DDR Banks L1D$L1 I$ L1D$L1 I$ IO Can have Load and Store buffers Can have out of order execution
  • 10. 10 © 2019 Arm Limited Strong Models based on Total Store Ordering (TSO) CPUn CPU0 L1 D$ L2 LLC L1 I$ CPU1 L2 L2 Memory Controller L1 D$ L1 I$ L1 D$ L1 I$ IO Life In The Real World Without Sequential Consistency Relaxed vs Strong Memory Model Strong Memory Models Weaker Memory Models X86, SPARC POWER, Arm v7 • A thread can see it’s own write before other threads • All other threads see the write simultaneously: Multiple Copy Atomic Model • Local reordering is allowed • All threads are not guaranteed to see the write simultaneously: Not Multiple Copy Atomic Model
  • 11. 11 © 2019 Arm Limited Strong Models based on Total Store Ordering (TSO) CPUn CPU0 L1 D$ L2 LLC L1 I$ CPU1 L2 L2 Memory Controller L1 D$ L1 I$ L1 D$ L1 I$ IO Life In The Real World Without Sequential Consistency Relaxed vs Strong Memory Model Weaker Memory Models X86, SPARC ARM v8 • A thread can see it’s own write before other threads • All other threads see the write simultaneously: Multiple Copy Atomic Model • Local reordering is allowed • All threads are guaranteed to see the write simultaneously: Multiple Copy Atomic Model
  • 12. 12 © 2019 Arm Limited • Can we reason about our concurrent programs following Sequential Consistency? • Probably if we had a formal, preferably executable, memory models to ensure that we understand the guarantees given by architectures and programming languages. • Here’s where Jade would come in talking about her cool tools that allow programmers to explore the consequences of a given memory model or generate vast families of litmus tests to run against hardware.litmus tests Going Back To Our Store Buffer Example herd litmus testcat model Is this behavior allowed by the cat model? Yes/No litmus on HW litmus test Is this behavior observed on HW? Yes/No litmus test configuration file (~cat model) diy diy.inria.fr
  • 13. 13 © 2019 Arm Limited X86 SB {x=0; y=0;} P0 | P1 ; MOV [x],$1 | MOV [y],$1 ; MOV EAX,[y] | MOV EAX,[x] ; exists (0:EAX=0 / 1:EAX=0) Hardware architecture and test name Initial state (x and y are shared memory location) Thread names Sequence of instructions displayed as columns Question: can we observe this final state of given that x=0; y=0? Store Buffer Litmus Test on a TSO Hardware
  • 14. Armed With Knowledge. On TSO hardware, can we observe the final state of foo=0 and bar=0, given that X=0; Y=0? ... ... Yes! All production architectures allow the outcome where both foo and bar equal 0. So, what do we do? … Use mfence as needed.
  • 15. Performance Methodology - Using Litmus Tests and Tools To Avoid Barriers Wherever Possible
  • 16. The What & The Why Of Barriers / Fences • Barriers ensure ordering properties • Barriers enforce strong order • Barriers (when inserted correctly) restore sequential consistency • Barriers can be expensive. Data Memory Barriers on Arm: DMB SY (full system); DMB ST (wait for stores to complete); DMB LD (wait only for loads to complete)
  • 17. Normal Load-Stores, No Barriers. Litmus test: AArch64 MP { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | LDR W0,[X1] ; STR W0,[X1] | LDR W2,[X3] ; MOV W2,#1 | ; STR W2,[X3] | ; exists (1:X0=1 /\ 1:X2=0) Check for any reorder: check if X0 = 1 and X2 = 0 can exist on P1.
  • 18. Normal Stores, Load Barrier. Litmus test: AArch64 MP+DMB.LD { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | LDR W0,[X1] ; STR W0,[X1] | DMB LD ; MOV W2,#1 | LDR W2,[X3] ; STR W2,[X3] | ; exists (1:X0=1 /\ 1:X2=0) Check for Store reorder
  • 19. Load & Store Barriers. Litmus test: AArch64 MP+DMB.LD+DMB.ST { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | LDR W0,[X1] ; STR W0,[X1] | DMB LD ; DMB ST | LDR W2,[X3] ; MOV W2,#1 | ; STR W2,[X3] | ; exists (1:X0=1 /\ 1:X2=0) Generated assembler: #START _litmus_P1 ldr w4,[x1] dmb ld ldr w5,[x2] #START _litmus_P0 mov w7,#1 str w7,[x0] dmb st mov w6,#1 str w6,[x2] Results: Test MP+DMB.ST+DMB.LD Allowed Histogram (3 states) 499999:>1:X0=0; 1:X2=0; 20 :>1:X0=0; 1:X2=1; 499981:>1:X0=1; 1:X2=1; No Witnesses Positive: 0, Negative: 1000000 Condition exists (1:X0=1 /\ 1:X2=0) is NOT validated Hash=4d15dccdb1da0ce51fac17dea068d047 Observation MP+DMB.ST+DMB.LD Never 0 1000000
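A rough Java analogue of the message-passing test with both barriers, offered as a sketch rather than as what the authors ran: `data` and `flag` stand in for x and y, the fields are plain (non-volatile), and the ordering comes only from `VarHandle.storeStoreFence()` on the writer and `VarHandle.loadLoadFence()` on the reader (both are static methods available since Java 9). The class name and counting loop are hypothetical. Which AArch64 instruction HotSpot emits for each fence depends on the JDK version, so the mapping to DMB ST and DMB LD is an assumption to check in a disassembly.

```java
import java.lang.invoke.VarHandle;

/** Hypothetical harness mirroring the MP litmus test with one fence per thread. */
class MessagePassingFences {
    static int data, flag;          // plain fields, like the normal STR/LDR in the test
    static int seenFlag, seenData;  // read by the main thread only after join()

    static int countForbidden(int iterations) {
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            data = 0;
            flag = 0;
            Thread p0 = new Thread(() -> {
                data = 1;
                VarHandle.storeStoreFence(); // keep the data store ahead of the flag store
                flag = 1;
            });
            Thread p1 = new Thread(() -> {
                int f = flag;
                VarHandle.loadLoadFence();   // keep the flag load ahead of the data load
                int d = data;
                seenFlag = f;
                seenData = d;
            });
            p0.start();
            p1.start();
            try {
                p0.join();
                p1.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            // The litmus condition: flag observed as 1 but data still 0.
            if (seenFlag == 1 && seenData == 0) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("flag=1, data=0 outcomes: " + countForbidden(10_000));
    }
}
```

With both fences in place the count is expected to stay at 0, matching the "Never" observation on the slide; removing either fence makes the outcome permitted, though a small sample may still not show it.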
  • 20. But Aren’t Barriers Expensive? Acquire-Release (implicit barrier) semantics: one-way barriers. • LDAR: all loads and stores that are after an LDAR in program order, … must be observed after the LDAR • STLR: all loads and stores preceding an STLR …, must be observed before the STLR. Thinking about lock-free?
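In Java, one-way ordering of this kind is exposed through `VarHandle` access modes. The following is a minimal sketch with a hypothetical class, assuming a plain `data` field published through a `flag` field that is only accessed with `setRelease` and `getAcquire`. On AArch64, HotSpot can implement these modes with STLR and LDAR, but the exact instruction selection is an assumption that depends on the JDK build and flags.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Hypothetical one-slot mailbox using release/acquire instead of full barriers. */
class AcquireReleaseSketch {
    int data;  // plain field, ordered only by the release/acquire pair on flag
    int flag;  // accessed exclusively through FLAG below

    static final VarHandle FLAG;
    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findVarHandle(AcquireReleaseSketch.class, "flag", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Writer side: the data store may not move below the release store. */
    void publish(int value) {
        data = value;
        FLAG.setRelease(this, 1);
    }

    /** Reader side: the data load may not move above the acquire load. Returns -1 if nothing was published. */
    int tryConsume() {
        if ((int) FLAG.getAcquire(this) == 1) {
            return data;
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        AcquireReleaseSketch box = new AcquireReleaseSketch();
        Thread writer = new Thread(() -> box.publish(42));
        writer.start();
        int seen;
        while ((seen = box.tryConsume()) == -1) {
            Thread.onSpinWait(); // spin until the flag is observed
        }
        writer.join();
        System.out.println("consumed " + seen);
    }
}
```

The design choice here is that only the flag carries ordering: a reader that observes the flag is guaranteed to see the data written before it, without paying for a two-way barrier on either side.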
  • 21. Performance Analysis and Measurement - Using Java Micro-Benchmarking Harness (JMH)
  • 22. JMM Rule 1 for Volatile Stores. Order 1: Normal/Volatile Load or Normal/Volatile Store. Order 2: Volatile Store. Result: Can’t Reorder. Any load/store (normal or volatile) followed by a ‘volatile store’ can’t be reordered.
  • 23. Volatile Stores - Barriers With DMBs. JMM rule: any load/store followed by a volatile store can’t be reordered. Litmus test: { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | LDR W0,[X1] ; STR W0,[X1] | DMB LD ; LDR W2,[X1] | LDR W2,[X3] ; DMB SY | ; STR W2,[X3] | ; exists (1:X0=1 /\ 1:X2=0) Results: Test MP Allowed States 3 1:X0=0; 1:X2=0; 1:X0=0; 1:X2=1; 1:X0=1; 1:X2=1; No Witnesses Positive: 0 Negative: 3 Condition exists (1:X0=1 /\ 1:X2=0) Observation MP Never 0 3
  • 24. Volatile Stores - Can Barriers Be Replaced By STLR? JMM rule: any load/store followed by a volatile store can’t be reordered. Litmus test: { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | LDR W0,[X1] ; STR W0,[X1] | DMB LD ; MOV W2,#1 | LDR W2,[X3] ; STLR W2,[X3] | ; exists (1:X0=1 /\ 1:X2=0) Results: Test MP Allowed States 3 1:X0=0; 1:X2=0; 1:X0=0; 1:X2=1; 1:X0=1; 1:X2=1; No Witnesses Positive: 0 Negative: 3 Condition exists (1:X0=1 /\ 1:X2=0) Observation MP Never 0 3
  • 25. JMM Rule 2 for Volatile Stores • A ‘volatile store’ followed by any normal load/store CAN be reordered. • A ‘volatile store’ followed by any volatile load/store CANNOT be reordered. Order 1: Volatile Store. Order 2: Normal Load or Normal Store: Can Reorder. Order 2: Volatile Load or Volatile Store: Can’t Reorder.
  • 26. Volatile Stores – Can Barriers Be Replaced By STLR? STLR doesn’t guarantee that a subsequent volatile load/store will not be reordered. JMH test code: static class TestNormalLoadPostVolatileStores { volatile int intField1; int intNorm1; public TestNormalLoadPostVolatileStores() { intField1 = 32; intNorm1 = intField1; } } Litmus test: { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; LDR W0,[X1] | LDR W0,[X1] ; MOV W2,#1 | MOV W2,#1 ; STLR W2,[X3] | STR W2,[X3] ; exists (0:X0=1 /\ 1:X0=1) [Figure: Positive Event Structure]
  • 27. Volatile Stores – Can Barriers Be Replaced By STLR? Success! STLR + LDAR of volatiles provide that guarantee. Litmus test: { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; LDR W0,[X1] | LDAR W0,[X1] ; MOV W2,#1 | MOV W2,#1 ; STLR W2,[X3] | STR W2,[X3] ; exists (0:X0=1 /\ 1:X0=1) Results: Test LB Allowed States 3 0:X0=0; 1:X0=0; 0:X0=0; 1:X0=1; 0:X0=1; 1:X0=0; No Witnesses Positive: 0 Negative: 3 Condition exists (0:X0=1 /\ 1:X0=1) Observation LB Never 0 3
  • 28. Volatile Stores – JMH Profiles. Success! STLR + LDAR of volatiles provide that guarantee. Load Acquire – Store Release pair: stlr w11, [x10] ;*putfield intField1 ; add x10, x2, #0xc ldar w11, [x10] ;*getfield intField1 Data Memory Barrier (inner shareability domain): str w10, [x2,#12] dmb ish ;*putfield intField1 ldr w11, [x2,#12] dmb ishld ;*getfield intField1 36% faster on max SMT count!!
  • 29. JMM Rule 1 for Volatile Loads. Order 1: Volatile Load. Order 2: Normal/Volatile Load or Normal/Volatile Store. Result: Can’t Reorder. A ‘volatile load’ followed by any load/store (normal or volatile) can’t be reordered.
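The three JMM rules stated on the slides so far can be collected into one small lookup, sketched below in Java. The enum and method names are hypothetical. The slides only state the cases involving a volatile first or second access as listed; the remaining cases (normal followed by normal, and normal followed by a volatile load) are filled in here as "can reorder" following the JSR-133 Cookbook listed under Resources, and monitor enter/exit is left out.

```java
/** Hypothetical encoding of the volatile reordering rules from the slides. */
class JmmReorderRules {
    enum Access { NORMAL_LOAD, NORMAL_STORE, VOLATILE_LOAD, VOLATILE_STORE }

    /** Returns true if a compiler or CPU may reorder 'first' with a later 'second'. */
    static boolean canReorder(Access first, Access second) {
        // Rule 1 for volatile stores: nothing moves below a later volatile store.
        if (second == Access.VOLATILE_STORE) {
            return false;
        }
        // Rule 1 for volatile loads: nothing later moves above a volatile load.
        if (first == Access.VOLATILE_LOAD) {
            return false;
        }
        // Rule 2 for volatile stores: a later volatile load may not pass it
        // (a later volatile store was already rejected above).
        if (first == Access.VOLATILE_STORE && second == Access.VOLATILE_LOAD) {
            return false;
        }
        // Remaining cases, per the JSR-133 Cookbook: reordering is permitted.
        return true;
    }

    public static void main(String[] args) {
        for (Access first : Access.values()) {
            for (Access second : Access.values()) {
                System.out.println(first + " then " + second + ": "
                        + (canReorder(first, second) ? "can reorder" : "can't reorder"));
            }
        }
    }
}
```

Reading the table this way makes the later slides easier to follow: each "can't reorder" cell is a place where the JIT must emit either a DMB or an LDAR/STLR instruction on Arm.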
  • 30. Volatile Load - Barriers With DMBs. JMM rule: a volatile load followed by any load/store can’t be reordered. Litmus test: { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; LDR W0,[X1] | LDR W0,[X1] ; DMB SY | MOV W2,#1 ; MOV W2,#1 | DMB SY ; STR W2,[X3] | STR W2,[X3] ; exists (0:X0=1 /\ 1:X0=1) Results: Test LB Allowed States 3 0:X0=0; 1:X0=0; 0:X0=0; 1:X0=1; 0:X0=1; 1:X0=0; No Witnesses Positive: 0 Negative: 3 Condition exists (0:X0=1 /\ 1:X0=1) Observation LB Never 0 3
  • 31. Volatile Loads – Can Barriers Be Replaced By LDAR? Success! LDAR provides the right guarantee. Litmus test: { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; LDAR W0,[X1] | LDR W0,[X1] ; | MOV W2,#1 ; MOV W2,#1 | DMB SY ; STR W2,[X3] | STR W2,[X3] ; exists (0:X0=1 /\ 1:X0=1) Results: Test LB Allowed States 3 0:X0=0; 1:X0=0; 0:X0=0; 1:X0=1; 0:X0=1; 1:X0=0; No Witnesses Positive: 0 Negative: 3 Condition exists (0:X0=1 /\ 1:X0=1) Observation LB Never 0 3
  • 32. Performance Study - Scaling CPU Cores / Simultaneous Multithreading (SMT)
  • 33. Applying Cook Book Recipes to SPECJBB (Bigger is Better). [Chart: relative score, y-axis from 0.9 to 1.1, at core or thread counts 1, 4 and Max; series lse, wolse, wdmb.] Table columns: Core or Thread Count | With LSE; With LDAR (baseline) | Without LSE; With LDAR | With LSE; With DMB. Rows: 1 | 1.00 | 0.97 | 0.92. 4 | 1.00 | 1.00 | 0.95. Max | 1.00 | 1.01 | 1.00. • Fences/Barriers (e.g. DMB ST, DMB LD, DMB SY) • Atomics/LSE (e.g. LDREX/STREX or CAS)
  • 34. Single Core Performance: The Quest and Guarantee of Sequential Consistency. Hardware improvements measured on Java micro-benchmarks (OpenJDK JDK11): • Object/memory allocations up to 2.4x faster • Object/array initializations up to 5x faster, thanks to smart issuing and cost reduction of SW barriers (i.e. DMB) required by Arm’s relaxed memory model • Copy chars up to 1.6x faster • New atomic instructions improve locking throughput and contention latency by up to 2x. [Profile excerpt, columns Cortex-A72 | Code | Neoverse N1:] 0.21% dmb ishst ;*new (0 cycles) 0.00% 7.73% (~3.5 cycles) ldr x11, [sp,#8] 1.75% ldr w17, [x11,#12];*getfield 0.06% mov x2, x0 0.51% ldp w0, w18,[x11,#16];*getfield 0.11% 0.42% ldp w3, w1, [x11,#24];*getfield Benchmark: org.openjdk.bench.vm.compiler.generated.StoreAfterStore_testAllocAndZeroStore_jmhTest::testAllocAndZeroStore_avgt_jmhStub
  • 35. Ares Single Core Performance. Hardware improvements measured on SPECJBB (OpenJDK JDK11): • Neoverse N1 CPU improves performance over Cortex-A72 by 1.7x. Software improvements measured on SPECJBB: • JDK11 improves performance vs JDK8 on Arm by up to 14%
  • 36. Resources: http://g.oswego.edu/dl/jmm/cookbook.html https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf http://hg.openjdk.java.net/code-tools/jmh-jdk-microbenchmarks/file/92c55597888e/README.md http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf
  • 37. Appendix
  • 38. The herd+diy Toolsuite. The tool suite provides formal consistency models for all of the following: Arm, IBM Power and Intel x86; Nvidia GPUs; C/C++; Linux C. This means that a user can experiment with the concurrency implemented at all these levels and generate systematic families of tests to probe implementations. diy.inria.fr
  • 39. A Store Barrier Litmus Test. [Diagram edge labels: PodWW, Rfe, PodRR, Fre, PodWR.] A litmus test source has three main sections: • The initial state defines the initial values of registers and memory locations. Initialisation to zero may be omitted. • The code section defines the code to be run concurrently; above there are two threads. Yes we know, our X86 assembler syntax is a mistake. • The final condition applies to the final values of registers and memory locations.
  • 40. Executing the model: herd. [Diagram: litmus test + cat model → herd → “Is this behavior allowed by the cat model?” Yes/No.] The herd tool allows a user to execute a formal model, written in the cat language. Given a litmus test and a cat model, herd runs the litmus test against the cat model: herd determines whether the model allows the final state given in the test to be reached.
  • 41. Running tests on hardware: litmus. [Diagram: litmus test → litmus on HW → “Is this behavior observed on HW?” Yes/No.] The litmus tool allows a user to run a litmus test against hardware. The tool gathers all the final states that were observed on hardware during multiple runs of the test. We can then compare the output of herd and litmus, to check whether they are in accord.
  • 42. Generating tests: diy. [Diagram: configuration file (~cat model) → diy → litmus tests.] The diy tool allows a user to generate interesting families of litmus tests. It takes as input a configuration file, where a user should list the features of interest to them. We can use families of diy-generated tests to run validation campaigns, comparing the cat model and prototypes.
  • 43. The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices Thank You!