Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

HETEROGENEOUS PARTICLE BASED SIMULATION
Takahiro Harada, AMD

PARTICLE BASED SIMULATION ON THE GPU
• Large number of particles
• Particles with identical size
– Work granularity is almost the same
– Good for wide SIMD architectures
Harada et al. 2007

PARTICLE BASED SIMULATION
• Collision
• Integration
• An acceleration structure is used for efficient collision detection
– Uniform grid → suited to the GPU
– Less divergence
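Collision force: $f_{collide} = \sum_j f_{ij}$
Equations of motion: $\frac{dv}{dt} = \frac{f}{m}$, $\frac{dx}{dt} = v$
Time stepping: $v \mathrel{+}= \frac{f}{m}\,\Delta t$, $x \mathrel{+}= v\,\Delta t$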
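
The collision and integration steps above can be sketched in a few lines. This is a minimal 1D illustration under stated assumptions, not the implementation from the talk: the slides do not give the form of the pair force, so the linear spring in `pair_force`, the stiffness `k`, and the names `Particle`, `collide_force`, and `integrate` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    x: float  # position
    v: float  # velocity
    m: float  # mass
    r: float  # radius


def pair_force(pi, pj, k=10.0):
    """Hypothetical contact force on pi from pj: a linear spring on the overlap."""
    d = pi.x - pj.x
    overlap = pi.r + pj.r - abs(d)
    if overlap <= 0.0 or d == 0.0:
        return 0.0
    return k * overlap * (1.0 if d > 0.0 else -1.0)


def collide_force(i, particles):
    """Collision force on particle i: the sum of pair forces over all other particles."""
    return sum(pair_force(particles[i], pj) for j, pj in enumerate(particles) if j != i)


def integrate(p, f, dt):
    """Update velocity from the force first, then position from the new velocity."""
    p.v += f / p.m * dt
    p.x += p.v * dt


particles = [Particle(0.0, 0.0, 1.0, 1.0), Particle(1.5, 0.0, 1.0, 1.0)]
# Compute all forces before moving anything, so every pair sees the same positions.
forces = [collide_force(i, particles) for i in range(len(particles))]
for p, f in zip(particles, forces):
    integrate(p, f, dt=0.1)
```

Computing every force before integrating any particle mirrors the slide's split into a collision stage followed by an integration stage.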

DIVERGENCE ON SIMD
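[Figure: SIMD lanes 0 to 7 taking different branches]
void Kernel()
{
    if (A)
        FuncA();    // lanes where A holds
    else if (B)
        FuncB();    // lanes where A fails and B holds
    else
        FuncC();    // all remaining lanes
}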
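
A rough way to see the cost of divergence is to count how many branch paths a group of lanes has to execute one after another. The sketch below is a simplified model, assuming every branch costs the same and that the hardware runs each taken path once for the whole group; `simd_passes` and `lane_utilization` are hypothetical names.

```python
def simd_passes(lane_branches):
    """Count serialized passes: one per distinct branch taken by any lane."""
    return len(set(lane_branches))


def lane_utilization(lane_branches):
    """Fraction of lane slots doing useful work, assuming equal-cost branches."""
    return 1.0 / simd_passes(lane_branches)


divergent = ["A", "B", "C", "A", "A", "B", "C", "A"]  # 8 lanes, 3 paths
coherent = ["A"] * 8                                   # 8 lanes, 1 path

print(simd_passes(divergent), lane_utilization(divergent))
print(simd_passes(coherent), lane_utilization(coherent))
```

With three paths in flight, each lane is idle for two of the three passes in this model, which is why uniform work maps better to wide SIMD.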

PARTICLE BASED SIMULATION ON THE GPU
• Particle collision using a uniform grid
[Figure: SIMD lanes 0 to 7 all following the same path; a 3 by 3 block of grid cells, Cell0 to Cell8]
void Kernel()
{
    prepare();
    // Every lane visits the same nine cells in the same order.
    collide(Cell0);
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}
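
The nine-cell loop in the kernel above can be illustrated on the CPU with a small 2D example. This is a sketch under assumptions the slides do not state: the cell size is taken to be one particle diameter, the grid is a dictionary rather than a GPU buffer, and `build_grid` and `colliding_pairs` are hypothetical names.

```python
import math
from collections import defaultdict


def build_grid(positions, cell_size):
    """Map each occupied cell (ix, iy) to the indices of the particles inside it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(math.floor(x / cell_size), math.floor(y / cell_size))].append(i)
    return grid


def colliding_pairs(positions, radius):
    """Find overlapping pairs by checking only the 3 by 3 cells around each particle."""
    cell_size = 2.0 * radius  # assumption: one cell per particle diameter
    grid = build_grid(positions, cell_size)
    pairs = set()
    for i, (x, y) in enumerate(positions):
        cx, cy = math.floor(x / cell_size), math.floor(y / cell_size)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    if j > i and math.dist(positions[i], positions[j]) < 2.0 * radius:
                        pairs.add((i, j))
    return pairs


positions = [(0.0, 0.0), (0.9, 0.0), (5.0, 5.0), (1.1, 0.0)]
pairs = colliding_pairs(positions, radius=0.5)
```

Because all particles have the same radius, every particle does the same nine cell lookups, which is the uniform work pattern the slide describes.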

MIXED PARTICLE SIMULATION
• Not only small particles
• Difficult for GPUs
– Large particles interact with small particles
– Large-large collision

CHALLENGE
• Non-uniform work granularity
– Small-small (SS) collision: uniform, GPU
– Large-large (LL) collision: non-uniform, CPU
– Large-small (LS) collision: non-uniform, CPU

FUSION ARCHITECTURE
• CPU and GPU are:
– On the same die
– Much closer
– Able to share data efficiently
• CPU and GPU are good at different kinds of work
– CPU: serial computation, conditional branches
– GPU: parallel computation
• Work can be dispatched by type:
– Serial work with varying granularity → CPU
– Parallel work with uniform granularity → GPU

MIXED PARTICLE SIMULATION
• Benefits from the Fusion architecture
– Different kinds of work in one simulation
– CPU and GPU work together
– They share data

METHOD

TWO SIMULATIONS
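• Small particles
• Large particles
[Figure: two pipelines. Small particles: Build Acc. Structure, SS Collision, S Integration, with Position, Velocity, Force, and Grid buffers. Large particles: Build Acc. Structure, LL Collision, L Integration, with Position, Velocity, and Force buffers. LS Collision sits between the two.]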

CLASSIFY BY WORK GRANULARITY
• Small particles
• Large particles
[Figure: pipeline stages labeled Uniform Work or Non Uniform Work. Stages: Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision. Buffers: Position, Velocity, Force, Grid (small particles); Position, Velocity, Force (large particles).]

CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR
• Small particles
• Large particles
[Figure: the same pipeline stages labeled GPU or CPU. Stages: Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision. Buffers: Position, Velocity, Force, Grid (small particles); Position, Velocity, Force (large particles).]

DATA SHARING
• Small particles
• Large particles
• The grid and the small particle data have to be shared with the CPU for LS collision
– Allocated as zero-copy buffers
[Figure: GPU and CPU pipelines. The Position, Velocity, Grid, and Force buffers are shown next to LS Collision.]

SYNCHRONIZATION
• Small particles
• Large particles
• The grid and the small particle data have to be shared with the CPU for LS collision
– Allocated as zero-copy buffers
[Figure: GPU and CPU pipelines with Synchronization points added. Stages: Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision.]

VISUALIZING WORKLOADS
• Small particles
• Large particles
• Grid construction can be moved to the end of the pipeline
– Unbalanced workload
[Figure: GPU and CPU workloads. Stages: Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, Synchronization, L Integration.]

LOAD BALANCING
• Small particles
• Large particles
• To get better load balancing
– The sync is there to pass the force buffer filled by the CPU to the GPU
– Move the LL collision after the sync
[Figure: GPU and CPU workloads after rebalancing. Stages: Build Acc. Structure, SS Collision, S Integration, LL Collision, Synchronization, L Integration, LS Collision.]
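
The effect of moving LL collision after the sync can be checked with a small timing model. The stage durations below are made-up illustrative numbers, not measurements from the talk, and `frame_time` is a hypothetical helper that assumes a single sync point at which both processors wait for each other.

```python
def frame_time(cpu_before_sync, cpu_after_sync, gpu_before_sync, gpu_after_sync):
    """Frame time when both processors wait for each other at one sync point."""
    sync_point = max(sum(cpu_before_sync), sum(gpu_before_sync))
    return sync_point + max(sum(cpu_after_sync), sum(gpu_after_sync))


# Hypothetical stage durations in milliseconds (not measured values).
BUILD, SS, S_INT = 1.0, 2.0, 2.0   # GPU stages
LS, LL, L_INT = 3.0, 2.0, 1.0      # CPU stages

# LL collision before the sync: the GPU idles while the CPU finishes LL and LS.
before = frame_time([LL, LS], [L_INT], [BUILD, SS], [S_INT])

# LL collision after the sync: only LS has to finish before the GPU continues.
after = frame_time([LS], [LL, L_INT], [BUILD, SS], [S_INT])

print(before, after)
```

In this model the GPU only depends on the LS forces, so taking LL collision off the path to the sync shortens the wait without changing the total CPU work.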

[Figure legend: GPU work, CPU work]

MULTITHREADING (4 THREADS)

FURTHER OPTIMIZATION
[Figure: workloads for GPU, CPU0, CPU1, and CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, Synchronization.]
1. Not optimized for “Llano”, which is a 4-core CPU
– Only 2 CPU cores were used
– Can use 2 more cores for LS collision
2. LL collision was not optimized
– The CPU waits while the GPU is constructing the grid
– Use the CPU to improve SS collision

OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
• Cannot simply split the work by large particle index
– More than one large particle can collide with the same small particle
– Would have to lock the memory on write → inefficient
• Prepare a local buffer for each thread
– The buffer stores the force on small particles
– Lock free
• Local buffers are merged into one
[Figure: large particles L0 and L1 and small particles S0 and S1, handled by Thread0, Thread1, and Thread2]
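
The per-thread buffer idea can be sketched as follows. This is an illustration of the data layout only, under several assumptions: the 1D spring force in `contact_force`, the shared `SMALL_RADIUS`, and the function names are hypothetical, and Python threads stand in for the CPU cores (in CPython they will not give a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor

SMALL_RADIUS = 0.1  # hypothetical radius shared by all small particles


def contact_force(large_x, large_r, small_x, k=1.0):
    """Hypothetical 1D spring force pushing a small particle out of a large one."""
    d = small_x - large_x
    overlap = large_r + SMALL_RADIUS - abs(d)
    if overlap <= 0.0:
        return 0.0
    return k * overlap * (1.0 if d >= 0.0 else -1.0)


def ls_collide_chunk(large_chunk, small_xs):
    """Accumulate forces from one thread's large particles into a private buffer."""
    local = [0.0] * len(small_xs)
    for large_x, large_r in large_chunk:
        for s, small_x in enumerate(small_xs):
            local[s] += contact_force(large_x, large_r, small_x)
    return local


def ls_collide(large, small_xs, n_threads=3):
    """Split large particles across threads, then merge the per-thread buffers."""
    chunks = [large[t::n_threads] for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        buffers = list(pool.map(lambda c: ls_collide_chunk(c, small_xs), chunks))
    # Merge step: no locks were needed because no buffer was shared while writing.
    return [sum(per_small) for per_small in zip(*buffers)]


large = [(0.0, 2.0), (3.0, 2.0)]   # (position, radius)
small_xs = [1.0, 10.0]
forces = ls_collide(large, small_xs)
```

The first small particle is touched by both large particles, which is exactly the case that would need a lock if the threads wrote to one shared force buffer.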

[Figure: workloads for GPU, CPU0, CPU1, and CPU2 with a single LS Collision stage. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, Synchronization.]

[Figure: workloads for GPU, CPU0, CPU1, and CPU2 with three LS Collision stages, each followed by a Merge. Other stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., Synchronization.]

OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
• Spatially coherent memory layout improves cache utilization
• As particles move, spatial locality decreases

SPATIAL SORT
• Sort particles by spatial location to improve cache utilization
– Z curve
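
A Z-curve ordering can be computed by interleaving the bits of each particle's grid cell indices and sorting by the result. The sketch below is a 2D illustration that assumes non-negative coordinates; `morton2d`, `z_order`, and the `cell_size` parameter are hypothetical names, and the slides do not say how the talk's implementation computes its keys.

```python
import math


def morton2d(ix, iy, bits=16):
    """Interleave the bits of two non-negative cell indices into a Z-curve key."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code


def z_order(positions, cell_size):
    """Return particle indices ordered along the Z curve of their grid cells."""
    def key(i):
        x, y = positions[i]
        return morton2d(math.floor(x / cell_size), math.floor(y / cell_size))
    return sorted(range(len(positions)), key=key)


positions = [(1.5, 1.5), (0.5, 0.5), (1.5, 0.5), (0.5, 1.5)]
order = z_order(positions, cell_size=1.0)
```

Reordering the particle arrays by `order` places particles from nearby cells close together in memory, which is the locality the slide is after.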

CHOOSE SORT
• Requirements
– A full sort was over budget
– A full sort is not a must
– Sorting is an optional computation for performance improvement
– Incremental sort
– Use multiple threads
• Solution
– Used a generalized “odd-even transposition sort”

BLOCK TRANSITION SORT
• Generalized “odd-even transposition sort”
• Instead of sorting 2 adjacent elements, sort 2 adjacent blocks
• Iterate until convergence
• Use one thread to sort 2 adjacent blocks
– 6 blocks for 3 threads
– Radix sort
[Figure: odd-even transposition sort compared with block transition sort]
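
The block-based scheme described above can be sketched sequentially. This is an illustration under assumptions, not the talk's code: Python's built-in `sorted` stands in for the per-pair radix sort, the pairs are processed in a loop rather than by one thread each, and `block_transition_pass`, `block_transition_sort`, and `max_passes` are hypothetical names.

```python
def block_transition_pass(keys, block, phase):
    """Sort each pair of adjacent blocks, starting at block index `phase` (0 or 1)."""
    changed = False
    for lo in range(phase * block, len(keys) - block, 2 * block):
        hi = min(lo + 2 * block, len(keys))
        merged = sorted(keys[lo:hi])  # stand-in for the radix sort of two blocks
        if merged != keys[lo:hi]:
            keys[lo:hi] = merged
            changed = True
    return changed


def block_transition_sort(keys, block, max_passes=None):
    """Alternate even and odd phases until nothing moves or the pass budget runs out."""
    if len(keys) <= block:  # a single block has no neighbor to pair with
        keys.sort()
        return 0
    passes = 0
    while max_passes is None or passes < max_passes:
        even = block_transition_pass(keys, block, 0)
        odd = block_transition_pass(keys, block, 1)
        passes += 1
        if not (even or odd):
            break
    return passes


keys = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 11, 10]

partial = list(keys)
block_transition_sort(partial, block=2, max_passes=1)  # incremental: one pass only

full = list(keys)
passes = block_transition_sort(full, block=2)          # iterate until convergence
```

With 12 keys and a block size of 2 there are 6 blocks, so each phase has at most 3 independent pairs, matching the "6 blocks for 3 threads" layout. Capping `max_passes` gives the incremental behavior: the array gets closer to sorted each frame without paying for a full sort.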

[Figure: workloads for GPU, CPU0, CPU1, and CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., three LS Collision stages, three Merge stages, Synchronization.]

[Figure: workloads for GPU, CPU0, CPU1, and CPU2 with small particle sorting added. Stages: Build Acc. Structure, SS Collision, S Integ., three LS Collision stages, three Merge stages, LL Coll., L Integ., Synchronization, three S Sorting stages, Synchronization.]

DEMO
[Demo legend: GPU work, CPU work]

CONCLUSIONS
• Realized a simulation that handles variable-sized particles by leveraging the best features of both the CPU and GPU on AMD’s Fusion architecture
– The CPU is used for work with non-uniform compute granularity
– The GPU is used for highly parallel work
• Memory sharing between the CPU and GPU is the key to efficiency
– Avoids wasteful memory copies

REFERENCE
• Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)
• Justin Hensley, Takahiro Harada, Chapter X, OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)
