HETEROGENEOUS PARTICLE BASED SIMULATION
Takahiro Harada, AMD
2 Harada, Heterogeneous Particle-based Simulation
PARTICLE BASED SIMULATION ON THE GPU
‱ Large number of particles
‱ Particles of identical size
– Work granularity is almost the same for every particle
– Well suited to wide SIMD architectures
(Harada et al. 2007)
PARTICLE BASED SIMULATION
‱ Collision
$$f_{\mathrm{collide}} = \sum_j f_{ij}$$
‱ Integration
$$\frac{dv}{dt} = \frac{f}{m}, \qquad \frac{dx}{dt} = v$$
Discretized with time step $\Delta t$:
$$v \mathrel{+}= \frac{f}{m}\,\Delta t, \qquad x \mathrel{+}= v\,\Delta t$$
‱ An acceleration structure is used for efficient collision detection
– Uniform grid → suited to the GPU
– Less divergence
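The integration step can be sketched in code. This is a minimal sketch, assuming 2D particles; the types `Vec2` and `Particle` and the function `integrate` are hypothetical names introduced here, not taken from the slides. It updates the velocity first and then the position with the new velocity, in the order the slide lists the two updates.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical 2D vector and particle types, introduced only for this sketch.
struct Vec2 { float x, y; };

struct Particle {
    Vec2 pos;    // x
    Vec2 vel;    // v
    Vec2 force;  // f, accumulated by the collision stage
    float mass;  // m
};

// One integration step: v += (f / m) * dt, then x += v * dt.
// The force is cleared afterwards so the next collision pass starts from zero.
void integrate(std::vector<Particle>& particles, float dt) {
    for (Particle& p : particles) {
        p.vel.x += p.force.x / p.mass * dt;
        p.vel.y += p.force.y / p.mass * dt;
        p.pos.x += p.vel.x * dt;
        p.pos.y += p.vel.y * dt;
        p.force = Vec2{0.0f, 0.0f};
    }
}
```

Every particle runs exactly the same arithmetic with no branches, which is why this stage maps well to a wide SIMD machine.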
DIVERGENCE ON SIMD
(Figure: SIMD lanes 0 to 7 executing the kernel below.)
void Kernel()
{
    if (A)
        FuncA();    // lanes where A holds
    else if (B)
        FuncB();    // lanes where A fails and B holds
    else
        FuncC();    // all remaining lanes
}
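The cost of divergence can be illustrated with a small model. This is a sketch under a simplifying assumption that is not stated on the slide: the lanes of a SIMD group run in lockstep, so the group has to execute every branch that at least one lane takes, one after another, with the other lanes masked off. The name `serializedPasses` and the encoding of branches as integers are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <set>

constexpr int kLanes = 8;  // lane count matching the figure (lanes 0 to 7)

// branchPerLane[i] says which branch lane i takes:
// 0 for FuncA, 1 for FuncB, 2 for FuncC.
// Under the lockstep assumption, the group needs one pass per distinct branch.
int serializedPasses(const std::array<int, kLanes>& branchPerLane) {
    std::set<int> distinct(branchPerLane.begin(), branchPerLane.end());
    return static_cast<int>(distinct.size());
}
```

In this model, a group whose lanes all take the same branch needs one pass, while a group that mixes all three branches needs three.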
PARTICLE BASED SIMULATION ON THE GPU
‱ Particle collision using a uniform grid
(Figure: SIMD lanes 0 to 7 executing the kernel below.)
void Kernel()
{
    prepare();
    // Visit the 9 grid cells laid out below, in the same order on every lane.
    collide(Cell0);
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}
(Figure: the 3×3 block of grid cells.)
Cell0 Cell1 Cell2
Cell3 Cell4 Cell5
Cell6 Cell7 Cell8
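The nine `collide` calls can be sketched as a loop over a 3×3 cell neighborhood. This is a minimal CPU sketch, not the author's GPU kernel. It assumes a 2D domain, positions inside the grid, a cell size of at least one particle diameter, and a simple linear penalty force; the names `UniformGrid`, `coord`, `build`, and `collide`, and the `stiffness` parameter, are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// Uniform grid over [0, dim * cellSize)^2. cellSize is assumed to be at least
// one particle diameter, so a colliding pair lies in the same or adjacent cells.
struct UniformGrid {
    int dim;
    float cellSize;
    std::vector<std::vector<int>> cells;  // particle indices stored per cell

    UniformGrid(int dim_, float cellSize_)
        : dim(dim_), cellSize(cellSize_),
          cells(static_cast<std::size_t>(dim_) * dim_) {}

    // Cell coordinate along one axis, clamped to the grid.
    int coord(float v) const {
        int c = static_cast<int>(std::floor(v / cellSize));
        return c < 0 ? 0 : (c >= dim ? dim - 1 : c);
    }

    void build(const std::vector<Vec2>& pos) {
        for (auto& c : cells) c.clear();
        for (int i = 0; i < static_cast<int>(pos.size()); ++i)
            cells[coord(pos[i].y) * dim + coord(pos[i].x)].push_back(i);
    }
};

// Sum a repulsive penalty force on particle i from overlapping neighbors
// found in the 3x3 block of cells around it.
Vec2 collide(int i, const std::vector<Vec2>& pos, const UniformGrid& grid,
             float radius, float stiffness) {
    Vec2 f{0.0f, 0.0f};
    const int cx = grid.coord(pos[i].x), cy = grid.coord(pos[i].y);
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const int nx = cx + dx, ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= grid.dim || ny >= grid.dim) continue;
            for (int j : grid.cells[ny * grid.dim + nx]) {
                if (j == i) continue;
                const float rx = pos[i].x - pos[j].x;
                const float ry = pos[i].y - pos[j].y;
                const float dist = std::sqrt(rx * rx + ry * ry);
                const float overlap = 2.0f * radius - dist;
                if (overlap > 0.0f && dist > 0.0f) {
                    f.x += stiffness * overlap * rx / dist;
                    f.y += stiffness * overlap * ry / dist;
                }
            }
        }
    }
    return f;
}
```

Because every particle visits the same fixed number of cells, the control flow is nearly identical across particles; only the number of neighbors per cell varies.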
MIXED PARTICLE SIMULATION
‱ The simulation contains large particles as well as small ones
‱ Difficult for GPUs
– Large particles interact with small particles
– Large-large collision
CHALLENGE
‱ Non-uniform work granularity
– Small-small (SS) collision: uniform → GPU
– Large-large (LL) collision: non-uniform → CPU
– Large-small (LS) collision: non-uniform → CPU
FUSION ARCHITECTURE
‱ The CPU and GPU are:
– On the same die
– Much closer to each other
– Able to share data efficiently
‱ The CPU and GPU are good at different kinds of work
– CPU: serial computation, conditional branching
– GPU: parallel computation
‱ Work can be dispatched accordingly:
– Serial work with varying granularity → CPU
– Parallel work with uniform granularity → GPU
MIXED PARTICLE SIMULATION
‱ Benefits from the Fusion architecture
– The simulation contains different kinds of work
– The CPU and GPU work together
– They share data
METHOD
TWO SIMULATIONS
‱ Small particles
‱ Large particles
(Figure: two pipelines. Small particles: Build Acc. Structure, SS Collision, S Integration, with buffers Position, Velocity, Force, Grid. Large particles: Build Acc. Structure, LL Collision, L Integration, with buffers Position, Velocity, Force. An LS Collision stage connects the two.)
CLASSIFY BY WORK GRANULARITY
‱ Small particles
‱ Large particles
(Figure: the pipeline stages grouped under the labels Uniform Work and Non Uniform Work. Stages: Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision, Build Acc. Structure. Buffers: Position, Velocity, Force, Grid for small particles; Position, Velocity, Force for large particles.)
CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR
‱ Small particles
‱ Large particles
(Figure: the same grouping, now labeled GPU and CPU. Stages: Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision, Build Acc. Structure. Buffers: Position, Velocity, Force, Grid for small particles; Position, Velocity, Force for large particles.)
DATA SHARING
‱ Small particles
‱ Large particles
‱ The grid and the small-particle data have to be shared with the CPU for LS collision
– Allocated as zero-copy buffers
(Figure: the GPU and CPU pipelines, with the small-particle Position, Velocity, Grid, and Force buffers drawn next to LS Collision. Other stages: Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, Build Acc. Structure.)
SYNCHRONIZATION
‱ Small particles
‱ Large particles
‱ The grid and the small-particle data have to be shared with the CPU for LS collision
– Allocated as zero-copy buffers
(Figure: the GPU and CPU pipelines with two Synchronization points added. Stages: Build Acc. Structure, SS Collision, S Integration, LS Collision, Build Acc. Structure, LL Collision, L Integration. Shared buffers: Position, Velocity, Grid, Force.)
VISUALIZING WORKLOADS
‱ Small particles
‱ Large particles
‱ Grid construction can be moved to the end of the pipeline
– The workload is unbalanced
(Figure: timeline with rows GPU and CPU. Stages: Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, Synchronization, L Integration.)
LOAD BALANCING
‱ Small particles
‱ Large particles
‱ To get better load balancing:
– The sync exists to pass the force buffer filled by the CPU to the GPU
– Move the LL collision after the sync
(Figure: timeline with rows GPU and CPU. Stages: Build Acc. Structure, SS Collision, S Integration, LL Collision, Synchronization, L Integration, LS Collision.)
(Figure: chart labeled GPUWork and CPUWork.)
MULTI THREADING (4 THREADS)
FURTHER OPTIMIZATION
(Figure: timeline with rows GPU, CPU0, CPU1, CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, Synchronization.)
1. Not optimized for “Llano”, which has a 4-core CPU
– Only 2 CPU cores were used
– 2 more cores can be used for LS collision
2. LL collision was not optimized
– The CPU waits while the GPU is constructing the grid
– Use the CPU to improve SS collision
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
‱ The work cannot simply be split by large-particle index
– More than one large particle can collide with the same small particle
– Writes would have to lock the memory → inefficient
‱ Prepare a local buffer for each thread
– The buffer stores the forces on small particles
– Lock free
‱ The local buffers are then merged into one
(Figure: large particles L0, L1 and small particles S0, S1, handled by Thread0, Thread1, Thread2.)
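The per-thread buffer scheme can be sketched as follows. This is a minimal sketch, assuming the contacts of each large particle have already been found and reduced to a scalar force per small particle; the names `SmallContact` and `accumulateSmallForces` are hypothetical, and forces are 1D for brevity. Each thread owns a range of large particles and writes only to its own buffer, so two large particles touching the same small particle never write to the same memory.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical contact record: the force one large particle applies to
// small particle `small` (scalar for brevity).
struct SmallContact { int small; float force; };

// perLarge[l] lists the small particles touched by large particle l.
std::vector<float> accumulateSmallForces(
        const std::vector<std::vector<SmallContact>>& perLarge,
        int numSmall, int numThreads) {
    // One private force buffer per thread: no locks needed while accumulating.
    std::vector<std::vector<float>> local(
        numThreads, std::vector<float>(numSmall, 0.0f));

    std::vector<std::thread> workers;
    const std::size_t n = perLarge.size();
    for (int t = 0; t < numThreads; ++t) {
        // Contiguous range of large particles owned by thread t.
        const std::size_t begin = n * t / numThreads;
        const std::size_t end = n * (t + 1) / numThreads;
        workers.emplace_back([&perLarge, &local, t, begin, end] {
            for (std::size_t l = begin; l < end; ++l)
                for (const SmallContact& c : perLarge[l])
                    local[t][c.small] += c.force;
        });
    }
    for (std::thread& w : workers) w.join();

    // Merge step: sum the per-thread buffers into one force buffer.
    std::vector<float> merged(numSmall, 0.0f);
    for (const std::vector<float>& buf : local)
        for (int s = 0; s < numSmall; ++s) merged[s] += buf[s];
    return merged;
}
```

The trade-off is extra memory (one buffer per thread) and a merge pass, in exchange for removing all locking from the inner loop. The merge here is serial; it could itself be split across threads by small-particle range.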
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
(Figure: timeline with rows GPU, CPU0, CPU1, CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, Synchronization.)
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
(Figure: timeline with rows GPU, CPU0, CPU1, CPU2. LS Collision and Merge each appear three times, and there are two Synchronization points. Other stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ.)
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
‱ A spatially coherent memory layout improves cache utilization
‱ As particles move, spatial locality decreases
SPATIAL SORT
‱ Sort particles by spatial location to improve cache utilization
– Z curve
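A Z-curve ordering is commonly computed by interleaving the bits of the cell coordinates into a Morton key and sorting by that key. This is a minimal 2D sketch, assuming 16-bit cell coordinates; the slides do not say how the key is built, and the names `spreadBits`, `mortonKey`, `Cell`, and `sortAlongZCurve` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Spread the lower 16 bits of v so that bit i moves to bit 2 * i.
std::uint32_t spreadBits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// 2D Morton key: x bits land in even positions, y bits in odd positions.
std::uint32_t mortonKey(std::uint32_t cellX, std::uint32_t cellY) {
    return spreadBits(cellX) | (spreadBits(cellY) << 1);
}

struct Cell { std::uint32_t x, y; };

// Order cells along the Z curve so that cells close in space tend to be
// close in memory.
void sortAlongZCurve(std::vector<Cell>& cells) {
    std::sort(cells.begin(), cells.end(), [](const Cell& a, const Cell& b) {
        return mortonKey(a.x, a.y) < mortonKey(b.x, b.y);
    });
}
```

For a 2×2 block of cells, the resulting order is (0,0), (1,0), (0,1), (1,1), which traces the "Z" shape the curve is named after. A 3D simulation would interleave three coordinates instead of two.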
CHOOSE SORT
‱ Requirements
– A full sort was over budget
– A full sort is not a must: sorting is an optional computation that only improves performance
– The sort should be incremental
– The sort should use multiple threads
‱ Solution
– Used a generalized “odd-even transition sort” (commonly called odd-even transposition sort)
BLOCK TRANSITION SORT
‱ A generalization of “odd-even transition sort”
‱ Instead of sorting 2 adjacent elements, sort 2 adjacent blocks
‱ Iterate until convergence
‱ Use one thread to sort each pair of adjacent blocks
– 6 blocks for 3 threads
– Radix sort
(Figure: odd-even transition sort compared with block transition sort.)
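The block scheme can be sketched as alternating even and odd phases over pairs of adjacent blocks. This is a minimal single-threaded sketch under stated assumptions: it sorts plain integer keys, it uses `std::sort` on each pair where the slide names radix sort, and the names `blockTransitionPhase` and `blockTransitionSort` are hypothetical. Pairs within one phase cover disjoint ranges, so each pair could be handed to its own thread (6 blocks give 3 pairs in an even phase).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One phase over pairs of adjacent blocks. parity 0 pairs blocks (0,1),
// (2,3), ...; parity 1 pairs blocks (1,2), (3,4), ...
// Assumes blockSize > 0. Returns true if any pair was out of order.
bool blockTransitionPhase(std::vector<int>& keys, std::size_t blockSize,
                          int parity) {
    bool changed = false;
    const std::size_t n = keys.size();
    for (std::size_t begin = parity * blockSize; begin + blockSize < n;
         begin += 2 * blockSize) {
        const std::size_t end = std::min(begin + 2 * blockSize, n);
        auto first = keys.begin() + static_cast<std::ptrdiff_t>(begin);
        auto last = keys.begin() + static_cast<std::ptrdiff_t>(end);
        if (!std::is_sorted(first, last)) {
            std::sort(first, last);  // stand-in for the radix sort on the slide
            changed = true;
        }
    }
    return changed;
}

// Run at most maxPhases phases, alternating parity, and stop early once two
// consecutive phases change nothing. Returns the number of phases executed.
// Assumes the keys span at least two blocks.
int blockTransitionSort(std::vector<int>& keys, std::size_t blockSize,
                        int maxPhases) {
    int quietPhases = 0;
    int phase = 0;
    while (phase < maxPhases && quietPhases < 2) {
        const bool changed = blockTransitionPhase(keys, blockSize, phase % 2);
        quietPhases = changed ? 0 : quietPhases + 1;
        ++phase;
    }
    return phase;
}
```

Capping `maxPhases` per frame gives the incremental behavior the requirements ask for: the array gets closer to sorted each frame without paying for a full sort, and a nearly sorted array converges after a few phases.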
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
(Figure: timeline with rows GPU, CPU0, CPU1, CPU2. LS Collision and Merge each appear three times, and there are two Synchronization points. Other stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ.)
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
(Figure: timeline with rows GPU, CPU0, CPU1, CPU2. LS Collision, Merge, and S Sorting each appear three times, and there are three Synchronization points. Other stages: Build Acc. Structure, SS Collision, S Integ., LL Coll., L Integ.)
DEMO
(Figure: chart labeled GPUWork and CPUWork.)
CONCLUSIONS
‱ Realized a simulation that handles variable-sized particles by leveraging the best features of both the CPU and the GPU on AMD’s Fusion architecture
– The CPU is used for work with non-identical compute granularity
– The GPU is used for highly parallel work
‱ Memory sharing between the CPU and the GPU is the key to efficiency
– It avoids wasteful memory copies
REFERENCE
‱ Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)
‱ Justin Hensley, Takahiro Harada, Chapter X OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)

Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)
