SlideShare a Scribd company logo
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
DOI:10.5121/ijdps.2015.6501 1
A PROGRESSIVE MESH METHOD FOR PHYSICAL
SIMULATIONS USING LATTICE BOLTZMANN
METHOD ON SINGLE-NODE MULTI-GPU
ARCHITECTURES
Julien Duchateau1
, François Rousselle1
, Nicolas Maquignon1
, Gilles Roussel1
,
Christophe Renaud1
1
Laboratoire d’Informatique, Signal, Image de la Côte d’Opale
Université du Littoral Côte d’Opale, Calais, France
ABSTRACT
In this paper, a new progressive mesh algorithm is introduced in order to perform fast physical simulations
by the use of a lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. This algorithm is
able to mesh automatically the simulation domain according to the propagation of fluids. This method can
also be useful in order to perform several types of physical simulations. In this paper, we associate this
algorithm with a multiphase and multicomponent lattice Boltzmann model (MPMC–LBM) because it is
able to perform various types of simulations on complex geometries. The use of this algorithm combined
with the massive parallelism of GPUs[5] allows to obtain very good performance in comparison with the
staticmesh method used in literature. Several simulations are shown in order to evaluate the algorithm.
KEYWORDS
Progressive mesh, Lattice Boltzmann method,single-node multi-GPU, parallel computing.
1. INTRODUCTION
The lattice Boltzmann method (LBM) is a computational fluid dynamics (CFD) method. It is a
relatively recent technique which is able to approximate Navier-Stokes equations by a collision-
propagation scheme [1]. Lattice Boltzmann method however differs from standard approaches as
finite element method (FEM) or finite volume method (FVM) by its mesoscopic approach. It is an
interesting alternative which is able to simulate complex phenomena on complex geometries. Its
high parallelization makes also this method attractive in order to perform simulations on parallel
hardware. Moreover, the emergence of high-performance computing (HPC) architectures using
GPUs [5] is also a great interest for many researchers.
Parallelization is indeed an important asset of lattice Boltzmann method. However, perform
simulations on large complex geometries can be very costly in computational resources. This
paper introduces a new progressive mesh algorithm in order to perform physical simulations on
complex geometries by the use of a multiphase and multicomponent lattice Boltzmann method.
The algorithm is able to automatically mesh the simulation domain according to the propagation
of fluids. Moreover, the integration of this algorithm on single-node multi-GPU architecture is
also an important matter which is studied in this paper. This method is an interesting alternative
which has never been exploited at the best of our knowledge.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
2
Section 2 first describes the multiphase and multicomponent lattice Boltzmann method. It is able
to simulate the behavior of fluids with several physical states (phase) and it is also able to model
several fluids (component) interacting with each other. Section 3 presents then several recent
works involving lattice Boltzmann method on GPUs. Section 4 mostly concerns the main
contribution of this paper: the inclusion of a progressive mesh method in the simulation code. The
principles of the method and the definition of an adapted criterion are firstly introduced. The
integration on a single-node multi-GPU architecture is then described. An analysis concerning
performance is also studied in section 5. The conclusion and future works are finally presented in
the last section.
2. THE LATTICE BOLTZMANN METHOD
2.1. The Single relaxation time Bhatnagar-Gross-Krook (SRT-BGK) Boltzmann
equation
The lattice Boltzmann method is based on three main discretizations: space, time and velocities.
Velocity space is reduced to a finite number of well-defined vectors. Figures 1(a) and 1(b)
illustrate this discrete scheme for D2Q9 and D3Q19 model.
The simulation grid is therefore discretized as a Cartesian grid and calculation steps are achieved
on this entire grid. The discrete Boltzmann equation[1] with a single relaxation timeBhatnagar-
Gross-Krook (SRT-BGK) collision term is defined by the following equation:
݂௜ሺ‫ݔ‬ ൅ ݁௜, ‫ݐ‬ ൅ Δ௧ሻ െ ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ	
1
߬
ቀ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ െ ݂௜
௘௤
ሺ‫,ݔ‬ ‫ݐ‬ሻቁ (1)
݂௜
௘௤
ሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ	߱௜ߩሺ‫,ݔ‬ ‫ݐ‬ሻ ቆ1 ൅	
݁௜‫	ݑ‬
ܿ௦
ଶ
൅	
ሺ݁௜‫ݑ‬ሻଶ
2ܿ௦
ସ
െ	
‫ݑ‬ଶ
2ܿ௦
ଶ
ቇ (2)
ܿ௦
ଶ
ൌ 	
1
3
൬
Δ௫
Δ௧
൰
ଶ
(3)
The function ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ corresponds to the discrete density distribution function along velocity
vector ݁௜ at a position‫ݔ‬ and a time	‫.ݐ‬ The parameter ߬ corresponds to the relaxation time of the
simulation. The value ߩ is the fluid density and ‫ݑ‬ corresponds to the fluid velocity. Δ௫andΔ௧ are
the spatial and temporal steps of the simulation respectively. Parameters ‫ݓ‬௜ are weighting values
defined according to the lattice Boltzmann scheme and can be found in [1].Macroscopic
quantities as density ߩ and velocity ‫ݑ‬ are finally computed as follows:
(a) D2Q9 scheme (b) D3Q19 scheme
Figure 1: Example of Lattice Boltzmann schemes
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
3
ߩሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ 	 ෍ ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ
௜
(4)
ߩሺ‫,ݔ‬ ‫ݐ‬ሻ‫ݑ‬ሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ	 ෍ ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ݁௜
௜
(5)
2.2. Multiphase and Multi Component Lattice Boltzmann Model
Multiphase and multicomponent models (MPMC) allow performing complex simulations
involving several physical components. In this section, a MPMC-LBM model based on the work
achieved by Bao& Schaeffer [4] is presented.It includes several interaction forces based on
pseudo-potential. It is calculated as follows:
߰ఈ ൌ	ඨ
2ሺ‫݌‬ఈ െ ܿ௦
ଶߩఈሻ
ܿ௦
ଶ݃ఈఈ
(6)
The term ‫݌‬ఈ is the pressure term. It is calculated by the use of an equation of state as the Peng-
Robinson equation:
‫݌‬ఈ ൌ	
ߩఈܴఈܶఈ
1 െ ܾఈߩఈ
െ	
ܽఈߠሺܶఈሻߩఈ
ଶ
1 ൅ 2ܾఈ െ ܾଶߩଶ
(7)
Internal forces are then computed. The internal fluid interaction force is expressed as follows [2]
[3]:
‫ܨ‬ఈఈሺ‫ݔ‬ሻ ൌ 	െߚ
݃ఈ
2
ܿ௦
ଶ
߰ఈሺ‫ݔ‬ሻ ෍ ‫ݓ‬௜
௫ᇲ
߰ఈሺ‫ݔ‬ᇱሻሺ‫ݔ‬ᇱ
െ ‫ݔ‬ሻ െ	
1 െ ߚ
2
݃ఈ
2
ܿ௦
ଶ
߰ఈሺ‫ݔ‬ሻ‫ݓ‬௜߰ఈ
ଶሺ‫ݔ‬ᇱሻሺ‫ݔ‬ᇱ
െ ‫ݔ‬ሻ (8)
The valueߚ is a weighting term generally fixed to 1.16 according to [2] [3]. The inter-component
force is also introduced as follows [4]:
‫ܨ‬ఈఈᇲሺ‫ݔ‬ሻ ൌ 	െ	
݃ఈఈᇱ
2
ܿ௦
ଶ
߰ఈሺ‫ݔ‬ሻ ෍ ‫ݓ‬௜
௫ᇲ
߰ఈሺ‫ݔ‬ᇱሻሺ‫ݔ‬ᇱ
െ ‫ݔ‬ሻ (9)
Additional forces can be added into the simulation code as the gravity force, or a fluid-structure
interaction [3]. The incorporation of the force term is then achieved by a modifiedcollision
operator expressed as follows:
݂ఈ,௜ሺ‫ݔ‬ ൅ ݁௜, ‫ݐ‬ ൅ Δ௧ሻ െ ݂ఈ,௜ሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ	
1
߬
ቀ݂ఈ,௜ሺ‫,ݔ‬ ‫ݐ‬ሻ െ ݂ఈ,௜
௘௤
ሺ‫,ݔ‬ ‫ݐ‬ሻቁ ൅ 	Δ݂ఈ,௜ (10)
Δ݂ఈ,௜ ൌ 	݂ఈ,௜
௘௤
ሺߩఈ, ‫ݑ‬ఈ ൅ Δ‫ݑ‬ఈሻ െ	݂ఈ,௜
௘௤
ሺߩఈ, ‫ݑ‬ఈሻ (11)
Δ‫ݑ‬ఈ ൌ
‫ܨ‬ఈΔ௧
ߩఈ
(12)
Macroscopic quantities for each component are finally computed by the use of equations (4) and
(5).
3. LATTICE BOLTZMANN METHODS AND GPUS
The mass parallelism of GPUs has been quickly exploited in order to perform fast simulations[7]
[8] using lattice Boltzmann method. Recent works have shown that GPUs are also used with
multiphase and multicomponent models [16] [14]. The main aspects of GPU optimizations are
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
decomposed into several categories
overlap of memory transfers with computations ….
optimize global memory bandwidt
Concerning LBM, an adapted data structure such as the Structure of Array (SoA
studied and has proven to be efficient on GPU
Several access patterns are also
pattern, consists of using two calculation grids in GPU global memory in order to manage the
temporal and spatial dependency of the data (Equation
reading distribution functions from A and writing them to B, and reading from B
reciprocally. This pattern is commonly used and offers very goo
single GPU. Several techniques are however presented in literature in order to reduce
significantly the computational memory cost without loss of information such as grids
compression [6], Swap algorithm
technique is used in order to save memory due to spatial and temporal data dependency.
Recent works involving implementation of l
of several GPUs are also available. A first solution, proposed in
entire simulation domain into sub
LBM kernels on each sub-domain in parallel. CPU threads are used to handle each CUDA
context. Communications between sub
Zero-copy feature allows to perform efficient communications by a mapping between CPU and
GPU pointers. Data must however be read and written only once in order to obtain good
performance.
Some approaches have finally been proposed
constituted of multiple GPUs by the use of MPI in combination with CUDA
our case, we only dispose of one computing node with multiple GPUs thus we don't
these architectures in this paper.
4. A PROGRESSIVE MESH ALG
ON SINGLE-NODE MULTI-GPU
4.1. Motivation
Works described in the previous section consider that the entire simulation domain is
divided into subdomains according to the number of
subdomains are therefore calculated in parallel.
Figure 2: Division of the simulation domain: the entire domain is decomposed into subdomains according
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
decomposed into several categories [10] [9] as thread level parallelism, GPU memory acce
overlap of memory transfers with computations …. Data coalescence is needed in order to
optimize global memory bandwidth. This implies several conditions as described in [9
Concerning LBM, an adapted data structure such as the Structure of Array (SoA) has been well
studied and has proven to be efficient on GPU [7].
also described in the literature. The first one, named A
pattern, consists of using two calculation grids in GPU global memory in order to manage the
temporal and spatial dependency of the data (Equation (10)). Simulation steps alternate between
reading distribution functions from A and writing them to B, and reading from B and writing to A
This pattern is commonly used and offers very good performance [10] [11] [9
single GPU. Several techniques are however presented in literature in order to reduce
significantly the computational memory cost without loss of information such as grids
Swap algorithm [6] or A-A pattern technique [12]. In this paper, the A
technique is used in order to save memory due to spatial and temporal data dependency.
ks involving implementation of lattice Boltzmann method on a single-node composed
ilable. A first solution, proposed in [13] [17], consists in dividing the
simulation domain into subdomains according to the number of GPUs and performing
domain in parallel. CPU threads are used to handle each CUDA
context. Communications between sub-domains are performed using zero-copy memory transfers.
to perform efficient communications by a mapping between CPU and
GPU pointers. Data must however be read and written only once in order to obtain good
Some approaches have finally been proposed recently to perform simulations on several no
constituted of multiple GPUs by the use of MPI in combination with CUDA [19][18
our case, we only dispose of one computing node with multiple GPUs thus we don't
PROGRESSIVE MESH ALGORITHM FOR LATTICE BOLTZMANN METHOD
GPU ARCHITECTURES
in the previous section consider that the entire simulation domain is
divided into subdomains according to the number of GPUs, as shown on Figure 2. All
therefore calculated in parallel.
Figure 2: Division of the simulation domain: the entire domain is decomposed into subdomains according
to the number of GPUs.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
4
as thread level parallelism, GPU memory access,
Data coalescence is needed in order to
l conditions as described in [9].
) has been well
described in the literature. The first one, named A-B access
pattern, consists of using two calculation grids in GPU global memory in order to manage the
imulation steps alternate between
and writing to A
[10] [11] [9] on a
single GPU. Several techniques are however presented in literature in order to reduce
significantly the computational memory cost without loss of information such as grids
In this paper, the A-A pattern
technique is used in order to save memory due to spatial and temporal data dependency.
node composed
, consists in dividing the
domains according to the number of GPUs and performing
domain in parallel. CPU threads are used to handle each CUDA
copy memory transfers.
to perform efficient communications by a mapping between CPU and
GPU pointers. Data must however be read and written only once in order to obtain good
to perform simulations on several nodes
[18][21] [15]. In
our case, we only dispose of one computing node with multiple GPUs thus we don't focus on
OLTZMANN METHODS
in the previous section consider that the entire simulation domain is meshed and
Figure 2. All
Figure 2: Division of the simulation domain: the entire domain is decomposed into subdomains according
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
In this paper, a new approach is considered.
does not requires to be fully meshed at the beginning
new progressive mesh method
propagation of the simulated flui
beginning of the simulation (Figure 3(a))
propagation of the fluid as can be seen of Figure 3
the simulation geometry (Figure 3(c))
simulations. It is also a real advantage for an application on industrial structures mostly composed
of pipes or channels. It can indeed save a lot of memory and calculations
geometry used for the simulation.
Figure 3: Example of a 3D simu
created at the beginning of the simulation, (b) several subdomains are created following the propagation of
fluid, (c) all subdomains are created and completely adapt to the simulation g
The progressive mesh algorithm firstly needs the introduction of a
create a new subdomain to the simulation. This
existing subdomains. Calculations on
optimization factor.
4.2. Definition of a Criterion for the Progressive Mesh
The definition of a criterion is an
for the simulation. This criterion needs to represent eff
velocity seems like a good choice in order to define an efficient criterion
fluid velocity between two iterations
dispersion. Our criterion is therefore defined as follows
‖‫ܥ‬ఈሺ‫ݔ‬ሻ‖
The symbol ‖. ‖ଶstands for the Euclidean norm in this paper.
for all active subdomains on the
boundary, a new subdomain is created next to this boundary as shown on Figure 4.
generally fixed to 0 in this paper in order to
each subdomain.
(a) (b) (c)
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
approach is considered. For most simulations, the entire domain generally
fully meshed at the beginning of the simulation. We propose therefore a
method in order to dynamically create the mesh according to the
fluid. The idea consists in defining a first subdomain at the
(Figure 3(a)). Several subdomains can then be created following the
propagation of the fluid as can be seen of Figure 3(b). This method finally adapts automatically
(Figure 3(c)). This method is therefore applicable for any geometry and
. It is also a real advantage for an application on industrial structures mostly composed
It can indeed save a lot of memory and calculations according to the
geometry used for the simulation.
Figure 3: Example of a 3D simulation using the progressive mesh algorithm: (a) a first subdomain is
created at the beginning of the simulation, (b) several subdomains are created following the propagation of
fluid, (c) all subdomains are created and completely adapt to the simulation geometry.
The progressive mesh algorithm firstly needs the introduction of an adapted criterion in order to
w subdomain to the simulation. This new subdomain needs then to be connected to
Calculations on single-node multi-GPU architecture are finally
of a Criterion for the Progressive Mesh
n important aspect in order to efficiently create new subdomains
s criterion needs to represent efficiently the propagation of fluid. The fluid
ike a good choice in order to define an efficient criterion. The difference of the
fluid velocity between two iterations is considered in order to observe efficiently
Our criterion is therefore defined as follows for thecomponent	ߙ:
ሺ ሻ‖ଶ ൌ	‖‫ݑ‬ఈሺ‫,ݔ‬ ‫ݐ‬ ൅ Δ௧ሻ െ ‫ݑ‬ఈሺ‫,ݔ‬ ‫ݐ‬ሻ‖ଶ
stands for the Euclidean norm in this paper. This criterion needs to be calculated
subdomains on the boundaries. If the criterion exceeds anarbitrary threshold
boundary, a new subdomain is created next to this boundary as shown on Figure 4.The value
in this paper in order to detect any change of velocity on the boundaries of
(a) (b) (c)
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
5
simulations, the entire domain generally
We propose therefore a
create the mesh according to the
domain at the
Several subdomains can then be created following the
automatically to
This method is therefore applicable for any geometry and
. It is also a real advantage for an application on industrial structures mostly composed
according to the
lation using the progressive mesh algorithm: (a) a first subdomain is
created at the beginning of the simulation, (b) several subdomains are created following the propagation of
eometry.
criterion in order to
s then to be connected to
GPU architecture are finally an important
important aspect in order to efficiently create new subdomains
fluid. The fluid
he difference of the
efficiently the fluid
(13)
This criterion needs to be calculated
thresholdܵon a
The valueܵ is
n the boundaries of
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
Figure 4: The criterion ǁC_α (x) ǁ_2 is calculated on the boundary. If the criterion exceeds the threshold S
then a new subdomain is created next to the boundary.
4.3. Algorithm
This section describes the algorithm for the
model with the inclusion of our progressive mesh algorithm.
summarize the previous sections. The calculation of the criterion and the creation of new
subdomains are achieved at the last step of the algorithm in order to not disturb the simulation
process. Figure 5 describes our resulti
Figure 5: Algorithm for the multiphase and multicomponent Lattice Boltzmann model with the inclusion of
our progressive mesh method. For colors, please refer to the PDF version of this paper.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
ǁ_2 is calculated on the boundary. If the criterion exceeds the threshold S
then a new subdomain is created next to the boundary.
the algorithm for the multiphase and multicomponent lattice Boltzmann
model with the inclusion of our progressive mesh algorithm. It is also useful in order to
summarize the previous sections. The calculation of the criterion and the creation of new
subdomains are achieved at the last step of the algorithm in order to not disturb the simulation
describes our resulting algorithm.
Figure 5: Algorithm for the multiphase and multicomponent Lattice Boltzmann model with the inclusion of
our progressive mesh method. For colors, please refer to the PDF version of this paper.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
6
_2 is calculated on the boundary. If the criterion exceeds the threshold S
attice Boltzmann
It is also useful in order to
summarize the previous sections. The calculation of the criterion and the creation of new
subdomains are achieved at the last step of the algorithm in order to not disturb the simulation
Figure 5: Algorithm for the multiphase and multicomponent Lattice Boltzmann model with the inclusion of
our progressive mesh method. For colors, please refer to the PDF version of this paper.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
4.4. Integration on Single-Node Multi
Efficiency of inter-GPU communications is surely the most difficult task in order to obtain good
performance. Indeed, our simulations are composed of numerous subdomains which are a
dynamically. The repartition of GPUs to the different subdomains is an important factor of
optimization. An efficient assignment can have an important impact on the performance of the
simulation. Indeed, it can reduce the communication time between subdomains and so reduce the
simulation time.
4.4.1. Overlap Communications with Computations
Several data exchanges are needed for this type of model. The computation of interaction
inter-component ‫ܨ‬௘௫௧ implies to have access to neighboring values of the pseudo
propagation step of LBM also implies to communicate several distribution functions
GPUs (Figure 6). Aligned buffers ma
In order to obtain a simulation time as short as possible, it is necessary to overlap data transfer
with algorithm calculations. Indeed, overlapping computations and communications allows
obtain a significant performance gain by reducing the waiting
the computation process into 2 steps
Computations on the needed boundaries are firstly done. Communi
subdomains are also done while computing
performed simultaneously with calculations which allow
In most cases for lattice Boltzmann method, memory is
page-locked memory which allow go
[17][13] [15].A different approach
In most recent HPC architectures, several GPUs can be connected to the same PCIe. To improve
performance, Nvidia launched GPUDirect with CUDA 4.0.
Figure 6: Schematic example for communication of distribution functions in 2D: red arrows
corresponds to ݂௜ values to communicate between subdomains. For colors, please refer to the PDF
version of this paper.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
Node Multi-GPU Architecture
GPU communications is surely the most difficult task in order to obtain good
Indeed, our simulations are composed of numerous subdomains which are a
of GPUs to the different subdomains is an important factor of
optimization. An efficient assignment can have an important impact on the performance of the
simulation. Indeed, it can reduce the communication time between subdomains and so reduce the
Communications with Computations
needed for this type of model. The computation of interaction
implies to have access to neighboring values of the pseudo-potential. The
propagation step of LBM also implies to communicate several distribution functions
ligned buffers may be used for data transactions.
In order to obtain a simulation time as short as possible, it is necessary to overlap data transfer
with algorithm calculations. Indeed, overlapping computations and communications allows
significant performance gain by reducing the waiting time of data. The idea is to separate
omputation process into 2 steps: boundary calculations and interior
Computations on the needed boundaries are firstly done. Communications between neighboring
domains are also done while computing the interior. The different communications are thus
calculations which allow good efficiency.
attice Boltzmann method, memory is transferred via zero-copy transactions to
locked memory which allow good overlapping between communications and computations
different approach is studied in this paper concerning inter-GPU communications.
In most recent HPC architectures, several GPUs can be connected to the same PCIe. To improve
e, Nvidia launched GPUDirect with CUDA 4.0.This technology allows to perform
Figure 6: Schematic example for communication of distribution functions in 2D: red arrows
values to communicate between subdomains. For colors, please refer to the PDF
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
7
GPU communications is surely the most difficult task in order to obtain good
Indeed, our simulations are composed of numerous subdomains which are added
of GPUs to the different subdomains is an important factor of
optimization. An efficient assignment can have an important impact on the performance of the
simulation. Indeed, it can reduce the communication time between subdomains and so reduce the
needed for this type of model. The computation of interaction ‫ܨ‬௜௡௧ and
potential. The
propagation step of LBM also implies to communicate several distribution functions݂௜ between
In order to obtain a simulation time as short as possible, it is necessary to overlap data transfer
with algorithm calculations. Indeed, overlapping computations and communications allows to
time of data. The idea is to separate
calculations.
cations between neighboring
the interior. The different communications are thus
copy transactions to
od overlapping between communications and computations
GPU communications.
In most recent HPC architectures, several GPUs can be connected to the same PCIe. To improve
This technology allows to perform
Figure 6: Schematic example for communication of distribution functions in 2D: red arrows
values to communicate between subdomains. For colors, please refer to the PDF
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
Peer-to-Peer transfers and memory accesses between two compatible GPUs. The idea is to
perform data transfer using Peer-
zero-copy transactions for others. This method allows to communicate data by bypassing the use
of the CPU and therefore to accelerate the transfer (Figure
improves performance and the efficiency of the simulatio
Figure 7: GPUDirect technology (source Nvidia).
4.4.2 Optimization of Data Transfer between
The repartition of GPUs is an important factor of optimization for this type of applications.
Communications cost is generally a bottleneck for
exchanges between sub domains
associated with one GPU.The first
belonging to the same GPU. In this case, the communication cost is extremely low because
communications are performed on the same
concern communications between
however made between Peer-to-
goal to optimize dynamically the repartition of
For a new sub domain	‫,ܩ‬ the function
Where ‫′ܩ‬ denotes neighboring subdomains to
ߛሺ‫,ܩ‬ ‫ܩ‬ᇱሻ ൌ	ቐ0.5 ∗ ‫݂݋݁ݖ݅ݏ‬ሺ‫ݎ݂݁ݏ݊ܽݎݐ‬
‫݂݋݁ݖ݅ݏ‬ሺ‫ݎ݂݁ݏ݊ܽݎݐ‬ሻ
The function ߛሺ‫,ܩ‬ ‫ܩ‬ᇱሻ compares the different ways
subdomain and its neighbors. An arbitrary weighting
to-Peer communications. The function
The function ‫ܨ‬ሺ‫ܩ‬ሻ needs therefo
cost. This function is calculated for all available GPUs and the GPU with the minimum value is
assigned to this subdomain. In order to keep load balancing, all GPUs have to be assigned
dynamically and the same GPU could not be assigned two times as long as others GPUs are not
assigned. Figure 8 explains via a
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
Peer transfers and memory accesses between two compatible GPUs. The idea is to
-to-Peer data transactions for GPUs sharing the same I/O hub
copy transactions for others. This method allows to communicate data by bypassing the use
of the CPU and therefore to accelerate the transfer (Figure 7). The use of this type of transaction
improves performance and the efficiency of the simulation code.
Figure 7: GPUDirect technology (source Nvidia).
Data Transfer between GPUs
The repartition of GPUs is an important factor of optimization for this type of applications.
Communications cost is generally a bottleneck for multi-GPU simulations. Three ways of data
sub domains are defined. A first assumption assumes that one sub domain
The first way concerns communications between
In this case, the communication cost is extremely low because
communications are performed on the same GPU global memory. The second and the third ways
communications between sub domains belonging to different GPUs. A distinction is
-Peer exchanges and zero-copy exchanges. This section has for
goal to optimize dynamically the repartition of GPUs to new sub domains.
function ‫ܨ‬ is defined as follows:
‫ܨ‬ሺ‫ܩ‬ሻ ൌ	෍ ߛ
ீᇲ
ሺ‫,ܩ‬ ‫ܩ‬ᇱ
ሻ
denotes neighboring subdomains to ‫ܩ‬ and ߛሺ‫,ܩ‬ ‫ܩ‬ᇱ
ሻ is defined as follows:
0	݂݅	‫ܷܲܩ‬ሺ‫ܩ‬ሻ ൌ ‫ܷܲܩ‬ሺ‫ܩ‬ᇱ
ሻ
‫ݎ݂݁ݏ݊ܽݎݐ‬ሻ	݂݅	‫ܷܲܩ‬ሺ‫ܩ‬ሻ ് ‫ܷܲܩ‬ሺ‫ܩ‬ᇱሻܽ݊݀	‫ܷܲܩ‬ሺ‫ܩ‬ሻ	ܿܽ݊	ܲ2ܲ	‫ܷܲܩ‬ሺ
ሻ	݂݅	‫ܷܲܩ‬ሺ‫ܩ‬ሻ ് ‫ܷܲܩ‬ሺ‫ܩ‬ᇱሻܽ݊݀	‫ܷܲܩ‬ሺ‫ܩ‬ሻ	݊‫ܷܲܩ	ܲ2ܲ	ݐ݋‬ሺ‫ܩ‬ᇱ
ሻ
compares the different ways of communications between the new
subdomain and its neighbors. An arbitrary weighting value is included in order to promote Peer
Peer communications. The function ‫	ܨ‬performs the calculation of ߛfor all active neighbors.
needs therefore to be minimized in order to obtain the best communication
This function is calculated for all available GPUs and the GPU with the minimum value is
assigned to this subdomain. In order to keep load balancing, all GPUs have to be assigned
y and the same GPU could not be assigned two times as long as others GPUs are not
Figure 8 explains via a simple example the principle of this optimization.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
8
Peer transfers and memory accesses between two compatible GPUs. The idea is to
s sharing the same I/O hub and
copy transactions for others. This method allows to communicate data by bypassing the use
The use of this type of transaction
The repartition of GPUs is an important factor of optimization for this type of applications.
Three ways of data
sub domain is
concerns communications between sub domains
In this case, the communication cost is extremely low because
econd and the third ways
belonging to different GPUs. A distinction is
copy exchanges. This section has for
(14)
ሺ‫ܩ‬ᇱ
ሻ
ሻ
(15)
of communications between the new
is included in order to promote Peer-
for all active neighbors.
re to be minimized in order to obtain the best communication
This function is calculated for all available GPUs and the GPU with the minimum value is
assigned to this subdomain. In order to keep load balancing, all GPUs have to be assigned
y and the same GPU could not be assigned two times as long as others GPUs are not
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
5. RESULTS AND PERFORMAN
5.1. Hardware
8 NVIDIA Tesla C2050 graphics cards Fermi architecture based machine are used to perform
simulations. Table 1 describes some Tesla
communications for our architecture are also described in Figure 9.
Table 1: Tesla C2050 Hardware specifications
Figure 9: Peer-to-Peer communications accessibility for our architecture.
CUDA compute capability
Total amount of global memory
(14) Multiprocessors, (32) scalar processors/MP
GPU clock rate
L2 cache
Total amount of shared memory per block
Total number of registers available per block
Figure 8: Schematic example in 2D for the optimization of the repartition of GPUs. The function
‫ܨ‬ሺ‫ܩ‬ሻ is calculated for all available GPUs and the GPU which have the minimum value is chosen. For
colors, please refer to the PDF version of this paper.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
ESULTS AND PERFORMANCE
8 NVIDIA Tesla C2050 graphics cards Fermi architecture based machine are used to perform
simulations. Table 1 describes some Tesla C2050 hardware specifications.
communications for our architecture are also described in Figure 9.
Table 1: Tesla C2050 Hardware specifications
Peer communications accessibility for our architecture.
CUDA compute capability 2.0
Total amount of global memory 2687 MBytes
(14) Multiprocessors, (32) scalar processors/MP 448 CUDA cores
GPU clock rate 1147 MHz
L2 cache size 786432 bytes
Total amount of shared memory per block 49152 bytes
Total number of registers available per block 32768
Figure 8: Schematic example in 2D for the optimization of the repartition of GPUs. The function
is calculated for all available GPUs and the GPU which have the minimum value is chosen. For
colors, please refer to the PDF version of this paper.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
9
8 NVIDIA Tesla C2050 graphics cards Fermi architecture based machine are used to perform
C2050 hardware specifications. Peer-to-Peer
2.0
2687 MBytes
448 CUDA cores
1147 MHz
786432 bytes
49152 bytes
32768
Figure 8: Schematic example in 2D for the optimization of the repartition of GPUs. The function
is calculated for all available GPUs and the GPU which have the minimum value is chosen. For
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
10
5.2. Simulations
Two simulations are considered on large simulation domain in order to evaluate the performance
of our contribution. Both simulations include the use of two physical components. The geometry
however differs between these simulations. The first simulation is based on a simple geometry
composed of 1024*256*256 calculation cells where a fluid fills all simulation domains during the
simulation (Figure 10). The second simulation is based on a complex geometry composed of
1024*1024*128 calculations cells where the fluid moves within channels (Figure 11).
5.3. Performance
This section deals with the performance obtained by our method. A comparison between the
progressive mesh algorithm and the static mesh method generally used in literature is shown. The
optimization of the repartition of GPUs on subdomains is also studied. The performance metric
generally used for lattice Boltzmann method is the Million Lattice nodes Updates Per Second
(MLUPS). It is calculated as follows:
ܲ݁‫݂ݎ‬ெ௅௎௉ௌ ൌ
݀‫݁ݖ݅ݏ	݊݅ܽ݉݋‬ ∗ ݊‫ݏ݊݋݅ݐܽݎ݁ݐ݅	݂݋	ݎܾ݁݉ݑ‬
‫݁݉݅ݐ	݊݋݅ݐ݈ܽݑ݉݅ݏ‬
(16)
This classical approach generally used in literature in order to perform simulations consists in
equally dividing the simulation domain according to the number of GPUs. It offers generally
good performance as communications can be overlapped with calculations. The use of Peer-to-
Peer communications also has a beneficial effect on the performance, as shown on Figure 13.
Peer-to-Peer communications allow obtaining a performance gain between 8 and 12% according
Figure 10: A two-component leakage simulation on a simple geometry with a domain size of
1024*256*256 cells.
Figure 11: A two-component leakage simulation on a complex geometry composed of channels with a
domain size of 1024*1024*128 cells.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
to the number of GPUs used for the simulation described in Figure 10. Ze
communications offer a good scaling
of Peer-to-Peer communications, as shown on Figure 12.
The inclusion of the progressive mesh also has
performance. Sub domains of size 128*128*12
and 14 describes performance in terms of calculations and memory consumption for the
simulation presented on Figure 10. Note that the progressive mesh algorithm obtains excellent
performance at the beginning of the simulation. The addition of
simulation has for consequence a decrease of performance until the convergence of the
simulation. In this particular case, all simulation domain is meshed at the end of the simulation
shown on Figure 14, which leads to a
mesh. In terms of memory consumption, fast apparitions of news
lead to have the entire simulation domain in memory after a few iterations.
Figure 12: Comparison of performance between Peer
communications for the simulation shown on Figure 10.
Figure 13: Comparison of performance between the progressive mesh method and the static mesh
method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
to the number of GPUs used for the simulation described in Figure 10. Ze
communications offer a good scaling but an almost perfect scaling is obtained with the inclusion
Peer communications, as shown on Figure 12.
f the progressive mesh also has an important beneficial effect on the simulation
of size 128*128*128 are considered for these simulations.
and 14 describes performance in terms of calculations and memory consumption for the
simulation presented on Figure 10. Note that the progressive mesh algorithm obtains excellent
performance at the beginning of the simulation. The addition of sub domains
simulation has for consequence a decrease of performance until the convergence of the
simulation. In this particular case, all simulation domain is meshed at the end of the simulation
which leads to a very slight decrease of performance compared to the static
In terms of memory consumption, fast apparitions of news sub domains are noted which
to have the entire simulation domain in memory after a few iterations.
Figure 12: Comparison of performance between Peer-to-Peer communications with zero
communications for the simulation shown on Figure 10.
Comparison of performance between the progressive mesh method and the static mesh
method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment
is also presented.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
11
to the number of GPUs used for the simulation described in Figure 10. Zero-copy
scaling is obtained with the inclusion
an important beneficial effect on the simulation
8 are considered for these simulations. Figures 13
and 14 describes performance in terms of calculations and memory consumption for the
simulation presented on Figure 10. Note that the progressive mesh algorithm obtains excellent
b domains during the
simulation has for consequence a decrease of performance until the convergence of the
simulation. In this particular case, all simulation domain is meshed at the end of the simulation, as
t decrease of performance compared to the static
are noted which
with zero-copy
Comparison of performance between the progressive mesh method and the static mesh
method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
Figure 13 also compares performance between two different assignments for GPUs. The first one
is a simple assignation which assigns to ne
uses the optimization method presented in sec
leads to an important difference of performance
noted at the convergence of this simulation between the two approaches.
due to the fact that the communication cost is more
optimized assignment. Since subdomains are added dynamically and connected to e
therefore important to optimize these communications in order to reduce the simulation time.
The same comparison is also done for the simulation presented on Figure 11, as shown on Figures
15 and 16. The main difference in this situation
complex and channelized. Physical simulations on channelized geometry are especially pre
on industrial structures.
In this case, the progressive mesh method shows excellent results. In terms of memory,
method is easily able to simulate on a global simulation domain of size 1024*1024*128
while the static mesh method is unable to perform the simulation. The amount of
is indeed too important for this simulation.
consumption during the simulation.
less important than the static mesh method. A gain of approximatively 50% of memory is noted
for this particular simulation. This is due
automatically adapts to the evolution of the simulation and so only needed zones of the global
simulation domain are meshed.
Figure 14: Comparison of memory
mesh method for the simulation shown on Figure 10.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
Figure 13 also compares performance between two different assignments for GPUs. The first one
is a simple assignation which assigns to new subdomain the first available GPU. The second one
presented in section 4.4.2. The comparison of these two methods
leads to an important difference of performance. Indeed, a difference of approximatively 30% is
simulation between the two approaches. This difference is mostly
due to the fact that the communication cost is more important for a simple assignment than an
optimized assignment. Since subdomains are added dynamically and connected to each other, it is
these communications in order to reduce the simulation time.
The same comparison is also done for the simulation presented on Figure 11, as shown on Figures
15 and 16. The main difference in this situation is the geometry of the simulation which is more
complex and channelized. Physical simulations on channelized geometry are especially pre
In this case, the progressive mesh method shows excellent results. In terms of memory,
method is easily able to simulate on a global simulation domain of size 1024*1024*128
while the static mesh method is unable to perform the simulation. The amount of needed
is indeed too important for this simulation. Figure 15 shows the evolution of memory
consumption during the simulation. The memory cost at the convergence of the simulation is far
less important than the static mesh method. A gain of approximatively 50% of memory is noted
for this particular simulation. This is due to the fact that the progressive mesh method
automatically adapts to the evolution of the simulation and so only needed zones of the global
Figure 14: Comparison of memory consumption between the progressive mesh method and the static
mesh method for the simulation shown on Figure 10.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
12
Figure 13 also compares performance between two different assignments for GPUs. The first one
The second one
tion 4.4.2. The comparison of these two methods
. Indeed, a difference of approximatively 30% is
This difference is mostly
assignment than an
ach other, it is
these communications in order to reduce the simulation time.
The same comparison is also done for the simulation presented on Figure 11, as shown on Figures
is the geometry of the simulation which is more
complex and channelized. Physical simulations on channelized geometry are especially present
In this case, the progressive mesh method shows excellent results. In terms of memory, this
method is easily able to simulate on a global simulation domain of size 1024*1024*128 and more
needed memory
the evolution of memory
The memory cost at the convergence of the simulation is far
less important than the static mesh method. A gain of approximatively 50% of memory is noted
to the fact that the progressive mesh method
automatically adapts to the evolution of the simulation and so only needed zones of the global
consumption between the progressive mesh method and the static
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
Figure 15: Comparison of memory consumption between the
method for the simulation shown on Figure 11.
The comparison of the repartition
performance gain (19%) is still noted for this simulation. This proves that a
method is important in order to obtain good performance.
not need to be fully meshed brings an important gain in performance. The geometry has therefore
an important impact on the performance on
Figure 16: Comparison of performance between a simple repartition of GPUs with an optimized assignment
of GPUs for the simulation shown on Figure 11.
6. CONCLUSION
In this paper, an efficient progressive
Boltzmann method is presented. This progressive mesh method can be a useful tool in order to
perform several types of physical simulations. Its main advantage is that subdomains are
automatically added to the simulation by the use of an adapted criterion. This method is also able
to save a lot of memory and calculations
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
Figure 15: Comparison of memory consumption between the progressive mesh method and the static mesh
method for the simulation shown on Figure 11.
The comparison of the repartition of GPUs is also described in Figure 16. A
%) is still noted for this simulation. This proves that a dynamic optimization
method is important in order to obtain good performance. Moreover, the fact that the domain does
not need to be fully meshed brings an important gain in performance. The geometry has therefore
an important impact on the performance on the progressive mesh method.
Figure 16: Comparison of performance between a simple repartition of GPUs with an optimized assignment
of GPUs for the simulation shown on Figure 11.
In this paper, an efficient progressive mesh algorithm for physical simulations using the lattice
Boltzmann method is presented. This progressive mesh method can be a useful tool in order to
perform several types of physical simulations. Its main advantage is that subdomains are
ded to the simulation by the use of an adapted criterion. This method is also able
and calculations in order to perform simulations on large installations.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
13
progressive mesh method and the static mesh
. An important
dynamic optimization
Moreover, the fact that the domain does
not need to be fully meshed brings an important gain in performance. The geometry has therefore
Figure 16: Comparison of performance between a simple repartition of GPUs with an optimized assignment
mesh algorithm for physical simulations using the lattice
Boltzmann method is presented. This progressive mesh method can be a useful tool in order to
perform several types of physical simulations. Its main advantage is that subdomains are
ded to the simulation by the use of an adapted criterion. This method is also able
in order to perform simulations on large installations.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
14
The integration of the progressive mesh method on single-node multi-GPU architecture is also
treated. A dynamic optimization of the repartition of GPUs to subdomains is an important factor
in order to obtain good performance. The combination of all these contributions allows therefore
performing fast physical simulations on all types of geometry. The progressive mesh method is
therefore an interesting alternative because it allows obtaining similar or better performances than
the usual static mesh method.
The progressive mesh algorithm is however limited to the memory of the GPU which is generally
far more inferior to the CPU RAM. The creation of new subdomains is indeed possible while
there is a sufficient amount of memory on the GPUs. Extensions of this work to cases that require
more memory than all GPUs can handle is now under investigation. Data transfer optimizations
with the CPU host will therefore be essential to keep good performances.
ACKNOWLEDGEMENTS
This work has been made possible thanks to collaboration between academic and industrial
groups, gathered by the INNOCOLD association.
REFERENCES
[1] B. Chopard, J.L. Falcone J. Latt, The Lattice Boltzmann advection diffusion model revisited, The
European Physical Journal - Special Topics,Vol. 171, pp. 245-249, 2009.
[2] S. Gong, P. Cheng, Numerical investigation of droplet motion and coalescence by an improved lattice
Boltzmann model for phase transitionsand multiphase flows, Computers & Fluids , Vol. 53, pp. 93-
104, 2012.
[3] S. Gong, P. Cheng, A lattice Boltzmann method for liquid vapor phase change heat transfer,
Computers & Fluids, Vol. 54, pp. 93-104, 2012.
[4] J. Bao, L. Schaeffer, Lattice Boltzmann equation model for multicomponent multi-phase flow with
high density ratios, Applied MathematicalModelling, 2012.
[5] Nvidia, C. U. D. A. (2011). Nvidia cuda c programming guide. NVIDIA Corporation, 120, 18.8
[6] M. Wittmann, T. Zeiser, G. Hager, G. Wellein, Comparison of different propagation steps for Lattice
Boltzmann methods, Computers and Mathematicswith Applications, Vol. 65 pp. 924-935, 2013.
[7] J. Tölke, Implementation of a Lattice Boltzmann kernel using the compute unified device architecture
developed by nVIDIA, Computing andVisualization in Science, 1-11, 2008.
[8] J. Tölke, M. Krafczyk, TeraFLOP computing on a desktop PC with GPUs for 3D CFD, International
Journal of Computational Fluid Dynamics 22(7), pp. 443-456, 2008.
[9] F. Kuznik, C.Obrecht, G. Rusaouën, J-J. Roux, LBM based flow simulation using GPU computing
processor, Computers and Mathematics withApplications 27, 2009.
[10] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, A new approach to the lattice Boltzmann method
for graphics processing units, Computersand Mathematics with Applications 61, pp. 3628-3638, 2011.
[11] P.R. Rinaldi, E.A Dari, M.J. Vénere, A. Clausse, A Lattice-Boltzmannsolver for 3D fluid on GPU,
Simulation Modeling Pratice and Theory 25,pp. 163-171, 2012.
[12] P. Bailey, J. Myre, S. Walsh, D. Lilja, M. Saar, Accelerating lattice boltzmann fluid flows using
graphics processors, International Conferenceon Parallel Processing, pp. 550-557, 2009.
[13] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Multi-GPU implementation of the lattice
Boltzmann method, Computers and Mathematicswith Applications, 80, pp. 269-275, 2013.
[14] X. Li, Y. Zhang, X. Wang, W. Ge, GPU-based numerical simulation of multi-phase flow in porous
media using multiple-relaxation-time latticeBoltzmann method, Chemical Engineering Science, Vol.
102, pp. 209-219,2013.
[15] M. Januszewski, M. Kostur, Sailfish: A flexible multi-GPU implementationof the lattice Boltzmann
method, Computer Physics Communications,Vol. 185, pp. 2350-2368, 2014.
[16] F. Jiang, C. Hu, Numerical simulation of a rising CO2 droplet in the initial accelerating stage by a
multiphase lattice Boltzmann method,Applied Ocean Research, Vol. 45, pp. 1-9, 2014.
International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
15
[17] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, Multi-GPU Implementation of a Hybrid
Thermal Lattice Boltzmann Solver using theTheLMA Framework, Computers and Fluids, Vol. 80,
pp. 269275, 2013.
[18] C. Rosales, Multiphase LBM Distributed over Multiple GPUs, CLUSTER’11 Proceedings of the
2011 IEEE International Conference onCluster Computing, pp. 1-7, 2011.
[19] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Scalable lattice Boltzmann solvers for CUDA
GPU clusters, Parallel Computing, Vol.39, pp. 259-270, 2013.
[20] J. Habich, C. Feichtinger, H. Köstler, G. Hager, G. Wellein, Performance engineering for the lattice
Boltzmann method on GPGPUs: Architecturalrequirements and performance results, Computer &
Fluids, Vol. 80, pp.276-282, 2013.
[21] C. Feichtinger, J. Habich, H. Köstler, U. Rüde, T. Aoki, Performance Modeling and Analysis of
Heterogeneous Lattice Boltzmann Simulationson CPU-GPU Clusters, Parallel Computing, 2014.
AUTHORS
Julien Duchateau is a PhD student in computer science at the Université du Littoral Côte d’Opale in France.
His main research interest are massive parallelism on CPUs and GPUs, physical simulations and computer
graphics.
François Rousselle is an associate professor in computer science at the Université du Littoral Côte d’Opale
in France. His main research interests are computer graphics, physical simulations, virtual reality and
massive parallelism.
Nicolas Maquignon is a PhD student in simulation and numerical physics at the Université du Littoral Côte
d’Opale. His main research interests are numerical physics, numerical mathematics and numerical
modeling.
Christophe Renaud is a professor in computer science at the Université du Littoral Côte d’Opale in France.
His main research interests are computer graphics, virtual reality, physical simulations and massive
parallelism.
Gilles Roussel is an associate professor in automatic at the Université du Littoral Côte d’Opale in France.
His main research interests are automatic, signal processing, physical simulations and industrial computing.

More Related Content

What's hot (18)

PDF
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...
ijcsa
 
PDF
A NOVEL GRAPH REPRESENTATION FOR SKELETON-BASED ACTION RECOGNITION
sipij
 
PDF
A multi path routing algorithm for ip
Alvianus Dengen
 
PDF
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
csandit
 
PDF
NUPHBP13483
Patricia Francisconi
 
PDF
RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...
cseij
 
PDF
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Subhajit Sahu
 
PDF
Md simulations modified
shahmeermateen
 
PDF
Scalable and Adaptive Graph Querying with MapReduce
Kyong-Ha Lee
 
PDF
Multi Object Tracking Methods Based on Particle Filter and HMM
IJTET Journal
 
PDF
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
Kyong-Ha Lee
 
PDF
cug2011-praveen
Praveen Narayanan
 
PDF
Hardware Architecture for Calculating LBP-Based Image Region Descriptors
Marek Kraft
 
PDF
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...
IJCNCJournal
 
PDF
A tutorial on CGAL polyhedron for subdivision algorithms
Radu Ursu
 
PDF
Performance comparison of row per slave and rows set
eSAT Publishing House
 
PPT
UIC Panella Thesis
Marco Santambrogio
 
PPTX
cuTau Leaping
Amritesh Srivastava
 
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...
ijcsa
 
A NOVEL GRAPH REPRESENTATION FOR SKELETON-BASED ACTION RECOGNITION
sipij
 
A multi path routing algorithm for ip
Alvianus Dengen
 
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
csandit
 
RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...
cseij
 
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Subhajit Sahu
 
Md simulations modified
shahmeermateen
 
Scalable and Adaptive Graph Querying with MapReduce
Kyong-Ha Lee
 
Multi Object Tracking Methods Based on Particle Filter and HMM
IJTET Journal
 
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
Kyong-Ha Lee
 
cug2011-praveen
Praveen Narayanan
 
Hardware Architecture for Calculating LBP-Based Image Region Descriptors
Marek Kraft
 
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...
IJCNCJournal
 
A tutorial on CGAL polyhedron for subdivision algorithms
Radu Ursu
 
Performance comparison of row per slave and rows set
eSAT Publishing House
 
UIC Panella Thesis
Marco Santambrogio
 
cuTau Leaping
Amritesh Srivastava
 

Similar to A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN METHOD ON SINGLE-NODE MULTI-GPU ARCHITECTURES (20)

PDF
PROBABILISTIC DIFFUSION IN RANDOM NETWORK G...
ijfcstjournal
 
PDF
Lattice Boltzmann Method And Its Applications In Engineering 1st Edition Zhao...
dlerwrh463
 
PDF
PhysRevE.91.023310.pdf
Krunal Gangawane
 
PDF
Thesis_Walter_PhD_final_updated
Walter Rodrigues
 
PDF
GPU HistoPyramid Based Fluid Simulation and Rendering
João Vicente P. Reis Fo.
 
PDF
LChen_diss_Pitt_FVDBM
Leitao Chen
 
PDF
rgDefense
Rajesh Gandham
 
PDF
Accelerating S3D A GPGPU Case Study
Martha Brown
 
PDF
Gpuslides
Storti Mario
 
PDF
Fast Fluid Thermodynamics Simulation By Solving Heat Diffusion Equation
ijcga
 
PDF
Fast Fluid Thermodynamics Simulation by Solving Heat Diffusion Equation
ijcga
 
PDF
Fast Fluid Thermodynamics Simulation By Solving Heat Diffusion Equation
ijcga
 
PDF
Fast Fluid Thermodynamics Simulation by Solving Heat Diffusion Equation
ijcga
 
PDF
Huge-Scale Molecular Dynamics Simulation of Multi-bubble Nuclei
Hiroshi Watanabe
 
PDF
CMSI計算科学技術特論A (2015) 第13回 Parallelization of Molecular Dynamics
Computational Materials Science Initiative
 
DOCX
PORE-SCALE SIMULATION OF MULTIPHASE FLOW USING LATTICE BOLTZ.docx
harrisonhoward80223
 
PDF
El text.tokuron a(2019).jung190711
RCCSRENKEI
 
PDF
TR-CIS-0420-09 BobZigon
Bob Zigon
 
PDF
Solving large sparse linear systems on the GPU
Bruno Levy
 
PDF
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Storti Mario
 
PROBABILISTIC DIFFUSION IN RANDOM NETWORK G...
ijfcstjournal
 
Lattice Boltzmann Method And Its Applications In Engineering 1st Edition Zhao...
dlerwrh463
 
PhysRevE.91.023310.pdf
Krunal Gangawane
 
Thesis_Walter_PhD_final_updated
Walter Rodrigues
 
GPU HistoPyramid Based Fluid Simulation and Rendering
João Vicente P. Reis Fo.
 
LChen_diss_Pitt_FVDBM
Leitao Chen
 
rgDefense
Rajesh Gandham
 
Accelerating S3D A GPGPU Case Study
Martha Brown
 
Gpuslides
Storti Mario
 
Fast Fluid Thermodynamics Simulation By Solving Heat Diffusion Equation
ijcga
 
Fast Fluid Thermodynamics Simulation by Solving Heat Diffusion Equation
ijcga
 
Fast Fluid Thermodynamics Simulation By Solving Heat Diffusion Equation
ijcga
 
Fast Fluid Thermodynamics Simulation by Solving Heat Diffusion Equation
ijcga
 
Huge-Scale Molecular Dynamics Simulation of Multi-bubble Nuclei
Hiroshi Watanabe
 
CMSI計算科学技術特論A (2015) 第13回 Parallelization of Molecular Dynamics
Computational Materials Science Initiative
 
PORE-SCALE SIMULATION OF MULTIPHASE FLOW USING LATTICE BOLTZ.docx
harrisonhoward80223
 
El text.tokuron a(2019).jung190711
RCCSRENKEI
 
TR-CIS-0420-09 BobZigon
Bob Zigon
 
Solving large sparse linear systems on the GPU
Bruno Levy
 
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Storti Mario
 
Ad

Recently uploaded (20)

DOCX
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
PPTX
Introduction to Neural Networks and Perceptron Learning Algorithm.pptx
Kayalvizhi A
 
PDF
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
PPTX
artificial intelligence applications in Geomatics
NawrasShatnawi1
 
PPTX
Green Building & Energy Conservation ppt
Sagar Sarangi
 
PPTX
Product Development & DevelopmentLecture02.pptx
zeeshanwazir2
 
PPTX
Thermal runway and thermal stability.pptx
godow93766
 
PPTX
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
PDF
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
PPTX
GitOps_Repo_Structure for begeinner(Scaffolindg)
DanialHabibi2
 
DOCX
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
PDF
6th International Conference on Machine Learning Techniques and Data Science ...
ijistjournal
 
PDF
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
DOC
MRRS Strength and Durability of Concrete
CivilMythili
 
PPTX
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
PPTX
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
PDF
Pressure Measurement training for engineers and Technicians
AIESOLUTIONS
 
PPTX
Server Side Web Development Unit 1 of Nodejs.pptx
sneha852132
 
PPTX
Break Statement in Programming with 6 Real Examples
manojpoojary2004
 
PPTX
Heart Bleed Bug - A case study (Course: Cryptography and Network Security)
Adri Jovin
 
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
Introduction to Neural Networks and Perceptron Learning Algorithm.pptx
Kayalvizhi A
 
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
artificial intelligence applications in Geomatics
NawrasShatnawi1
 
Green Building & Energy Conservation ppt
Sagar Sarangi
 
Product Development & DevelopmentLecture02.pptx
zeeshanwazir2
 
Thermal runway and thermal stability.pptx
godow93766
 
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
GitOps_Repo_Structure for begeinner(Scaffolindg)
DanialHabibi2
 
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
6th International Conference on Machine Learning Techniques and Data Science ...
ijistjournal
 
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
MRRS Strength and Durability of Concrete
CivilMythili
 
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
Pressure Measurement training for engineers and Technicians
AIESOLUTIONS
 
Server Side Web Development Unit 1 of Nodejs.pptx
sneha852132
 
Break Statement in Programming with 6 Real Examples
manojpoojary2004
 
Heart Bleed Bug - A case study (Course: Cryptography and Network Security)
Adri Jovin
 
Ad

A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN METHOD ON SINGLE-NODE MULTI-GPU ARCHITECTURES

  • 1. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 DOI:10.5121/ijdps.2015.6501 1 A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN METHOD ON SINGLE-NODE MULTI-GPU ARCHITECTURES Julien Duchateau1 , François Rousselle1 , Nicolas Maquignon1 , Gilles Roussel1 , Christophe Renaud1 1 Laboratoire d’Informatique, Signal, Image de la Côte d’Opale Université du Littoral Côte d’Opale, Calais, France ABSTRACT In this paper, a new progressive mesh algorithm is introduced in order to perform fast physical simulations by the use of a lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. This algorithm is able to mesh automatically the simulation domain according to the propagation of fluids. This method can also be useful in order to perform several types of physical simulations. In this paper, we associate this algorithm with a multiphase and multicomponent lattice Boltzmann model (MPMC–LBM) because it is able to perform various types of simulations on complex geometries. The use of this algorithm combined with the massive parallelism of GPUs[5] allows to obtain very good performance in comparison with the staticmesh method used in literature. Several simulations are shown in order to evaluate the algorithm. KEYWORDS Progressive mesh, Lattice Boltzmann method,single-node multi-GPU, parallel computing. 1. INTRODUCTION The lattice Boltzmann method (LBM) is a computational fluid dynamics (CFD) method. It is a relatively recent technique which is able to approximate Navier-Stokes equations by a collision- propagation scheme [1]. Lattice Boltzmann method however differs from standard approaches as finite element method (FEM) or finite volume method (FVM) by its mesoscopic approach. It is an interesting alternative which is able to simulate complex phenomena on complex geometries. Its high parallelization makes also this method attractive in order to perform simulations on parallel hardware. Moreover, the emergence of high-performance computing (HPC) architectures using GPUs [5] is also a great interest for many researchers. Parallelization is indeed an important asset of lattice Boltzmann method. However, perform simulations on large complex geometries can be very costly in computational resources. This paper introduces a new progressive mesh algorithm in order to perform physical simulations on complex geometries by the use of a multiphase and multicomponent lattice Boltzmann method. The algorithm is able to automatically mesh the simulation domain according to the propagation of fluids. Moreover, the integration of this algorithm on single-node multi-GPU architecture is also an important matter which is studied in this paper. This method is an interesting alternative which has never been exploited at the best of our knowledge.
  • 2. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 2 Section 2 first describes the multiphase and multicomponent lattice Boltzmann method. It is able to simulate the behavior of fluids with several physical states (phase) and it is also able to model several fluids (component) interacting with each other. Section 3 presents then several recent works involving lattice Boltzmann method on GPUs. Section 4 mostly concerns the main contribution of this paper: the inclusion of a progressive mesh method in the simulation code. The principles of the method and the definition of an adapted criterion are firstly introduced. The integration on a single-node multi-GPU architecture is then described. An analysis concerning performance is also studied in section 5. The conclusion and future works are finally presented in the last section. 2. THE LATTICE BOLTZMANN METHOD 2.1. The Single relaxation time Bhatnagar-Gross-Krook (SRT-BGK) Boltzmann equation The lattice Boltzmann method is based on three main discretizations: space, time and velocities. Velocity space is reduced to a finite number of well-defined vectors. Figures 1(a) and 1(b) illustrate this discrete scheme for D2Q9 and D3Q19 model. The simulation grid is therefore discretized as a Cartesian grid and calculation steps are achieved on this entire grid. The discrete Boltzmann equation[1] with a single relaxation timeBhatnagar- Gross-Krook (SRT-BGK) collision term is defined by the following equation: ݂௜ሺ‫ݔ‬ ൅ ݁௜, ‫ݐ‬ ൅ Δ௧ሻ െ ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ 1 ߬ ቀ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ െ ݂௜ ௘௤ ሺ‫,ݔ‬ ‫ݐ‬ሻቁ (1) ݂௜ ௘௤ ሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ ߱௜ߩሺ‫,ݔ‬ ‫ݐ‬ሻ ቆ1 ൅ ݁௜‫ ݑ‬ ܿ௦ ଶ ൅ ሺ݁௜‫ݑ‬ሻଶ 2ܿ௦ ସ െ ‫ݑ‬ଶ 2ܿ௦ ଶ ቇ (2) ܿ௦ ଶ ൌ 1 3 ൬ Δ௫ Δ௧ ൰ ଶ (3) The function ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ corresponds to the discrete density distribution function along velocity vector ݁௜ at a position‫ݔ‬ and a time ‫.ݐ‬ The parameter ߬ corresponds to the relaxation time of the simulation. The value ߩ is the fluid density and ‫ݑ‬ corresponds to the fluid velocity. Δ௫andΔ௧ are the spatial and temporal steps of the simulation respectively. Parameters ‫ݓ‬௜ are weighting values defined according to the lattice Boltzmann scheme and can be found in [1].Macroscopic quantities as density ߩ and velocity ‫ݑ‬ are finally computed as follows: (a) D2Q9 scheme (b) D3Q19 scheme Figure 1: Example of Lattice Boltzmann schemes
  • 3. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 3 ߩሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ ෍ ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ ௜ (4) ߩሺ‫,ݔ‬ ‫ݐ‬ሻ‫ݑ‬ሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ ෍ ݂௜ሺ‫,ݔ‬ ‫ݐ‬ሻ݁௜ ௜ (5) 2.2. Multiphase and Multi Component Lattice Boltzmann Model Multiphase and multicomponent models (MPMC) allow performing complex simulations involving several physical components. In this section, a MPMC-LBM model based on the work achieved by Bao& Schaeffer [4] is presented.It includes several interaction forces based on pseudo-potential. It is calculated as follows: ߰ఈ ൌ ඨ 2ሺ‫݌‬ఈ െ ܿ௦ ଶߩఈሻ ܿ௦ ଶ݃ఈఈ (6) The term ‫݌‬ఈ is the pressure term. It is calculated by the use of an equation of state as the Peng- Robinson equation: ‫݌‬ఈ ൌ ߩఈܴఈܶఈ 1 െ ܾఈߩఈ െ ܽఈߠሺܶఈሻߩఈ ଶ 1 ൅ 2ܾఈ െ ܾଶߩଶ (7) Internal forces are then computed. The internal fluid interaction force is expressed as follows [2] [3]: ‫ܨ‬ఈఈሺ‫ݔ‬ሻ ൌ െߚ ݃ఈ 2 ܿ௦ ଶ ߰ఈሺ‫ݔ‬ሻ ෍ ‫ݓ‬௜ ௫ᇲ ߰ఈሺ‫ݔ‬ᇱሻሺ‫ݔ‬ᇱ െ ‫ݔ‬ሻ െ 1 െ ߚ 2 ݃ఈ 2 ܿ௦ ଶ ߰ఈሺ‫ݔ‬ሻ‫ݓ‬௜߰ఈ ଶሺ‫ݔ‬ᇱሻሺ‫ݔ‬ᇱ െ ‫ݔ‬ሻ (8) The valueߚ is a weighting term generally fixed to 1.16 according to [2] [3]. The inter-component force is also introduced as follows [4]: ‫ܨ‬ఈఈᇲሺ‫ݔ‬ሻ ൌ െ ݃ఈఈᇱ 2 ܿ௦ ଶ ߰ఈሺ‫ݔ‬ሻ ෍ ‫ݓ‬௜ ௫ᇲ ߰ఈሺ‫ݔ‬ᇱሻሺ‫ݔ‬ᇱ െ ‫ݔ‬ሻ (9) Additional forces can be added into the simulation code as the gravity force, or a fluid-structure interaction [3]. The incorporation of the force term is then achieved by a modifiedcollision operator expressed as follows: ݂ఈ,௜ሺ‫ݔ‬ ൅ ݁௜, ‫ݐ‬ ൅ Δ௧ሻ െ ݂ఈ,௜ሺ‫,ݔ‬ ‫ݐ‬ሻ ൌ 1 ߬ ቀ݂ఈ,௜ሺ‫,ݔ‬ ‫ݐ‬ሻ െ ݂ఈ,௜ ௘௤ ሺ‫,ݔ‬ ‫ݐ‬ሻቁ ൅ Δ݂ఈ,௜ (10) Δ݂ఈ,௜ ൌ ݂ఈ,௜ ௘௤ ሺߩఈ, ‫ݑ‬ఈ ൅ Δ‫ݑ‬ఈሻ െ ݂ఈ,௜ ௘௤ ሺߩఈ, ‫ݑ‬ఈሻ (11) Δ‫ݑ‬ఈ ൌ ‫ܨ‬ఈΔ௧ ߩఈ (12) Macroscopic quantities for each component are finally computed by the use of equations (4) and (5). 3. LATTICE BOLTZMANN METHODS AND GPUS The mass parallelism of GPUs has been quickly exploited in order to perform fast simulations[7] [8] using lattice Boltzmann method. Recent works have shown that GPUs are also used with multiphase and multicomponent models [16] [14]. The main aspects of GPU optimizations are
  • 4. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 decomposed into several categories overlap of memory transfers with computations …. optimize global memory bandwidt Concerning LBM, an adapted data structure such as the Structure of Array (SoA studied and has proven to be efficient on GPU Several access patterns are also pattern, consists of using two calculation grids in GPU global memory in order to manage the temporal and spatial dependency of the data (Equation reading distribution functions from A and writing them to B, and reading from B reciprocally. This pattern is commonly used and offers very goo single GPU. Several techniques are however presented in literature in order to reduce significantly the computational memory cost without loss of information such as grids compression [6], Swap algorithm technique is used in order to save memory due to spatial and temporal data dependency. Recent works involving implementation of l of several GPUs are also available. A first solution, proposed in entire simulation domain into sub LBM kernels on each sub-domain in parallel. CPU threads are used to handle each CUDA context. Communications between sub Zero-copy feature allows to perform efficient communications by a mapping between CPU and GPU pointers. Data must however be read and written only once in order to obtain good performance. Some approaches have finally been proposed constituted of multiple GPUs by the use of MPI in combination with CUDA our case, we only dispose of one computing node with multiple GPUs thus we don't these architectures in this paper. 4. A PROGRESSIVE MESH ALG ON SINGLE-NODE MULTI-GPU 4.1. Motivation Works described in the previous section consider that the entire simulation domain is divided into subdomains according to the number of subdomains are therefore calculated in parallel. Figure 2: Division of the simulation domain: the entire domain is decomposed into subdomains according International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 decomposed into several categories [10] [9] as thread level parallelism, GPU memory acce overlap of memory transfers with computations …. Data coalescence is needed in order to optimize global memory bandwidth. This implies several conditions as described in [9 Concerning LBM, an adapted data structure such as the Structure of Array (SoA) has been well studied and has proven to be efficient on GPU [7]. also described in the literature. The first one, named A pattern, consists of using two calculation grids in GPU global memory in order to manage the temporal and spatial dependency of the data (Equation (10)). Simulation steps alternate between reading distribution functions from A and writing them to B, and reading from B and writing to A This pattern is commonly used and offers very good performance [10] [11] [9 single GPU. Several techniques are however presented in literature in order to reduce significantly the computational memory cost without loss of information such as grids Swap algorithm [6] or A-A pattern technique [12]. In this paper, the A technique is used in order to save memory due to spatial and temporal data dependency. ks involving implementation of lattice Boltzmann method on a single-node composed ilable. A first solution, proposed in [13] [17], consists in dividing the simulation domain into subdomains according to the number of GPUs and performing domain in parallel. CPU threads are used to handle each CUDA context. Communications between sub-domains are performed using zero-copy memory transfers. to perform efficient communications by a mapping between CPU and GPU pointers. Data must however be read and written only once in order to obtain good Some approaches have finally been proposed recently to perform simulations on several no constituted of multiple GPUs by the use of MPI in combination with CUDA [19][18 our case, we only dispose of one computing node with multiple GPUs thus we don't PROGRESSIVE MESH ALGORITHM FOR LATTICE BOLTZMANN METHOD GPU ARCHITECTURES in the previous section consider that the entire simulation domain is divided into subdomains according to the number of GPUs, as shown on Figure 2. All therefore calculated in parallel. Figure 2: Division of the simulation domain: the entire domain is decomposed into subdomains according to the number of GPUs. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 4 as thread level parallelism, GPU memory access, Data coalescence is needed in order to l conditions as described in [9]. ) has been well described in the literature. The first one, named A-B access pattern, consists of using two calculation grids in GPU global memory in order to manage the imulation steps alternate between and writing to A [10] [11] [9] on a single GPU. Several techniques are however presented in literature in order to reduce significantly the computational memory cost without loss of information such as grids In this paper, the A-A pattern technique is used in order to save memory due to spatial and temporal data dependency. node composed , consists in dividing the domains according to the number of GPUs and performing domain in parallel. CPU threads are used to handle each CUDA copy memory transfers. to perform efficient communications by a mapping between CPU and GPU pointers. Data must however be read and written only once in order to obtain good to perform simulations on several nodes [18][21] [15]. In our case, we only dispose of one computing node with multiple GPUs thus we don't focus on OLTZMANN METHODS in the previous section consider that the entire simulation domain is meshed and Figure 2. All Figure 2: Division of the simulation domain: the entire domain is decomposed into subdomains according
  • 5. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 In this paper, a new approach is considered. does not requires to be fully meshed at the beginning new progressive mesh method propagation of the simulated flui beginning of the simulation (Figure 3(a)) propagation of the fluid as can be seen of Figure 3 the simulation geometry (Figure 3(c)) simulations. It is also a real advantage for an application on industrial structures mostly composed of pipes or channels. It can indeed save a lot of memory and calculations geometry used for the simulation. Figure 3: Example of a 3D simu created at the beginning of the simulation, (b) several subdomains are created following the propagation of fluid, (c) all subdomains are created and completely adapt to the simulation g The progressive mesh algorithm firstly needs the introduction of a create a new subdomain to the simulation. This existing subdomains. Calculations on optimization factor. 4.2. Definition of a Criterion for the Progressive Mesh The definition of a criterion is an for the simulation. This criterion needs to represent eff velocity seems like a good choice in order to define an efficient criterion fluid velocity between two iterations dispersion. Our criterion is therefore defined as follows ‖‫ܥ‬ఈሺ‫ݔ‬ሻ‖ The symbol ‖. ‖ଶstands for the Euclidean norm in this paper. for all active subdomains on the boundary, a new subdomain is created next to this boundary as shown on Figure 4. generally fixed to 0 in this paper in order to each subdomain. (a) (b) (c) International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 approach is considered. For most simulations, the entire domain generally fully meshed at the beginning of the simulation. We propose therefore a method in order to dynamically create the mesh according to the fluid. The idea consists in defining a first subdomain at the (Figure 3(a)). Several subdomains can then be created following the propagation of the fluid as can be seen of Figure 3(b). This method finally adapts automatically (Figure 3(c)). This method is therefore applicable for any geometry and . It is also a real advantage for an application on industrial structures mostly composed It can indeed save a lot of memory and calculations according to the geometry used for the simulation. Figure 3: Example of a 3D simulation using the progressive mesh algorithm: (a) a first subdomain is created at the beginning of the simulation, (b) several subdomains are created following the propagation of fluid, (c) all subdomains are created and completely adapt to the simulation geometry. The progressive mesh algorithm firstly needs the introduction of an adapted criterion in order to w subdomain to the simulation. This new subdomain needs then to be connected to Calculations on single-node multi-GPU architecture are finally of a Criterion for the Progressive Mesh n important aspect in order to efficiently create new subdomains s criterion needs to represent efficiently the propagation of fluid. The fluid ike a good choice in order to define an efficient criterion. The difference of the fluid velocity between two iterations is considered in order to observe efficiently Our criterion is therefore defined as follows for thecomponent ߙ: ሺ ሻ‖ଶ ൌ ‖‫ݑ‬ఈሺ‫,ݔ‬ ‫ݐ‬ ൅ Δ௧ሻ െ ‫ݑ‬ఈሺ‫,ݔ‬ ‫ݐ‬ሻ‖ଶ stands for the Euclidean norm in this paper. This criterion needs to be calculated subdomains on the boundaries. If the criterion exceeds anarbitrary threshold boundary, a new subdomain is created next to this boundary as shown on Figure 4.The value in this paper in order to detect any change of velocity on the boundaries of (a) (b) (c) International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 5 simulations, the entire domain generally We propose therefore a create the mesh according to the domain at the Several subdomains can then be created following the automatically to This method is therefore applicable for any geometry and . It is also a real advantage for an application on industrial structures mostly composed according to the lation using the progressive mesh algorithm: (a) a first subdomain is created at the beginning of the simulation, (b) several subdomains are created following the propagation of eometry. criterion in order to s then to be connected to GPU architecture are finally an important important aspect in order to efficiently create new subdomains fluid. The fluid he difference of the efficiently the fluid (13) This criterion needs to be calculated thresholdܵon a The valueܵ is n the boundaries of
  • 6. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 Figure 4: The criterion ǁC_α (x) ǁ_2 is calculated on the boundary. If the criterion exceeds the threshold S then a new subdomain is created next to the boundary. 4.3. Algorithm This section describes the algorithm for the model with the inclusion of our progressive mesh algorithm. summarize the previous sections. The calculation of the criterion and the creation of new subdomains are achieved at the last step of the algorithm in order to not disturb the simulation process. Figure 5 describes our resulti Figure 5: Algorithm for the multiphase and multicomponent Lattice Boltzmann model with the inclusion of our progressive mesh method. For colors, please refer to the PDF version of this paper. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 ǁ_2 is calculated on the boundary. If the criterion exceeds the threshold S then a new subdomain is created next to the boundary. the algorithm for the multiphase and multicomponent lattice Boltzmann model with the inclusion of our progressive mesh algorithm. It is also useful in order to summarize the previous sections. The calculation of the criterion and the creation of new subdomains are achieved at the last step of the algorithm in order to not disturb the simulation describes our resulting algorithm. Figure 5: Algorithm for the multiphase and multicomponent Lattice Boltzmann model with the inclusion of our progressive mesh method. For colors, please refer to the PDF version of this paper. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 6 _2 is calculated on the boundary. If the criterion exceeds the threshold S attice Boltzmann It is also useful in order to summarize the previous sections. The calculation of the criterion and the creation of new subdomains are achieved at the last step of the algorithm in order to not disturb the simulation Figure 5: Algorithm for the multiphase and multicomponent Lattice Boltzmann model with the inclusion of our progressive mesh method. For colors, please refer to the PDF version of this paper.
  • 7. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 4.4. Integration on Single-Node Multi Efficiency of inter-GPU communications is surely the most difficult task in order to obtain good performance. Indeed, our simulations are composed of numerous subdomains which are a dynamically. The repartition of GPUs to the different subdomains is an important factor of optimization. An efficient assignment can have an important impact on the performance of the simulation. Indeed, it can reduce the communication time between subdomains and so reduce the simulation time. 4.4.1. Overlap Communications with Computations Several data exchanges are needed for this type of model. The computation of interaction inter-component ‫ܨ‬௘௫௧ implies to have access to neighboring values of the pseudo propagation step of LBM also implies to communicate several distribution functions GPUs (Figure 6). Aligned buffers ma In order to obtain a simulation time as short as possible, it is necessary to overlap data transfer with algorithm calculations. Indeed, overlapping computations and communications allows obtain a significant performance gain by reducing the waiting the computation process into 2 steps Computations on the needed boundaries are firstly done. Communi subdomains are also done while computing performed simultaneously with calculations which allow In most cases for lattice Boltzmann method, memory is page-locked memory which allow go [17][13] [15].A different approach In most recent HPC architectures, several GPUs can be connected to the same PCIe. To improve performance, Nvidia launched GPUDirect with CUDA 4.0. Figure 6: Schematic example for communication of distribution functions in 2D: red arrows corresponds to ݂௜ values to communicate between subdomains. For colors, please refer to the PDF version of this paper. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 Node Multi-GPU Architecture GPU communications is surely the most difficult task in order to obtain good Indeed, our simulations are composed of numerous subdomains which are a of GPUs to the different subdomains is an important factor of optimization. An efficient assignment can have an important impact on the performance of the simulation. Indeed, it can reduce the communication time between subdomains and so reduce the Communications with Computations needed for this type of model. The computation of interaction implies to have access to neighboring values of the pseudo-potential. The propagation step of LBM also implies to communicate several distribution functions ligned buffers may be used for data transactions. In order to obtain a simulation time as short as possible, it is necessary to overlap data transfer with algorithm calculations. Indeed, overlapping computations and communications allows significant performance gain by reducing the waiting time of data. The idea is to separate omputation process into 2 steps: boundary calculations and interior Computations on the needed boundaries are firstly done. Communications between neighboring domains are also done while computing the interior. The different communications are thus calculations which allow good efficiency. attice Boltzmann method, memory is transferred via zero-copy transactions to locked memory which allow good overlapping between communications and computations different approach is studied in this paper concerning inter-GPU communications. In most recent HPC architectures, several GPUs can be connected to the same PCIe. To improve e, Nvidia launched GPUDirect with CUDA 4.0.This technology allows to perform Figure 6: Schematic example for communication of distribution functions in 2D: red arrows values to communicate between subdomains. For colors, please refer to the PDF International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 7 GPU communications is surely the most difficult task in order to obtain good Indeed, our simulations are composed of numerous subdomains which are added of GPUs to the different subdomains is an important factor of optimization. An efficient assignment can have an important impact on the performance of the simulation. Indeed, it can reduce the communication time between subdomains and so reduce the needed for this type of model. The computation of interaction ‫ܨ‬௜௡௧ and potential. The propagation step of LBM also implies to communicate several distribution functions݂௜ between In order to obtain a simulation time as short as possible, it is necessary to overlap data transfer with algorithm calculations. Indeed, overlapping computations and communications allows to time of data. The idea is to separate calculations. cations between neighboring the interior. The different communications are thus copy transactions to od overlapping between communications and computations GPU communications. In most recent HPC architectures, several GPUs can be connected to the same PCIe. To improve This technology allows to perform Figure 6: Schematic example for communication of distribution functions in 2D: red arrows values to communicate between subdomains. For colors, please refer to the PDF
  • 8. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 Peer-to-Peer transfers and memory accesses between two compatible GPUs. The idea is to perform data transfer using Peer- zero-copy transactions for others. This method allows to communicate data by bypassing the use of the CPU and therefore to accelerate the transfer (Figure improves performance and the efficiency of the simulatio Figure 7: GPUDirect technology (source Nvidia). 4.4.2 Optimization of Data Transfer between The repartition of GPUs is an important factor of optimization for this type of applications. Communications cost is generally a bottleneck for exchanges between sub domains associated with one GPU.The first belonging to the same GPU. In this case, the communication cost is extremely low because communications are performed on the same concern communications between however made between Peer-to- goal to optimize dynamically the repartition of For a new sub domain ‫,ܩ‬ the function Where ‫′ܩ‬ denotes neighboring subdomains to ߛሺ‫,ܩ‬ ‫ܩ‬ᇱሻ ൌ ቐ0.5 ∗ ‫݂݋݁ݖ݅ݏ‬ሺ‫ݎ݂݁ݏ݊ܽݎݐ‬ ‫݂݋݁ݖ݅ݏ‬ሺ‫ݎ݂݁ݏ݊ܽݎݐ‬ሻ The function ߛሺ‫,ܩ‬ ‫ܩ‬ᇱሻ compares the different ways subdomain and its neighbors. An arbitrary weighting to-Peer communications. The function The function ‫ܨ‬ሺ‫ܩ‬ሻ needs therefo cost. This function is calculated for all available GPUs and the GPU with the minimum value is assigned to this subdomain. In order to keep load balancing, all GPUs have to be assigned dynamically and the same GPU could not be assigned two times as long as others GPUs are not assigned. Figure 8 explains via a International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 Peer transfers and memory accesses between two compatible GPUs. The idea is to -to-Peer data transactions for GPUs sharing the same I/O hub copy transactions for others. This method allows to communicate data by bypassing the use of the CPU and therefore to accelerate the transfer (Figure 7). The use of this type of transaction improves performance and the efficiency of the simulation code. Figure 7: GPUDirect technology (source Nvidia). Data Transfer between GPUs The repartition of GPUs is an important factor of optimization for this type of applications. Communications cost is generally a bottleneck for multi-GPU simulations. Three ways of data sub domains are defined. A first assumption assumes that one sub domain The first way concerns communications between In this case, the communication cost is extremely low because communications are performed on the same GPU global memory. The second and the third ways communications between sub domains belonging to different GPUs. A distinction is -Peer exchanges and zero-copy exchanges. This section has for goal to optimize dynamically the repartition of GPUs to new sub domains. function ‫ܨ‬ is defined as follows: ‫ܨ‬ሺ‫ܩ‬ሻ ൌ ෍ ߛ ீᇲ ሺ‫,ܩ‬ ‫ܩ‬ᇱ ሻ denotes neighboring subdomains to ‫ܩ‬ and ߛሺ‫,ܩ‬ ‫ܩ‬ᇱ ሻ is defined as follows: 0 ݂݅ ‫ܷܲܩ‬ሺ‫ܩ‬ሻ ൌ ‫ܷܲܩ‬ሺ‫ܩ‬ᇱ ሻ ‫ݎ݂݁ݏ݊ܽݎݐ‬ሻ ݂݅ ‫ܷܲܩ‬ሺ‫ܩ‬ሻ ് ‫ܷܲܩ‬ሺ‫ܩ‬ᇱሻܽ݊݀ ‫ܷܲܩ‬ሺ‫ܩ‬ሻ ܿܽ݊ ܲ2ܲ ‫ܷܲܩ‬ሺ ሻ ݂݅ ‫ܷܲܩ‬ሺ‫ܩ‬ሻ ് ‫ܷܲܩ‬ሺ‫ܩ‬ᇱሻܽ݊݀ ‫ܷܲܩ‬ሺ‫ܩ‬ሻ ݊‫ܷܲܩ ܲ2ܲ ݐ݋‬ሺ‫ܩ‬ᇱ ሻ compares the different ways of communications between the new subdomain and its neighbors. An arbitrary weighting value is included in order to promote Peer Peer communications. The function ‫ ܨ‬performs the calculation of ߛfor all active neighbors. needs therefore to be minimized in order to obtain the best communication This function is calculated for all available GPUs and the GPU with the minimum value is assigned to this subdomain. In order to keep load balancing, all GPUs have to be assigned y and the same GPU could not be assigned two times as long as others GPUs are not Figure 8 explains via a simple example the principle of this optimization. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 8 Peer transfers and memory accesses between two compatible GPUs. The idea is to s sharing the same I/O hub and copy transactions for others. This method allows to communicate data by bypassing the use The use of this type of transaction The repartition of GPUs is an important factor of optimization for this type of applications. Three ways of data sub domain is concerns communications between sub domains In this case, the communication cost is extremely low because econd and the third ways belonging to different GPUs. A distinction is copy exchanges. This section has for (14) ሺ‫ܩ‬ᇱ ሻ ሻ (15) of communications between the new is included in order to promote Peer- for all active neighbors. re to be minimized in order to obtain the best communication This function is calculated for all available GPUs and the GPU with the minimum value is assigned to this subdomain. In order to keep load balancing, all GPUs have to be assigned y and the same GPU could not be assigned two times as long as others GPUs are not
  • 9. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 5. RESULTS AND PERFORMAN 5.1. Hardware 8 NVIDIA Tesla C2050 graphics cards Fermi architecture based machine are used to perform simulations. Table 1 describes some Tesla communications for our architecture are also described in Figure 9. Table 1: Tesla C2050 Hardware specifications Figure 9: Peer-to-Peer communications accessibility for our architecture. CUDA compute capability Total amount of global memory (14) Multiprocessors, (32) scalar processors/MP GPU clock rate L2 cache Total amount of shared memory per block Total number of registers available per block Figure 8: Schematic example in 2D for the optimization of the repartition of GPUs. The function ‫ܨ‬ሺ‫ܩ‬ሻ is calculated for all available GPUs and the GPU which have the minimum value is chosen. For colors, please refer to the PDF version of this paper. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 ESULTS AND PERFORMANCE 8 NVIDIA Tesla C2050 graphics cards Fermi architecture based machine are used to perform simulations. Table 1 describes some Tesla C2050 hardware specifications. communications for our architecture are also described in Figure 9. Table 1: Tesla C2050 Hardware specifications Peer communications accessibility for our architecture. CUDA compute capability 2.0 Total amount of global memory 2687 MBytes (14) Multiprocessors, (32) scalar processors/MP 448 CUDA cores GPU clock rate 1147 MHz L2 cache size 786432 bytes Total amount of shared memory per block 49152 bytes Total number of registers available per block 32768 Figure 8: Schematic example in 2D for the optimization of the repartition of GPUs. The function is calculated for all available GPUs and the GPU which have the minimum value is chosen. For colors, please refer to the PDF version of this paper. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 9 8 NVIDIA Tesla C2050 graphics cards Fermi architecture based machine are used to perform C2050 hardware specifications. Peer-to-Peer 2.0 2687 MBytes 448 CUDA cores 1147 MHz 786432 bytes 49152 bytes 32768 Figure 8: Schematic example in 2D for the optimization of the repartition of GPUs. The function is calculated for all available GPUs and the GPU which have the minimum value is chosen. For
  • 10. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 10 5.2. Simulations Two simulations are considered on large simulation domain in order to evaluate the performance of our contribution. Both simulations include the use of two physical components. The geometry however differs between these simulations. The first simulation is based on a simple geometry composed of 1024*256*256 calculation cells where a fluid fills all simulation domains during the simulation (Figure 10). The second simulation is based on a complex geometry composed of 1024*1024*128 calculations cells where the fluid moves within channels (Figure 11). 5.3. Performance This section deals with the performance obtained by our method. A comparison between the progressive mesh algorithm and the static mesh method generally used in literature is shown. The optimization of the repartition of GPUs on subdomains is also studied. The performance metric generally used for lattice Boltzmann method is the Million Lattice nodes Updates Per Second (MLUPS). It is calculated as follows: ܲ݁‫݂ݎ‬ெ௅௎௉ௌ ൌ ݀‫݁ݖ݅ݏ ݊݅ܽ݉݋‬ ∗ ݊‫ݏ݊݋݅ݐܽݎ݁ݐ݅ ݂݋ ݎܾ݁݉ݑ‬ ‫݁݉݅ݐ ݊݋݅ݐ݈ܽݑ݉݅ݏ‬ (16) This classical approach generally used in literature in order to perform simulations consists in equally dividing the simulation domain according to the number of GPUs. It offers generally good performance as communications can be overlapped with calculations. The use of Peer-to- Peer communications also has a beneficial effect on the performance, as shown on Figure 13. Peer-to-Peer communications allow obtaining a performance gain between 8 and 12% according Figure 10: A two-component leakage simulation on a simple geometry with a domain size of 1024*256*256 cells. Figure 11: A two-component leakage simulation on a complex geometry composed of channels with a domain size of 1024*1024*128 cells.
  • 11. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 to the number of GPUs used for the simulation described in Figure 10. Ze communications offer a good scaling of Peer-to-Peer communications, as shown on Figure 12. The inclusion of the progressive mesh also has performance. Sub domains of size 128*128*12 and 14 describes performance in terms of calculations and memory consumption for the simulation presented on Figure 10. Note that the progressive mesh algorithm obtains excellent performance at the beginning of the simulation. The addition of simulation has for consequence a decrease of performance until the convergence of the simulation. In this particular case, all simulation domain is meshed at the end of the simulation shown on Figure 14, which leads to a mesh. In terms of memory consumption, fast apparitions of news lead to have the entire simulation domain in memory after a few iterations. Figure 12: Comparison of performance between Peer communications for the simulation shown on Figure 10. Figure 13: Comparison of performance between the progressive mesh method and the static mesh method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 to the number of GPUs used for the simulation described in Figure 10. Ze communications offer a good scaling but an almost perfect scaling is obtained with the inclusion Peer communications, as shown on Figure 12. f the progressive mesh also has an important beneficial effect on the simulation of size 128*128*128 are considered for these simulations. and 14 describes performance in terms of calculations and memory consumption for the simulation presented on Figure 10. Note that the progressive mesh algorithm obtains excellent performance at the beginning of the simulation. The addition of sub domains simulation has for consequence a decrease of performance until the convergence of the simulation. In this particular case, all simulation domain is meshed at the end of the simulation which leads to a very slight decrease of performance compared to the static In terms of memory consumption, fast apparitions of news sub domains are noted which to have the entire simulation domain in memory after a few iterations. Figure 12: Comparison of performance between Peer-to-Peer communications with zero communications for the simulation shown on Figure 10. Comparison of performance between the progressive mesh method and the static mesh method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment is also presented. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 11 to the number of GPUs used for the simulation described in Figure 10. Zero-copy scaling is obtained with the inclusion an important beneficial effect on the simulation 8 are considered for these simulations. Figures 13 and 14 describes performance in terms of calculations and memory consumption for the simulation presented on Figure 10. Note that the progressive mesh algorithm obtains excellent b domains during the simulation has for consequence a decrease of performance until the convergence of the simulation. In this particular case, all simulation domain is meshed at the end of the simulation, as t decrease of performance compared to the static are noted which with zero-copy Comparison of performance between the progressive mesh method and the static mesh method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment
  • 12. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 Figure 13 also compares performance between two different assignments for GPUs. The first one is a simple assignation which assigns to ne uses the optimization method presented in sec leads to an important difference of performance noted at the convergence of this simulation between the two approaches. due to the fact that the communication cost is more optimized assignment. Since subdomains are added dynamically and connected to e therefore important to optimize these communications in order to reduce the simulation time. The same comparison is also done for the simulation presented on Figure 11, as shown on Figures 15 and 16. The main difference in this situation complex and channelized. Physical simulations on channelized geometry are especially pre on industrial structures. In this case, the progressive mesh method shows excellent results. In terms of memory, method is easily able to simulate on a global simulation domain of size 1024*1024*128 while the static mesh method is unable to perform the simulation. The amount of is indeed too important for this simulation. consumption during the simulation. less important than the static mesh method. A gain of approximatively 50% of memory is noted for this particular simulation. This is due automatically adapts to the evolution of the simulation and so only needed zones of the global simulation domain are meshed. Figure 14: Comparison of memory mesh method for the simulation shown on Figure 10. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 Figure 13 also compares performance between two different assignments for GPUs. The first one is a simple assignation which assigns to new subdomain the first available GPU. The second one presented in section 4.4.2. The comparison of these two methods leads to an important difference of performance. Indeed, a difference of approximatively 30% is simulation between the two approaches. This difference is mostly due to the fact that the communication cost is more important for a simple assignment than an optimized assignment. Since subdomains are added dynamically and connected to each other, it is these communications in order to reduce the simulation time. The same comparison is also done for the simulation presented on Figure 11, as shown on Figures 15 and 16. The main difference in this situation is the geometry of the simulation which is more complex and channelized. Physical simulations on channelized geometry are especially pre In this case, the progressive mesh method shows excellent results. In terms of memory, method is easily able to simulate on a global simulation domain of size 1024*1024*128 while the static mesh method is unable to perform the simulation. The amount of needed is indeed too important for this simulation. Figure 15 shows the evolution of memory consumption during the simulation. The memory cost at the convergence of the simulation is far less important than the static mesh method. A gain of approximatively 50% of memory is noted for this particular simulation. This is due to the fact that the progressive mesh method automatically adapts to the evolution of the simulation and so only needed zones of the global Figure 14: Comparison of memory consumption between the progressive mesh method and the static mesh method for the simulation shown on Figure 10. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 12 Figure 13 also compares performance between two different assignments for GPUs. The first one The second one tion 4.4.2. The comparison of these two methods . Indeed, a difference of approximatively 30% is This difference is mostly assignment than an ach other, it is these communications in order to reduce the simulation time. The same comparison is also done for the simulation presented on Figure 11, as shown on Figures is the geometry of the simulation which is more complex and channelized. Physical simulations on channelized geometry are especially present In this case, the progressive mesh method shows excellent results. In terms of memory, this method is easily able to simulate on a global simulation domain of size 1024*1024*128 and more needed memory the evolution of memory The memory cost at the convergence of the simulation is far less important than the static mesh method. A gain of approximatively 50% of memory is noted to the fact that the progressive mesh method automatically adapts to the evolution of the simulation and so only needed zones of the global consumption between the progressive mesh method and the static
  • 13. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 Figure 15: Comparison of memory consumption between the method for the simulation shown on Figure 11. The comparison of the repartition performance gain (19%) is still noted for this simulation. This proves that a method is important in order to obtain good performance. not need to be fully meshed brings an important gain in performance. The geometry has therefore an important impact on the performance on Figure 16: Comparison of performance between a simple repartition of GPUs with an optimized assignment of GPUs for the simulation shown on Figure 11. 6. CONCLUSION In this paper, an efficient progressive Boltzmann method is presented. This progressive mesh method can be a useful tool in order to perform several types of physical simulations. Its main advantage is that subdomains are automatically added to the simulation by the use of an adapted criterion. This method is also able to save a lot of memory and calculations International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 Figure 15: Comparison of memory consumption between the progressive mesh method and the static mesh method for the simulation shown on Figure 11. The comparison of the repartition of GPUs is also described in Figure 16. A %) is still noted for this simulation. This proves that a dynamic optimization method is important in order to obtain good performance. Moreover, the fact that the domain does not need to be fully meshed brings an important gain in performance. The geometry has therefore an important impact on the performance on the progressive mesh method. Figure 16: Comparison of performance between a simple repartition of GPUs with an optimized assignment of GPUs for the simulation shown on Figure 11. In this paper, an efficient progressive mesh algorithm for physical simulations using the lattice Boltzmann method is presented. This progressive mesh method can be a useful tool in order to perform several types of physical simulations. Its main advantage is that subdomains are ded to the simulation by the use of an adapted criterion. This method is also able and calculations in order to perform simulations on large installations. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 13 progressive mesh method and the static mesh . An important dynamic optimization Moreover, the fact that the domain does not need to be fully meshed brings an important gain in performance. The geometry has therefore Figure 16: Comparison of performance between a simple repartition of GPUs with an optimized assignment mesh algorithm for physical simulations using the lattice Boltzmann method is presented. This progressive mesh method can be a useful tool in order to perform several types of physical simulations. Its main advantage is that subdomains are ded to the simulation by the use of an adapted criterion. This method is also able in order to perform simulations on large installations.
  • 14. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 14 The integration of the progressive mesh method on single-node multi-GPU architecture is also treated. A dynamic optimization of the repartition of GPUs to subdomains is an important factor in order to obtain good performance. The combination of all these contributions allows therefore performing fast physical simulations on all types of geometry. The progressive mesh method is therefore an interesting alternative because it allows obtaining similar or better performances than the usual static mesh method. The progressive mesh algorithm is however limited to the memory of the GPU which is generally far more inferior to the CPU RAM. The creation of new subdomains is indeed possible while there is a sufficient amount of memory on the GPUs. Extensions of this work to cases that require more memory than all GPUs can handle is now under investigation. Data transfer optimizations with the CPU host will therefore be essential to keep good performances. ACKNOWLEDGEMENTS This work has been made possible thanks to collaboration between academic and industrial groups, gathered by the INNOCOLD association. REFERENCES [1] B. Chopard, J.L. Falcone J. Latt, The Lattice Boltzmann advection diffusion model revisited, The European Physical Journal - Special Topics,Vol. 171, pp. 245-249, 2009. [2] S. Gong, P. Cheng, Numerical investigation of droplet motion and coalescence by an improved lattice Boltzmann model for phase transitionsand multiphase flows, Computers & Fluids , Vol. 53, pp. 93- 104, 2012. [3] S. Gong, P. Cheng, A lattice Boltzmann method for liquid vapor phase change heat transfer, Computers & Fluids, Vol. 54, pp. 93-104, 2012. [4] J. Bao, L. Schaeffer, Lattice Boltzmann equation model for multicomponent multi-phase flow with high density ratios, Applied MathematicalModelling, 2012. [5] Nvidia, C. U. D. A. (2011). Nvidia cuda c programming guide. NVIDIA Corporation, 120, 18.8 [6] M. Wittmann, T. Zeiser, G. Hager, G. Wellein, Comparison of different propagation steps for Lattice Boltzmann methods, Computers and Mathematicswith Applications, Vol. 65 pp. 924-935, 2013. [7] J. Tölke, Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA, Computing andVisualization in Science, 1-11, 2008. [8] J. Tölke, M. Krafczyk, TeraFLOP computing on a desktop PC with GPUs for 3D CFD, International Journal of Computational Fluid Dynamics 22(7), pp. 443-456, 2008. [9] F. Kuznik, C.Obrecht, G. Rusaouën, J-J. Roux, LBM based flow simulation using GPU computing processor, Computers and Mathematics withApplications 27, 2009. [10] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, A new approach to the lattice Boltzmann method for graphics processing units, Computersand Mathematics with Applications 61, pp. 3628-3638, 2011. [11] P.R. Rinaldi, E.A Dari, M.J. Vénere, A. Clausse, A Lattice-Boltzmannsolver for 3D fluid on GPU, Simulation Modeling Pratice and Theory 25,pp. 163-171, 2012. [12] P. Bailey, J. Myre, S. Walsh, D. Lilja, M. Saar, Accelerating lattice boltzmann fluid flows using graphics processors, International Conferenceon Parallel Processing, pp. 550-557, 2009. [13] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Multi-GPU implementation of the lattice Boltzmann method, Computers and Mathematicswith Applications, 80, pp. 269-275, 2013. [14] X. Li, Y. Zhang, X. Wang, W. Ge, GPU-based numerical simulation of multi-phase flow in porous media using multiple-relaxation-time latticeBoltzmann method, Chemical Engineering Science, Vol. 102, pp. 209-219,2013. [15] M. Januszewski, M. Kostur, Sailfish: A flexible multi-GPU implementationof the lattice Boltzmann method, Computer Physics Communications,Vol. 185, pp. 2350-2368, 2014. [16] F. Jiang, C. Hu, Numerical simulation of a rising CO2 droplet in the initial accelerating stage by a multiphase lattice Boltzmann method,Applied Ocean Research, Vol. 45, pp. 1-9, 2014.
  • 15. International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015 15 [17] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, Multi-GPU Implementation of a Hybrid Thermal Lattice Boltzmann Solver using theTheLMA Framework, Computers and Fluids, Vol. 80, pp. 269275, 2013. [18] C. Rosales, Multiphase LBM Distributed over Multiple GPUs, CLUSTER’11 Proceedings of the 2011 IEEE International Conference onCluster Computing, pp. 1-7, 2011. [19] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Scalable lattice Boltzmann solvers for CUDA GPU clusters, Parallel Computing, Vol.39, pp. 259-270, 2013. [20] J. Habich, C. Feichtinger, H. Köstler, G. Hager, G. Wellein, Performance engineering for the lattice Boltzmann method on GPGPUs: Architecturalrequirements and performance results, Computer & Fluids, Vol. 80, pp.276-282, 2013. [21] C. Feichtinger, J. Habich, H. Köstler, U. Rüde, T. Aoki, Performance Modeling and Analysis of Heterogeneous Lattice Boltzmann Simulationson CPU-GPU Clusters, Parallel Computing, 2014. AUTHORS Julien Duchateau is a PhD student in computer science at the Université du Littoral Côte d’Opale in France. His main research interest are massive parallelism on CPUs and GPUs, physical simulations and computer graphics. François Rousselle is an associate professor in computer science at the Université du Littoral Côte d’Opale in France. His main research interests are computer graphics, physical simulations, virtual reality and massive parallelism. Nicolas Maquignon is a PhD student in simulation and numerical physics at the Université du Littoral Côte d’Opale. His main research interests are numerical physics, numerical mathematics and numerical modeling. Christophe Renaud is a professor in computer science at the Université du Littoral Côte d’Opale in France. His main research interests are computer graphics, virtual reality, physical simulations and massive parallelism. Gilles Roussel is an associate professor in automatic at the Université du Littoral Côte d’Opale in France. His main research interests are automatic, signal processing, physical simulations and industrial computing.