SlideShare a Scribd company logo
SimpleArray between Python and C++
Yung-Yu Chen
(yyc@sciwork.dev)
PyCon Taiwan 2023.9.2
1
Speed is the king
リニア中央新幹線 (Chuo Shinkansen)
https://linear-chuo-shinkansen.jr-central.co.jp/about/design/
天下武功，無堅不破，唯快不破
(Of all martial arts under heaven, none is unbreakable; only speed cannot be beaten.)
500km/h high speed rail by 2038
2
Array: HPC fundamental
[Figure: three regularly allocated buffers, contrasted under a regularly accessing algorithm and a randomly accessing algorithm.]
die shots of popular chips, quoted from SemiAnalysis: https://www.semianalysis.com/p/apple-m2-die-shot-and-architecture
Arrays are the data structure for cache optimization, data parallelism, SIMD, and GPUs: all the techniques of high-performance computing (HPC)
3
Dimensionality
one-dimensional array, two-dimensional array, three-dimensional array, arrays of higher dimension ...
Data is always laid out linearly in memory, regardless of dimension.
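As a minimal illustration (not from the slides; Linear2D and at() are made-up names, not part of SimpleArray), a row-major layout stores element (i, j) of an m × n array at flat offset i * n + j, so any multi-dimensional index reduces to one linear offset into a single buffer.

#include <cstddef>
#include <vector>

// Minimal sketch: a 2D view over one linear buffer using row-major indexing.
struct Linear2D
{
    std::size_t nrow, ncol;
    std::vector<double> buffer; // one contiguous allocation

    Linear2D(std::size_t m, std::size_t n) : nrow(m), ncol(n), buffer(m * n) {}

    // Element (i, j) lives at linear offset i * ncol + j.
    double & at(std::size_t i, std::size_t j) { return buffer[i * ncol + j]; }
};

int main()
{
    Linear2D arr(3, 4);
    arr.at(1, 2) = 7.0; // writes buffer[1 * 4 + 2] == buffer[6]
    return 0;
}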
4
Hand-craft array library
and don't forget that there will be some
5
Make your own array
• Make SimpleArray
• A vehicle to carry ad hoc optimization in addition to general array operations
buffer
management
data access
void calc_distance(
size_t const n
, double const * x
, double const * y
, double * r)
{
for (size_t i = 0 ; i < n ; ++i)
{
r[i] = std::sqrt(x[i]*x[i] + y[i]*y[i]);
}
}
vmovupd ymm0, ymmword [rsi + r9*8]
vmulpd ymm0, ymm0, ymm0
vmovupd ymm1, ymmword [rdx + r9*8]
vmulpd ymm1, ymm1, ymm1
vaddpd ymm0, ymm0, ymm1
vsqrtpd ymm0, ymm0
movupd xmm0, xmmword [rsi + r8*8]
mulpd xmm0, xmm0
movupd xmm1, xmmword [rdx + r8*8]
mulpd xmm1, xmm1
addpd xmm1, xmm0
sqrtpd xmm0, xmm1
AVX: 256-bit-wide vectorization
SSE: 128-bit-wide vectorization
C++ code
6
Key components: buffer management and data access.
Multi-dimensional array
• SimpleArray is a class template
• Holds a contiguous memory buffer
• Provides multi-dimensional accessors to its elements
• Designed for fundamental types; might be used for POD types (untested)
[Figure: an m × n array with elements (0,0), (0,1), ..., (m,n) laid out in a single 1D array indexed 0, 1, 2, ...]
7
SimpleArray is not C++ std::vector
SimpleArray is fixed size
• Only allocates memory on construction
• Multi-dimensional access: operator()
std::vector is variable size
• Buffer may be invalidated
• Implicit memory allocation (reallocation)
• One-dimensional access: operator[]
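A minimal sketch (not from the slides) of why the distinction matters: growing a std::vector may reallocate and invalidate previously taken pointers, while a buffer allocated once, as SimpleArray does through ConcreteBuffer, keeps its address stable for its whole lifetime.

#include <cstdio>
#include <memory>
#include <vector>

int main()
{
    // Variable-size container: growth may reallocate and move the data.
    std::vector<double> vec(4, 1.0);
    double * p = vec.data();
    vec.push_back(2.0); // p may now dangle; dereferencing it is undefined
    (void)p;

    // Fixed-size buffer allocated once (the SimpleArray/ConcreteBuffer style):
    // the address stays valid for the lifetime of the buffer.
    std::unique_ptr<double[]> buf(new double[4]{1.0, 1.0, 1.0, 1.0});
    double * q = buf.get();
    q[2] = 3.0;
    std::printf("%f\n", q[2]); // q never moves because nothing reallocates
    return 0;
}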
8
Features
• Header-only C++ code
• Simple memory allocation by using fixed size
• Separate buffer ownership and meta data
• Meta data for multi-dimensional access:
• Type
• Shape and stride
• Custom for application-specific behavior:
• Ghost index (in the first dimension)
9
Buffer management
• Only constructor allocates memory; only destructor deallocates memory
• Allocation and deallocation only happen when necessary
• Management is implemented in the class ConcreteBuffer
• ConcreteBuffer is a member datum of SimpleArray
template <typename T>
class SimpleArray
{
using buffer_type = ConcreteBuffer;
std::shared_ptr<buffer_type> m_buffer;
};
10
ConcreteBuffer
class ConcreteBuffer
: public std::enable_shared_from_this<ConcreteBuffer>
{
private:
// Protect the constructors
struct ctor_passkey {};
public:
static std::shared_ptr<ConcreteBuffer> construct(size_t nbytes)
{
return std::make_shared<ConcreteBuffer>(nbytes, ctor_passkey());
}
...
};
// Turn off move
ConcreteBuffer(ConcreteBuffer &&) = delete;
ConcreteBuffer & operator=(ConcreteBuffer &&) = delete;
// Use a deleter to customize memory management with unique_ptr
using data_deleter_type = detail::ConcreteBufferDataDeleter;
using unique_ptr_type = std::unique_ptr<int8_t, data_deleter_type>;
size_t m_nbytes; // Remember the amount of bytes
unique_ptr_type m_data; // Point to the allocated memory
ConcreteBuffer(size_t nbytes, const ctor_passkey &)
: m_nbytes(nbytes)
, m_data(allocate(nbytes))
{}
// Allocate memory
static unique_ptr_type allocate(size_t nbytes)
{
unique_ptr_type ret(nullptr, data_deleter_type());
if (0 != nbytes)
{
ret = unique_ptr_type(new int8_t[nbytes], data_deleter_type());
}
return ret;
}
1. Exclusively manage it using std::shared_ptr
1.1. Turn off move to prevent the pointer from going away
2. Allocate memory and keep it in a std::unique_ptr
3. Use a deleter to customize deallocation from outside
11
Design a custom management interface
using data_deleter_type = detail::ConcreteBufferDataDeleter;
using unique_ptr_type = std::unique_ptr<int8_t, data_deleter_type>;
struct ConcreteBufferDataDeleter
{
using remover_type = ConcreteBufferRemover;
explicit
ConcreteBufferDataDeleter(
std::unique_ptr<remover_type> && remover_in)
: remover(std::move(remover_in))
{}
void operator()(int8_t * p) const
{
if (!remover)
{
// Array deletion if no remover is available.
delete[] p;
}
else
{
(*remover)(p);
}
}
std::unique_ptr<remover_type> remover{nullptr};
};
struct ConcreteBufferRemover
{
virtual ~ConcreteBufferRemover() = default;
virtual void operator()(int8_t * p) const
{
delete[] p;
}
};
struct ConcreteBufferNoRemove
: public ConcreteBufferRemover
{
void operator()(int8_t *) const override {}
};
The unique_ptr deleter is a proxy to the ConcreteBuffer remover.
The remover is a polymorphic functor type.
This allows the buffer to work with a foreign subsystem for memory management.
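A minimal sketch of that extension point, under the assumption of a hypothetical foreign allocator (foreign_free is made up for illustration): a remover subclass hands the pointer back to whatever subsystem owns the memory instead of calling delete[].

#include <cstdint>

// Hypothetical deallocation routine standing in for a foreign memory
// subsystem (for example a pool allocator owned by another library).
inline void foreign_free(int8_t *) { /* hand the pointer back to the owner */ }

// Illustration only: plugs into the remover/deleter chain shown above.
struct ForeignRemover : public ConcreteBufferRemover
{
    void operator()(int8_t * p) const override
    {
        foreign_free(p); // return the memory instead of delete[]
    }
};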
12
Data access
template <typename T>
class SimpleArray
{
public:
// Type information.
using value_type = T;
static constexpr size_t ITEMSIZE = sizeof(value_type);
using buffer_type = ConcreteBuffer;
/// Contiguous data buffer for the array.
std::shared_ptr<buffer_type> m_buffer;
using shape_type = small_vector<size_t>;
/// Each element in this vector is the number of elements in the
/// corresponding dimension.
shape_type m_shape;
/// Each element in this vector is the number of elements (not number of
/// bytes) to skip for advancing an index in the corresponding dimension.
shape_type m_stride;
};
Type information is made available at compile time, so the compiler can use it to generate fast instructions.
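A small illustrative check (assuming the SimpleArray definition above is complete in this translation unit): because value_type and ITEMSIZE are compile-time constants, they can feed static_assert and template machinery instead of runtime branches.

#include <type_traits>

// Both checks compile away entirely; the element type and item size are known
// to the compiler rather than discovered at runtime.
static_assert(std::is_same<SimpleArray<double>::value_type, double>::value,
              "value_type is fixed by the template parameter");
static_assert(SimpleArray<double>::ITEMSIZE == sizeof(double),
              "ITEMSIZE is a compile-time constant");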
13
Shape and stride
explicit SimpleArray(size_t length)
: m_buffer(buffer_type::construct(length * ITEMSIZE))
, m_shape{length}
, m_stride{1}
, m_body(m_buffer->data<T>())
{}
explicit SimpleArray(shape_type const & shape)
: m_shape(shape)
, m_stride(calc_stride(m_shape))
{
if (!m_shape.empty())
{
m_buffer = buffer_type::construct(
m_shape[0] * m_stride[0] * ITEMSIZE);
m_body = m_buffer->data<T>();
}
}
static shape_type calc_stride(shape_type const & shape)
{
shape_type stride(shape.size());
if (!shape.empty())
{
stride[shape.size() - 1] = 1;
for (size_t it = shape.size() - 1; it > 0; --it)
{
stride[it - 1] = stride[it] * shape[it];
}
}
return stride;
}
One-dimensional shape
Multi-dimensional shape
Stride is the number of elements to be
skipped when the index is changed by 1 in the
corresponding dimension:
[Figure: the m × n array with elements (0,0) through (m,n), illustrating the stride of each dimension.]
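A quick worked example (a standalone mirror of calc_stride using std::vector, so it does not depend on access to the member function shown above): for shape {2, 3, 4} the stride comes out as {12, 4, 1}, i.e. advancing the first index by one skips 3 * 4 = 12 elements.

#include <cassert>
#include <cstddef>
#include <vector>

// Standalone mirror of calc_stride, for illustration only.
std::vector<std::size_t> calc_stride_demo(std::vector<std::size_t> const & shape)
{
    std::vector<std::size_t> stride(shape.size());
    if (!shape.empty())
    {
        stride[shape.size() - 1] = 1;
        for (std::size_t it = shape.size() - 1; it > 0; --it)
        {
            stride[it - 1] = stride[it] * shape[it];
        }
    }
    return stride;
}

int main()
{
    std::vector<std::size_t> const stride = calc_stride_demo({2, 3, 4});
    assert(stride[0] == 12 && stride[1] == 4 && stride[2] == 1);
    return 0;
}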
14
Multi-dimensional accessor
template <typename T>
class SimpleArray
{
...
template <typename... Args>
value_type const & operator()(Args... args) const
{ return *vptr(args...); }
template <typename... Args>
value_type & operator()(Args... args)
{ return *vptr(args...); }
template <typename... Args>
value_type const * vptr(Args... args) const
{ return m_body + buffer_offset(m_stride, args...); }
template <typename... Args>
value_type * vptr(Args... args)
{ return m_body + buffer_offset(m_stride, args...); }
...
};
namespace detail
{
// Use recursion in templates.
template <size_t D, typename S>
size_t buffer_offset_impl(S const &)
{
return 0;
}
template <size_t D, typename S, typename Arg, typename... Args>
size_t buffer_offset_impl
(S const & strides, Arg arg, Args... args)
{
return arg * strides[D] +
buffer_offset_impl<D + 1>(strides, args...);
}
} /* end namespace detail */
template <typename S, typename... Args>
size_t buffer_offset(S const & strides, Args... args)
{
return detail::buffer_offset_impl<0>(strides, args...);
}
Element and pointer accessors implemented with a variadic template.
Recursive implementation (compile-time) to save conditional branching.
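An illustrative expansion (assuming the buffer_offset template above is visible here): with strides {12, 4, 1}, the call below unrolls at compile time to 1*12 + 2*4 + 3*1 = 23, leaving no loop or branch in the generated code.

#include <cassert>
#include <cstddef>
#include <vector>

void offset_example()
{
    std::vector<std::size_t> const strides{12, 4, 1};
    // The recursion over the argument pack expands at compile time:
    // 1*strides[0] + 2*strides[1] + 3*strides[2] == 23.
    assert(buffer_offset(strides, 1, 2, 3) == 23);
}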
15
Ghost (negative) index
• Ghost: elements indexed by negative integer
• Similar to the negative index for the POD array
• No overhead for normal arrays that start with 0 index
// C-style POD array.
int32_t data[100];
// Make a pointer to the head address of the array.
int32_t * pdata = data;
// Make another pointer to the 50-th element from the head of the array.
int32_t * odata = pdata + 50;
[Figure: pdata points to the head of data and odata points to element 50; element 40 can be reached as data[40] == *(data+40) or, through the offset pointer, as odata[-10] == *(odata-10).]
16
Get ghost information on construction
template <typename T>
class SimpleArray
{
private:
// Number of ghost elements
size_t m_nghost = 0;
// Starting address of non-ghost
value_type * m_body = nullptr;
};
static T * calc_body(T * data, shape_type const & stride,
size_t nghost)
{
if (nullptr == data || stride.empty() || 0 == nghost)
{
// Do nothing.
}
else
{
shape_type shape(stride.size(), 0);
shape[0] = nghost;
data += buffer_offset(stride, shape);
}
return data;
}
SimpleArray(SimpleArray const & other)
: m_buffer(other.m_buffer->clone())
, m_shape(other.m_shape)
, m_stride(other.m_stride)
, m_nghost(other.m_nghost)
, m_body(calc_body(m_buffer->data<T>(),
m_stride, other.m_nghost))
{}
SimpleArray(SimpleArray && other) noexcept
: m_buffer(std::move(other.m_buffer))
, m_shape(std::move(other.m_shape))
, m_stride(std::move(other.m_stride))
, m_nghost(other.m_nghost)
, m_body(other.m_body)
{}
Keep the ghost number and body address. Calculate the body address. Assign ghost and body information in the constructors.
No additional overhead during element access.
17
Make your own small vector
template <typename T, size_t N = 3>
class small_vector
{
public:
T const * data() const { return m_head; }
T * data() { return m_head; }
private:
T * m_head = nullptr;
unsigned int m_size = 0;
unsigned int m_capacity = N;
std::array<T, N> m_data;
};
explicit small_vector(size_t size)
: m_size(static_cast<unsigned int>(size))
{
if (m_size > N)
{
m_capacity = m_size;
// Allocate only when the size is too large.
m_head = new T[m_capacity];
}
else
{
m_capacity = N;
m_head = m_data.data();
}
}
small_vector() { m_head = m_data.data(); }
~small_vector()
{
if (m_head != m_data.data() && m_head != nullptr)
{
delete[] m_head;
m_head = nullptr;
}
}
Spend a small amount of memory to reduce dynamic memory allocation.
Allocate memory only when the size is too large.
No deallocation is needed if no allocation happened.
[Figure: std::vector starts with an empty data pointer and points to a dynamically allocated buffer after push-back; (our) small_vector points to a preallocated buffer initially and after the first couple of push-backs.]
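The annotations above mention push-back but the slide does not show it; below is a hedged sketch (my own, not the actual implementation) of what a small_vector::push_back could look like under the same layout: keep using the preallocated storage until the capacity is exceeded, then grow into a heap buffer.

// Hypothetical member function for the small_vector above (illustration only).
// Requires <algorithm> for std::copy_n.
void push_back(T const & value)
{
    if (m_size == m_capacity)
    {
        // Storage is full: grow into a larger heap buffer.
        unsigned int const new_capacity = m_capacity * 2;
        T * new_head = new T[new_capacity];
        std::copy_n(m_head, m_size, new_head);
        if (m_head != m_data.data()) { delete[] m_head; } // free old heap buffer
        m_head = new_head;
        m_capacity = new_capacity;
    }
    m_head[m_size] = value;
    ++m_size;
}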
18
Small vector, cont.
Copy constructor may also allocate memory.
Move constructor should not allocate memory.
small_vector(small_vector const & other)
: m_size(other.m_size)
{
if (other.m_head == other.m_data.data())
{
m_capacity = N;
m_head = m_data.data();
}
else
{
m_capacity = m_size;
m_head = new T[m_capacity];
}
std::copy_n(other.m_head, m_size, m_head);
}
small_vector(small_vector && other) noexcept
: m_size(other.m_size)
{
if (other.m_head == other.m_data.data())
{
m_capacity = N;
std::copy_n(other.m_data.begin(), m_size,
m_data.begin());
m_head = m_data.data();
}
else
{
m_capacity = m_size;
m_head = other.m_head;
other.m_size = 0;
other.m_capacity = N;
other.m_head = other.m_data.data();
}
}
19
Constructors are the key to making C++ work
Numpy for dense array
• The numpy library provides everything we need for arrays in Python
• Convention of using numpy:
• The created array copies the Python data:
>>> import numpy as np
>>> lst = [1, 1, 2, 3, 5]
>>> print('A list:', lst)
A list: [1, 1, 2, 3, 5]
>>> array = np.array(lst)
>>> print('An array:', np.array(array))
An array: [1 1 2 3 5]
Create a list: Create an array:
>>> array[2] = 8
>>> print(array) # The array changes.
[1 1 8 3 5]
>>> print(lst) # The input sequence is not changed.
[1, 1, 2, 3, 5]
20
Python is slow
[Figure: grid points covering the computing domain (and solution); boundary values u = 0 on three sides and u(y) = sin(πy) on the remaining side; neighboring grid indices (i ± 1, j ± 1) around a point (i, j).]
Equation for the domain interior:
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0 < x < 1;\ 0 < y < 1)$$
Boundary conditions:
$$u(0, y) = 0, \quad u(1, y) = \sin(\pi y) \quad (0 \le y \le 1)$$
$$u(x, 0) = 0, \quad u(x, 1) = 0 \quad (0 \le x \le 1)$$
Use the point-Jacobi method to march:
$$u^{n+1}(x_i, y_j) = \frac{u^n(x_{i+1}, y_j) + u^n(x_{i-1}, y_j) + u^n(x_i, y_{j+1}) + u^n(x_i, y_{j-1})}{4}$$
Let us show how slow Python is by solving the Laplace equation, a simple PDE
21
Python vs numpy
def solve_array():
u = uoriginal.copy() # Input from outer scope
un = u.copy() # Create the buffer for the next time step
converged = False
step = 0
while not converged:
step += 1
un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] +
u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
norm = np.abs(un-u).max()
u[...] = un[...]
converged = True if norm < 1.e-5 else False
return u, step, norm
def solve_python_loop():
u = uoriginal.copy() # Input from outer scope
un = u.copy() # Create the buffer for the next time step
converged = False
step = 0
# Outer loop.
while not converged:
step += 1
# Inner loops. One for x and the other for y.
for it in range(1, nx-1):
for jt in range(1, nx-1):
un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4
norm = np.abs(un-u).max()
u[...] = un[...]
converged = True if norm < 1.e-5 else False
return u, step, norm
>>> with Timer():
>>> u, step, norm = solve_array()
Wall time: 0.0552309 s
>>> with Timer():
>>> u, step, norm = solve_python_loop()
Wall time: 4.79688 s
numpy array for point-Jacobi
python loop for point-Jacobi
The array version is 87x
faster than the loop version
22
Speed up by C++
std::tuple<modmesh::SimpleArray<double>, size_t, double>
solve1(modmesh::SimpleArray<double> u)
{
const size_t nx = u.shape(0);
modmesh::SimpleArray<double> un = u;
bool converged = false;
size_t step = 0;
double norm;
while (!converged)
{
norm = 0.0;
++step;
for (size_t it=1; it<nx-1; ++it)
{
for (size_t jt=1; jt<nx-1; ++jt)
{
un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
double const v = std::abs(un(it,jt) - u(it,jt));
if (v > norm)
{
norm = v;
}
}
}
// additional loop slows down: norm = calc_norm_amax(u, un);
if (norm < 1.e-5) { converged = true; }
u = un;
}
return std::make_tuple(u, step, norm);
}
// Expose the C++ helper to Python
PYBIND11_MODULE(solve_cpp, m)
{
// Boilerplate for using the SimpleArray with Python
{
modmesh::python::import_numpy();
modmesh::python::wrap_ConcreteBuffer(m);
modmesh::python::wrap_SimpleArray(m);
}
m.def
(
"solve_cpp"
, [](pybind11::array_t<double> & uin)
{
auto ret = solve1(modmesh::python::makeSimpleArray(uin));
return std::make_tuple(
modmesh::python::to_ndarray(std::get<0>(ret)),
std::get<1>(ret),
std::get<2>(ret));
}
);
}
>>> with Timer():
>>> u, step, norm = solve_cpp.solve_cpp(uoriginal)
Wall time: 0.0251369 s
C++ loop for point-Jacobi exposed to Python
23
Runtime and maintainability
24
for it in range(1, nx-1):
for jt in range(1, nx-1):
un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4
un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] +
u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
for (size_t it=1; it<nx-1; ++it)
{
for (size_t jt=1; jt<nx-1; ++jt)
{
un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
}
}
Point-Jacobi method:
$$u^{n+1}(x_i, y_j) = \frac{u^n(x_{i+1}, y_j) + u^n(x_{i-1}, y_j) + u^n(x_i, y_{j+1}) + u^n(x_i, y_{j-1})}{4}$$
Python nested loop: 4.797 s (1x)
Numpy array: 0.055 s (87x)
C++ nested loop: 0.025 s (192x)
Conversion between Python containers and C++
• Not really "conversion" or "translation", but data copy
Reconstruct from Python to C++: a Python list holds references to values that are randomly allocated; the values are copied one by one into a C++ std::vector<double>.
Reconstruct from C++ to Python: the values of a std::vector<double> are copied back into a Python list of references; sequential construction makes a more regular memory layout.
[Figure: the two copy directions between a Python list (references to scattered values) and a C++ std::vector<double> (one contiguous buffer).]
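A minimal sketch of that copy (my addition; the module name copy_demo and the function sum_list are made up): pybind11's stl.h converter builds a brand-new std::vector<double> from the Python list, so the values are copied rather than shared.

#include <pybind11/pybind11.h>
#include <pybind11/stl.h> // enables list <-> std::vector conversion (by copy)
#include <vector>

namespace py = pybind11;

// Each call converts the incoming Python list into a new std::vector; mutating
// vec would not touch the original Python objects.
double sum_list(std::vector<double> vec)
{
    double total = 0.0;
    for (double v : vec) { total += v; }
    return total;
}

PYBIND11_MODULE(copy_demo, m)
{
    m.def("sum_list", &sum_list);
}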
25
Zero-copy to speed up
[Figure: a memory buffer a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn shared across languages, with a Python app using an ndarray and a C++ app using a C++ container on the same buffer.]
👍 bottom (C++) - up (Python) design: the C++ container manages the shared buffer and the ndarray accesses it.
👎 top (Python) - down (C++) design: the ndarray manages the shared buffer and the C++ container accesses it.
A Python-controlled lifecycle is bad for a long-running process; C++ allows fine-grained control of all resources.
26
Buffer interface and numpy ndarray
template <typename S>
std::enable_if_t<is_simple_array_v<S>, pybind11::array>
to_ndarray(S && sarr)
{
namespace py = pybind11;
using T = typename std::remove_reference_t<S>::value_type;
std::vector<size_t> const shape(sarr.shape().begin(),
sarr.shape().end());
std::vector<size_t> stride(sarr.stride().begin(),
sarr.stride().end());
for (size_t & v : stride) { v *= sarr.itemsize(); }
return py::array(
/* Numpy dtype */
py::detail::npy_format_descriptor<T>::dtype(),
/* Buffer dimensions */
shape,
/* Strides (in bytes) for each index */
stride,
/* Pointer to buffer */
sarr.data(),
/* Owning Python object */
py::cast(sarr.buffer().shared_from_this()));
}
template <typename T>
static SimpleArray<T> makeSimpleArray(
pybind11::array_t<T> & ndarr)
{
typename SimpleArray<T>::shape_type shape;
for (ssize_t i = 0; i < ndarr.ndim(); ++i)
{
shape.push_back(ndarr.shape(i));
}
std::shared_ptr<ConcreteBuffer> const buffer =
ConcreteBuffer::construct(
ndarr.nbytes(),
ndarr.mutable_data(),
std::make_unique<ConcreteBufferNdarrayRemover>(
ndarr));
return SimpleArray<T>(shape, buffer);
}
Take the SimpleArray buffer to make an ndarray; take the ndarray buffer to make a SimpleArray.
27
Conclusion
It's important to always keep Python:
• Think in arrays for speed
• Speed is the king
• Don't be afraid of C++
• Always keep Python -- it's your UI
28