SlideShare a Scribd company logo
SimpleArray between Python and C++
Yung-Yu Chen
(yyc@sciwork.dev)
PyCon Taiwan 2023.9.2
1
Speed is the king
リニア中央新幹線 (Chuo Shinkansen)
https://linear-chuo-shinkansen.jr-central.co.jp/about/design/
天下武功，無堅不破，唯快不破
(Of all martial arts under heaven, none is unbreakable; only speed cannot be beaten.)
500km/h high speed rail by 2038
2
Array: HPC fundamental
[Figure: three regularly allocated buffers, contrasted under a regularly accessing algorithm and a randomly accessing algorithm.]
die shots of popular chips, quoted from SemiAnalysis: https://www.semianalysis.com/p/apple-m2-die-shot-and-architecture
Arrays are the data structure for cache optimization, data parallelism, SIMD, and GPUs: all the techniques of high-performance computing (HPC)
3
Dimensionality
one-dimensional array, two-dimensional array, three-dimensional array, arrays of higher dimension ...
Data is always laid out linearly in memory, regardless of dimension.
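As a minimal illustration (not from the slides; Linear2D and at() are made-up names, not part of SimpleArray), a row-major layout stores element (i, j) of an m × n array at flat offset i * n + j, so any multi-dimensional index reduces to one linear offset into a single buffer.

#include <cstddef>
#include <vector>

// Minimal sketch: a 2D view over one linear buffer using row-major indexing.
struct Linear2D
{
    std::size_t nrow, ncol;
    std::vector<double> buffer; // one contiguous allocation

    Linear2D(std::size_t m, std::size_t n) : nrow(m), ncol(n), buffer(m * n) {}

    // Element (i, j) lives at linear offset i * ncol + j.
    double & at(std::size_t i, std::size_t j) { return buffer[i * ncol + j]; }
};

int main()
{
    Linear2D arr(3, 4);
    arr.at(1, 2) = 7.0; // writes buffer[1 * 4 + 2] == buffer[6]
    return 0;
}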
4
Hand-craft array library
and don't forget that there will be some
5
Make your own array
• Make SimpleArray
• A vehicle to carry ad hoc optimization in addition to general array operations
buffer
management
data access
void calc_distance(
size_t const n
, double const * x
, double const * y
, double * r)
{
for (size_t i = 0 ; i < n ; ++i)
{
r[i] = std::sqrt(x[i]*x[i] + y[i]*y[i]);
}
}
vmovupd ymm0, ymmword [rsi + r9*8]
vmulpd ymm0, ymm0, ymm0
vmovupd ymm1, ymmword [rdx + r9*8]
vmulpd ymm1, ymm1, ymm1
vaddpd ymm0, ymm0, ymm1
vsqrtpd ymm0, ymm0
movupd xmm0, xmmword [rsi + r8*8]
mulpd xmm0, xmm0
movupd xmm1, xmmword [rdx + r8*8]
mulpd xmm1, xmm1
addpd xmm1, xmm0
sqrtpd xmm0, xmm1
AVX: 256-bit-wide vectorization
SSE: 128-bit-wide vectorization
C++ code
6
Key components: buffer management and data access.
Multi-dimensional array
• SimpleArray is a class template
• Holds a contiguous memory buffer
• Provides multi-dimensional accessors to its elements
• Designed for fundamental types; might be used for POD types (untested)
[Figure: an m × n array with elements (0,0), (0,1), ..., (m,n) laid out in a single 1D array indexed 0, 1, 2, ...]
7
SimpleArray is not C++ std::vector
SimpleArray is fixed size
• Only allocates memory on construction
• Multi-dimensional access: operator()
std::vector is variable size
• Buffer may be invalidated
• Implicit memory allocation (reallocation)
• One-dimensional access: operator[]
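A minimal sketch (not from the slides) of why the distinction matters: growing a std::vector may reallocate and invalidate previously taken pointers, while a buffer allocated once, as SimpleArray does through ConcreteBuffer, keeps its address stable for its whole lifetime.

#include <cstdio>
#include <memory>
#include <vector>

int main()
{
    // Variable-size container: growth may reallocate and move the data.
    std::vector<double> vec(4, 1.0);
    double * p = vec.data();
    vec.push_back(2.0); // p may now dangle; dereferencing it is undefined
    (void)p;

    // Fixed-size buffer allocated once (the SimpleArray/ConcreteBuffer style):
    // the address stays valid for the lifetime of the buffer.
    std::unique_ptr<double[]> buf(new double[4]{1.0, 1.0, 1.0, 1.0});
    double * q = buf.get();
    q[2] = 3.0;
    std::printf("%f\n", q[2]); // q never moves because nothing reallocates
    return 0;
}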
8
Features
• Header-only C++ code
• Simple memory allocation by using fixed size
• Separate buffer ownership and meta data
• Meta data for multi-dimensional access:
• Type
• Shape and stride
• Custom for application-specific behavior:
• Ghost index (in the first dimension)
9
Buffer management
• Only constructor allocates memory; only destructor deallocates memory
• Allocation and deallocation only happen when necessary
• Management is implemented in the class ConcreteBuffer
• ConcreteBuffer is a member datum of SimpleArray
template <typename T>
class SimpleArray
{
using buffer_type = ConcreteBuffer;
std::shared_ptr<buffer_type> m_buffer;
};
10
ConcreteBuffer
class ConcreteBuffer
: public std::enable_shared_from_this<ConcreteBuffer>
{
private:
// Protect the constructors
struct ctor_passkey {};
public:
static std::shared_ptr<ConcreteBuffer> construct(size_t nbytes)
{
return std::make_shared<ConcreteBuffer>(nbytes, ctor_passkey());
}
...
};
// Turn off move
ConcreteBuffer(ConcreteBuffer &&) = delete;
ConcreteBuffer & operator=(ConcreteBuffer &&) = delete;
// Use a deleter to customize memory management with unique_ptr
using data_deleter_type = detail::ConcreteBufferDataDeleter;
using unique_ptr_type = std::unique_ptr<int8_t, data_deleter_type>;
size_t m_nbytes; // Remember the amount of bytes
unique_ptr_type m_data; // Point to the allocated memory
ConcreteBuffer(size_t nbytes, const ctor_passkey &)
: m_nbytes(nbytes)
, m_data(allocate(nbytes))
{}
// Allocate memory
static unique_ptr_type allocate(size_t nbytes)
{
unique_ptr_type ret(nullptr, data_deleter_type());
if (0 != nbytes)
{
ret = unique_ptr_type(new int8_t[nbytes], data_deleter_type());
}
return ret;
}
1. Exclusively manage it using std::shared_ptr
1.1. Turn off move to prevent the pointer from going away
2. Allocate memory and keep it in a std::unique_ptr
3. Use a deleter to customize deallocation from outside
11
Design a custom management interface
using data_deleter_type = detail::ConcreteBufferDataDeleter;
using unique_ptr_type = std::unique_ptr<int8_t, data_deleter_type>;
struct ConcreteBufferDataDeleter
{
using remover_type = ConcreteBufferRemover;
explicit
ConcreteBufferDataDeleter(
std::unique_ptr<remover_type> && remover_in)
: remover(std::move(remover_in))
{}
void operator()(int8_t * p) const
{
if (!remover)
{
// Array deletion if no remover is available.
delete[] p;
}
else
{
(*remover)(p);
}
}
std::unique_ptr<remover_type> remover{nullptr};
};
struct ConcreteBufferRemover
{
virtual ~ConcreteBufferRemover() = default;
virtual void operator()(int8_t * p) const
{
delete[] p;
}
};
struct ConcreteBufferNoRemove
: public ConcreteBufferRemover
{
void operator()(int8_t *) const override {}
};
The unique_ptr deleter is a proxy to the ConcreteBuffer remover.
The remover is a polymorphic functor type.
This allows the buffer to work with a foreign subsystem for memory management.
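A minimal sketch of that extension point, under the assumption of a hypothetical foreign allocator (foreign_free is made up for illustration): a remover subclass hands the pointer back to whatever subsystem owns the memory instead of calling delete[].

#include <cstdint>

// Hypothetical deallocation routine standing in for a foreign memory
// subsystem (for example a pool allocator owned by another library).
inline void foreign_free(int8_t *) { /* hand the pointer back to the owner */ }

// Illustration only: plugs into the remover/deleter chain shown above.
struct ForeignRemover : public ConcreteBufferRemover
{
    void operator()(int8_t * p) const override
    {
        foreign_free(p); // return the memory instead of delete[]
    }
};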
12
Data access
template <typename T>
class SimpleArray
{
public:
// Type information.
using value_type = T;
static constexpr size_t ITEMSIZE = sizeof(value_type);
using buffer_type = ConcreteBuffer;
/// Contiguous data buffer for the array.
std::shared_ptr<buffer_type> m_buffer;
using shape_type = small_vector<size_t>;
/// Each element in this vector is the number of elements in the
/// corresponding dimension.
shape_type m_shape;
/// Each element in this vector is the number of elements (not number of
/// bytes) to skip for advancing an index in the corresponding dimension.
shape_type m_stride;
};
Type information is made available at compile time, so the compiler can use it to generate fast instructions.
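A small illustrative check (assuming the SimpleArray definition above is complete in this translation unit): because value_type and ITEMSIZE are compile-time constants, they can feed static_assert and template machinery instead of runtime branches.

#include <type_traits>

// Both checks compile away entirely; the element type and item size are known
// to the compiler rather than discovered at runtime.
static_assert(std::is_same<SimpleArray<double>::value_type, double>::value,
              "value_type is fixed by the template parameter");
static_assert(SimpleArray<double>::ITEMSIZE == sizeof(double),
              "ITEMSIZE is a compile-time constant");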
13
Shape and stride
explicit SimpleArray(size_t length)
: m_buffer(buffer_type::construct(length * ITEMSIZE))
, m_shape{length}
, m_stride{1}
, m_body(m_buffer->data<T>())
{}
explicit SimpleArray(shape_type const & shape)
: m_shape(shape)
, m_stride(calc_stride(m_shape))
{
if (!m_shape.empty())
{
m_buffer = buffer_type::construct(
m_shape[0] * m_stride[0] * ITEMSIZE);
m_body = m_buffer->data<T>();
}
}
static shape_type calc_stride(shape_type const & shape)
{
shape_type stride(shape.size());
if (!shape.empty())
{
stride[shape.size() - 1] = 1;
for (size_t it = shape.size() - 1; it > 0; --it)
{
stride[it - 1] = stride[it] * shape[it];
}
}
return stride;
}
One-dimensional shape
Multi-dimensional shape
Stride is the number of elements to be
skipped when the index is changed by 1 in the
corresponding dimension:
[Figure: the m × n array with elements (0,0) through (m,n), illustrating the stride of each dimension.]
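A quick worked example (a standalone mirror of calc_stride using std::vector, so it does not depend on access to the member function shown above): for shape {2, 3, 4} the stride comes out as {12, 4, 1}, i.e. advancing the first index by one skips 3 * 4 = 12 elements.

#include <cassert>
#include <cstddef>
#include <vector>

// Standalone mirror of calc_stride, for illustration only.
std::vector<std::size_t> calc_stride_demo(std::vector<std::size_t> const & shape)
{
    std::vector<std::size_t> stride(shape.size());
    if (!shape.empty())
    {
        stride[shape.size() - 1] = 1;
        for (std::size_t it = shape.size() - 1; it > 0; --it)
        {
            stride[it - 1] = stride[it] * shape[it];
        }
    }
    return stride;
}

int main()
{
    std::vector<std::size_t> const stride = calc_stride_demo({2, 3, 4});
    assert(stride[0] == 12 && stride[1] == 4 && stride[2] == 1);
    return 0;
}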
14
Multi-dimensional accessor
template <typename T>
class SimpleArray
{
...
template <typename... Args>
value_type const & operator()(Args... args) const
{ return *vptr(args...); }
template <typename... Args>
value_type & operator()(Args... args)
{ return *vptr(args...); }
template <typename... Args>
value_type const * vptr(Args... args) const
{ return m_body + buffer_offset(m_stride, args...); }
template <typename... Args>
value_type * vptr(Args... args)
{ return m_body + buffer_offset(m_stride, args...); }
...
};
namespace detail
{
// Use recursion in templates.
template <size_t D, typename S>
size_t buffer_offset_impl(S const &)
{
return 0;
}
template <size_t D, typename S, typename Arg, typename... Args>
size_t buffer_offset_impl
(S const & strides, Arg arg, Args... args)
{
return arg * strides[D] +
buffer_offset_impl<D + 1>(strides, args...);
}
} /* end namespace detail */
template <typename S, typename... Args>
size_t buffer_offset(S const & strides, Args... args)
{
return detail::buffer_offset_impl<0>(strides, args...);
}
Element and pointer accessors implemented with a variadic template.
Recursive implementation (compile-time) to save conditional branching.
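An illustrative expansion (assuming the buffer_offset template above is visible here): with strides {12, 4, 1}, the call below unrolls at compile time to 1*12 + 2*4 + 3*1 = 23, leaving no loop or branch in the generated code.

#include <cassert>
#include <cstddef>
#include <vector>

void offset_example()
{
    std::vector<std::size_t> const strides{12, 4, 1};
    // The recursion over the argument pack expands at compile time:
    // 1*strides[0] + 2*strides[1] + 3*strides[2] == 23.
    assert(buffer_offset(strides, 1, 2, 3) == 23);
}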
15
Ghost (negative) index
• Ghost: elements indexed by negative integer
• Similar to the negative index for the POD array
• No overhead for normal arrays that start with 0 index
// C-style POD array.
int32_t data[100];
// Make a pointer to the head address of the array.
int32_t * pdata = data;
// Make another pointer to the 50-th element from the head of the array.
int32_t * odata = pdata + 50;
[Figure: pdata points to the head of data and odata points to element 50; element 40 can be reached as data[40] == *(data+40) or, through the offset pointer, as odata[-10] == *(odata-10).]
16
Get ghost information on construction
template <typename T>
class SimpleArray
{
private:
// Number of ghost elements
size_t m_nghost = 0;
// Starting address of non-ghost
value_type * m_body = nullptr;
};
static T * calc_body(T * data, shape_type const & stride,
size_t nghost)
{
if (nullptr == data || stride.empty() || 0 == nghost)
{
// Do nothing.
}
else
{
shape_type shape(stride.size(), 0);
shape[0] = nghost;
data += buffer_offset(stride, shape);
}
return data;
}
SimpleArray(SimpleArray const & other)
: m_buffer(other.m_buffer->clone())
, m_shape(other.m_shape)
, m_stride(other.m_stride)
, m_nghost(other.m_nghost)
, m_body(calc_body(m_buffer->data<T>(),
m_stride, other.m_nghost))
{}
SimpleArray(SimpleArray && other) noexcept
: m_buffer(std::move(other.m_buffer))
, m_shape(std::move(other.m_shape))
, m_stride(std::move(other.m_stride))
, m_nghost(other.m_nghost)
, m_body(other.m_body)
{}
Keep the ghost number and body address. Calculate the body address. Assign ghost and body information in the constructors.
No additional overhead during element access.
17
Make your own small vector
template <typename T, size_t N = 3>
class small_vector
{
public:
T const * data() const { return m_head; }
T * data() { return m_head; }
private:
T * m_head = nullptr;
unsigned int m_size = 0;
unsigned int m_capacity = N;
std::array<T, N> m_data;
};
explicit small_vector(size_t size)
: m_size(static_cast<unsigned int>(size))
{
if (m_size > N)
{
m_capacity = m_size;
// Allocate only when the size is too large.
m_head = new T[m_capacity];
}
else
{
m_capacity = N;
m_head = m_data.data();
}
}
small_vector() { m_head = m_data.data(); }
~small_vector()
{
if (m_head != m_data.data() && m_head != nullptr)
{
delete[] m_head;
m_head = nullptr;
}
}
Spend a small amount of memory to reduce dynamic memory allocation.
Allocate memory only when the size is too large.
No deallocation is needed if no allocation happened.
[Figure: std::vector starts with an empty data pointer and points to a dynamically allocated buffer after push-back; (our) small_vector points to a preallocated buffer initially and after the first couple of push-backs.]
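The annotations above mention push-back but the slide does not show it; below is a hedged sketch (my own, not the actual implementation) of what a small_vector::push_back could look like under the same layout: keep using the preallocated storage until the capacity is exceeded, then grow into a heap buffer.

// Hypothetical member function for the small_vector above (illustration only).
// Requires <algorithm> for std::copy_n.
void push_back(T const & value)
{
    if (m_size == m_capacity)
    {
        // Storage is full: grow into a larger heap buffer.
        unsigned int const new_capacity = m_capacity * 2;
        T * new_head = new T[new_capacity];
        std::copy_n(m_head, m_size, new_head);
        if (m_head != m_data.data()) { delete[] m_head; } // free old heap buffer
        m_head = new_head;
        m_capacity = new_capacity;
    }
    m_head[m_size] = value;
    ++m_size;
}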
18
Small vector, cont.
Copy constructor may also allocate memory.
Move constructor should not allocate memory.
small_vector(small_vector const & other)
: m_size(other.m_size)
{
if (other.m_head == other.m_data.data())
{
m_capacity = N;
m_head = m_data.data();
}
else
{
m_capacity = m_size;
m_head = new T[m_capacity];
}
std::copy_n(other.m_head, m_size, m_head);
}
small_vector(small_vector && other) noexcept
: m_size(other.m_size)
{
if (other.m_head == other.m_data.data())
{
m_capacity = N;
std::copy_n(other.m_data.begin(), m_size,
m_data.begin());
m_head = m_data.data();
}
else
{
m_capacity = m_size;
m_head = other.m_head;
other.m_size = 0;
other.m_capacity = N;
other.m_head = other.m_data.data();
}
}
19
Constructors are the key to making C++ work
Numpy for dense array
• The numpy library provides everything we need for arrays in Python
• Convention of using numpy:
• The created array copies the Python data:
>>> import numpy as np
>>> lst = [1, 1, 2, 3, 5]
>>> print('A list:', lst)
A list: [1, 1, 2, 3, 5]
>>> array = np.array(lst)
>>> print('An array:', np.array(array))
An array: [1 1 2 3 5]
Create a list: Create an array:
>>> array[2] = 8
>>> print(array) # The array changes.
[1 1 8 3 5]
>>> print(lst) # The input sequence is not changed.
[1, 1, 2, 3, 5]
20
Python is slow
[Figure: grid points covering the computing domain (and solution); boundary values u = 0 on three sides and u(y) = sin(πy) on the remaining side; neighboring grid indices (i ± 1, j ± 1) around a point (i, j).]
Equation for the domain interior:
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0 < x < 1;\ 0 < y < 1)$$
Boundary conditions:
$$u(0, y) = 0, \quad u(1, y) = \sin(\pi y) \quad (0 \le y \le 1)$$
$$u(x, 0) = 0, \quad u(x, 1) = 0 \quad (0 \le x \le 1)$$
Use the point-Jacobi method to march:
$$u^{n+1}(x_i, y_j) = \frac{u^n(x_{i+1}, y_j) + u^n(x_{i-1}, y_j) + u^n(x_i, y_{j+1}) + u^n(x_i, y_{j-1})}{4}$$
Let us show how slow Python is by solving the Laplace equation, a simple PDE
21
Python vs numpy
def solve_array():
u = uoriginal.copy() # Input from outer scope
un = u.copy() # Create the buffer for the next time step
converged = False
step = 0
while not converged:
step += 1
un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] +
u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
norm = np.abs(un-u).max()
u[...] = un[...]
converged = True if norm < 1.e-5 else False
return u, step, norm
def solve_python_loop():
u = uoriginal.copy() # Input from outer scope
un = u.copy() # Create the buffer for the next time step
converged = False
step = 0
# Outer loop.
while not converged:
step += 1
# Inner loops. One for x and the other for y.
for it in range(1, nx-1):
for jt in range(1, nx-1):
un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4
norm = np.abs(un-u).max()
u[...] = un[...]
converged = True if norm < 1.e-5 else False
return u, step, norm
>>> with Timer():
>>> u, step, norm = solve_array()
Wall time: 0.0552309 s
>>> with Timer():
>>> u, step, norm = solve_python_loop()
Wall time: 4.79688 s
numpy array for point-Jacobi
python loop for point-Jacobi
The array version is 87x
faster than the loop version
22
Speed up by C++
std::tuple<modmesh::SimpleArray<double>, size_t, double>
solve1(modmesh::SimpleArray<double> u)
{
const size_t nx = u.shape(0);
modmesh::SimpleArray<double> un = u;
bool converged = false;
size_t step = 0;
double norm;
while (!converged)
{
norm = 0.0;
++step;
for (size_t it=1; it<nx-1; ++it)
{
for (size_t jt=1; jt<nx-1; ++jt)
{
un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
double const v = std::abs(un(it,jt) - u(it,jt));
if (v > norm)
{
norm = v;
}
}
}
// additional loop slows down: norm = calc_norm_amax(u, un);
if (norm < 1.e-5) { converged = true; }
u = un;
}
return std::make_tuple(u, step, norm);
}
// Expose the C++ helper to Python
PYBIND11_MODULE(solve_cpp, m)
{
// Boilerplate for using the SimpleArray with Python
{
modmesh::python::import_numpy();
modmesh::python::wrap_ConcreteBuffer(m);
modmesh::python::wrap_SimpleArray(m);
}
m.def
(
"solve_cpp"
, [](pybind11::array_t<double> & uin)
{
auto ret = solve1(modmesh::python::makeSimpleArray(uin));
return std::make_tuple(
modmesh::python::to_ndarray(std::get<0>(ret)),
std::get<1>(ret),
std::get<2>(ret));
}
);
}
>>> with Timer():
>>> u, step, norm = solve_cpp.solve_cpp(uoriginal)
Wall time: 0.0251369 s
C++ loop for point-Jacobi exposed to Python
23
Runtime and maintainability
24
for it in range(1, nx-1):
for jt in range(1, nx-1):
un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4
un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] +
u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
for (size_t it=1; it<nx-1; ++it)
{
for (size_t jt=1; jt<nx-1; ++jt)
{
un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
}
}
Point-Jacobi method:
$$u^{n+1}(x_i, y_j) = \frac{u^n(x_{i+1}, y_j) + u^n(x_{i-1}, y_j) + u^n(x_i, y_{j+1}) + u^n(x_i, y_{j-1})}{4}$$
Python nested loop: 4.797 s (1x)
Numpy array: 0.055 s (87x)
C++ nested loop: 0.025 s (192x)
Conversion between Python containers and C++
• Not really "conversion" or "translation", but data copy
Reconstruct from Python to C++: a Python list holds references to values that are randomly allocated; the values are copied one by one into a C++ std::vector<double>.
Reconstruct from C++ to Python: the values of a std::vector<double> are copied back into a Python list of references; sequential construction makes a more regular memory layout.
[Figure: the two copy directions between a Python list (references to scattered values) and a C++ std::vector<double> (one contiguous buffer).]
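A minimal sketch of that copy (my addition; the module name copy_demo and the function sum_list are made up): pybind11's stl.h converter builds a brand-new std::vector<double> from the Python list, so the values are copied rather than shared.

#include <pybind11/pybind11.h>
#include <pybind11/stl.h> // enables list <-> std::vector conversion (by copy)
#include <vector>

namespace py = pybind11;

// Each call converts the incoming Python list into a new std::vector; mutating
// vec would not touch the original Python objects.
double sum_list(std::vector<double> vec)
{
    double total = 0.0;
    for (double v : vec) { total += v; }
    return total;
}

PYBIND11_MODULE(copy_demo, m)
{
    m.def("sum_list", &sum_list);
}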
25
Zero-copy to speed up
[Figure: a memory buffer a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn shared across languages, with a Python app using an ndarray and a C++ app using a C++ container on the same buffer.]
👍 bottom (C++) - up (Python) design: the C++ container manages the shared buffer and the ndarray accesses it.
👎 top (Python) - down (C++) design: the ndarray manages the shared buffer and the C++ container accesses it.
A Python-controlled lifecycle is bad for a long-running process; C++ allows fine-grained control of all resources.
26
Buffer interface and numpy ndarray
template <typename S>
std::enable_if_t<is_simple_array_v<S>, pybind11::array>
to_ndarray(S && sarr)
{
namespace py = pybind11;
using T = typename std::remove_reference_t<S>::value_type;
std::vector<size_t> const shape(sarr.shape().begin(),
sarr.shape().end());
std::vector<size_t> stride(sarr.stride().begin(),
sarr.stride().end());
for (size_t & v : stride) { v *= sarr.itemsize(); }
return py::array(
/* Numpy dtype */
py::detail::npy_format_descriptor<T>::dtype(),
/* Buffer dimensions */
shape,
/* Strides (in bytes) for each index */
stride,
/* Pointer to buffer */
sarr.data(),
/* Owning Python object */
py::cast(sarr.buffer().shared_from_this()));
}
template <typename T>
static SimpleArray<T> makeSimpleArray(
pybind11::array_t<T> & ndarr)
{
typename SimpleArray<T>::shape_type shape;
for (ssize_t i = 0; i < ndarr.ndim(); ++i)
{
shape.push_back(ndarr.shape(i));
}
std::shared_ptr<ConcreteBuffer> const buffer =
ConcreteBuffer::construct(
ndarr.nbytes(),
ndarr.mutable_data(),
std::make_unique<ConcreteBufferNdarrayRemover>(
ndarr));
return SimpleArray<T>(shape, buffer);
}
Take the SimpleArray buffer to make an ndarray; take the ndarray buffer to make a SimpleArray.
27
Conclusion
It's important to always keep Python:
• Think in arrays for speed
• Speed is the king
• Don't be afraid of C++
• Always keep Python -- it's your UI
28