parallel: Device-Level Parallel Algorithms
The cuda.cccl.parallel library provides device-level algorithms that operate on entire arrays or ranges of data. These algorithms are designed to be easy to use from Python while delivering the performance of hand-optimized CUDA kernels that are portable across different GPU architectures.
Algorithms
The core functionality provided by the parallel library is a set of algorithms such as reductions, scans, sorts, and transforms.
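The examples in this section use NumPy for host data, CuPy for device arrays, and functools for host-side verification. They assume imports along the following lines; the exact module path for the parallel namespace is an assumption here and may differ between cuda.cccl releases, so adjust it to match your installed version.

import functools

import cupy as cp
import numpy as np

# Assumption: algorithms, iterators, and gpu_struct are exposed under the
# experimental namespace of cuda.cccl.parallel; check your installed version.
import cuda.cccl.parallel.experimental as parallel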
Here’s a simple example showing how to use the reduce_into algorithm to sum an array of integers.
def sum_reduction_example():
    """Sum all values in an array using reduction."""

    def add_op(a, b):
        return a + b

    dtype = np.int32
    h_init = np.array([0], dtype=dtype)
    d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
    d_output = cp.empty(1, dtype=dtype)

    # Instantiate reduction
    reduce_into = parallel.reduce_into(d_input, d_output, add_op, h_init)

    # Determine temporary device storage requirements
    temp_storage_size = reduce_into(None, d_input, d_output, len(d_input), h_init)

    # Allocate temporary storage
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

    # Run reduction
    reduce_into(d_temp_storage, d_input, d_output, len(d_input), h_init)

    expected_output = 15  # 1 + 2 + 3 + 4 + 5
    assert (d_output == expected_output).all()
    print(f"Sum: {d_output[0]}")
    return d_output[0]
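Note the two-phase calling convention: calling the algorithm with None as the temporary storage argument returns the number of scratch bytes required, after which the caller allocates that storage and calls again to actually run the reduction. Every example in this section repeats these steps, so a small helper (hypothetical, not part of the library, shown only to make the pattern explicit) can wrap them:

def run_two_phase(alg, d_in, d_out, num_items, h_init):
    # First call: query how many bytes of temporary device storage are needed.
    temp_storage_size = alg(None, d_in, d_out, num_items, h_init)
    # Allocate the scratch space on the device.
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
    # Second call: run the algorithm for real.
    alg(d_temp_storage, d_in, d_out, num_items, h_init)
    return d_out

With such a helper, the example above reduces to instantiating reduce_into and calling run_two_phase(reduce_into, d_input, d_output, len(d_input), h_init).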
Iterators
Algorithms can be used not just on arrays, but also on iterators. Iterators provide a way to represent sequences of data without needing to allocate memory for them.
Here’s an example showing how to use reduction with a CountingIterator that generates a sequence of numbers starting from a specified value.
def counting_iterator_example():
    """Demonstrate reduction with a counting iterator."""

    def add_op(a, b):
        return a + b

    first_item = 10
    num_items = 3

    first_it = parallel.CountingIterator(np.int32(first_item))  # Input sequence
    h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
    d_output = cp.empty(1, dtype=np.int32)  # Storage for output

    # Instantiate reduction, determine storage requirements, and allocate storage
    reduce_into = parallel.reduce_into(first_it, d_output, add_op, h_init)
    temp_storage_size = reduce_into(None, first_it, d_output, num_items, h_init)
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

    # Run reduction
    reduce_into(d_temp_storage, first_it, d_output, num_items, h_init)

    expected_output = functools.reduce(
        lambda a, b: a + b, range(first_item, first_item + num_items)
    )
    assert (d_output == expected_output).all()
    print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")
    return d_output[0]
Iterators also provide a way to compose operations. Here’s an example showing how to use reduce_into with a TransformIterator that applies a transformation to each element of a sequence before it is reduced; in this case, every even value is negated before summing.
def transform_iterator_example():
    """Demonstrate reduction with a transform iterator."""

    def add_op(a, b):
        return a + b

    def transform_op(a):
        return -a if a % 2 == 0 else a

    first_item = 10
    num_items = 100

    transform_it = parallel.TransformIterator(
        parallel.CountingIterator(np.int32(first_item)), transform_op
    )  # Input sequence
    h_init = np.array([0], dtype=np.int64)  # Initial value for the reduction
    d_output = cp.empty(1, dtype=np.int64)  # Storage for output

    # Instantiate reduction, determine storage requirements, and allocate storage
    reduce_into = parallel.reduce_into(transform_it, d_output, add_op, h_init)
    temp_storage_size = reduce_into(None, transform_it, d_output, num_items, h_init)
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

    # Run reduction
    reduce_into(d_temp_storage, transform_it, d_output, num_items, h_init)

    expected_output = functools.reduce(
        lambda a, b: a + b,
        [-a if a % 2 == 0 else a for a in range(first_item, first_item + num_items)],
    )

    # Test assertions
    assert (d_output == expected_output).all()
    assert d_output[0] == expected_output
    print(f"Transform iterator result: {d_output[0]} (expected: {expected_output})")
    return d_output[0]
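The same composition expresses other derived reductions simply by swapping the transformation. For instance, a sum of squares only needs a different transform_op. The following sketch reuses the calls shown above and is otherwise an illustration, not an additional library feature:

def sum_of_squares_example():
    """Sum the squares of first_item, first_item + 1, ... using a TransformIterator."""

    def add_op(a, b):
        return a + b

    def square_op(a):
        return a * a

    first_item = 10
    num_items = 100

    squares_it = parallel.TransformIterator(
        parallel.CountingIterator(np.int32(first_item)), square_op
    )
    h_init = np.array([0], dtype=np.int64)
    d_output = cp.empty(1, dtype=np.int64)

    # Same two-phase pattern as in the previous examples.
    reduce_into = parallel.reduce_into(squares_it, d_output, add_op, h_init)
    temp_storage_size = reduce_into(None, squares_it, d_output, num_items, h_init)
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
    reduce_into(d_temp_storage, squares_it, d_output, num_items, h_init)

    expected = sum(a * a for a in range(first_item, first_item + num_items))
    assert d_output[0] == expected
    return d_output[0]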
Custom Types
The parallel library supports defining custom data types using the gpu_struct decorator. Here are some examples showing how to define and use custom types:
def pixel_reduction_example():
    """Demonstrate reduction with a custom Pixel struct to find the maximum green value."""

    @parallel.gpu_struct
    class Pixel:
        r: np.int32
        g: np.int32
        b: np.int32

    def max_g_value(x, y):
        return x if x.g > y.g else y

    # Create random RGB data
    d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
    d_out = cp.empty(1, Pixel.dtype)

    h_init = Pixel(0, 0, 0)

    reduce_into = parallel.reduce_into(d_rgb, d_out, max_g_value, h_init)
    temp_storage_bytes = reduce_into(None, d_rgb, d_out, d_rgb.size, h_init)
    d_temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
    _ = reduce_into(d_temp_storage, d_rgb, d_out, d_rgb.size, h_init)

    # Verify result
    h_rgb = d_rgb.get()
    expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]
    assert expected["g"] == d_out.get()["g"]
    print(f"Maximum green value: {d_out.get()['g']}")
    return d_out.get()
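Custom types are also convenient as reduction accumulators. The following sketch follows the same gpu_struct and reduce_into pattern as the Pixel example to compute the minimum and maximum of an array in a single pass; the struct name, fields, and input layout are illustrative assumptions rather than part of the library.

def minmax_reduction_example():
    """Compute the min and max of an array in one reduction pass."""

    @parallel.gpu_struct
    class MinMax:
        min_val: np.float64
        max_val: np.float64

    def minmax_op(a, b):
        return MinMax(min(a.min_val, b.min_val), max(a.max_val, b.max_val))

    # Each input element carries the same value in both fields, so the
    # reduction tracks the running minimum and maximum simultaneously.
    h_vals = np.random.rand(64)
    h_pairs = np.stack([h_vals, h_vals], axis=1)
    d_in = cp.asarray(h_pairs).view(MinMax.dtype)
    d_out = cp.empty(1, MinMax.dtype)

    h_init = MinMax(np.inf, -np.inf)

    reduce_into = parallel.reduce_into(d_in, d_out, minmax_op, h_init)
    temp_storage_bytes = reduce_into(None, d_in, d_out, d_in.size, h_init)
    d_temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
    reduce_into(d_temp_storage, d_in, d_out, d_in.size, h_init)

    result = d_out.get()
    assert np.isclose(result["min_val"], h_vals.min())
    assert np.isclose(result["max_val"], h_vals.max())
    return result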
Example Collections
For complete runnable examples and more advanced usage patterns, see our full collection of examples.