parallel: Device-Level Parallel Algorithms

The cuda.cccl.parallel library provides device-level algorithms that operate on entire arrays or ranges of data. These algorithms are designed to be easy to use from Python while delivering the performance of hand-optimized CUDA kernels, and they are portable across different GPU architectures.
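
The examples in this section assume a few standard imports, sketched below. The exact module path for the parallel library is an assumption and may differ between cuda.cccl releases, so check the installation documentation for your version.

import functools

import cupy as cp
import numpy as np

# Assumed import path; adjust to match your installed cuda.cccl version.
import cuda.cccl.parallel.experimental as parallel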

Algorithms

The core functionality provided by the parallel library is a set of algorithms such as reductions, scans, sorts, and transforms.

Here’s a simple example showing how to use the reduce_into algorithm to reduce an array of integers.

Basic reduction example. View complete source on GitHub
def sum_reduction_example():
    """Sum all values in an array using reduction."""

    def add_op(a, b):
        return a + b

    dtype = np.int32
    h_init = np.array([0], dtype=dtype)
    d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
    d_output = cp.empty(1, dtype=dtype)

    # Instantiate reduction
    reduce_into = parallel.reduce_into(d_input, d_output, add_op, h_init)

    # Determine temporary device storage requirements
    temp_storage_size = reduce_into(None, d_input, d_output, len(d_input), h_init)

    # Allocate temporary storage
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

    # Run reduction
    reduce_into(d_temp_storage, d_input, d_output, len(d_input), h_init)

    expected_output = 15  # 1+2+3+4+5
    assert (d_output == expected_output).all()
    print(f"Sum: {d_output[0]}")
    return d_output[0]
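
The same two-phase calling pattern, one call with None to query the required temporary-storage size followed by a second call that performs the work, applies to any associative binary operator. As a minimal sketch (assuming the same imports as above and an int32 device array d_input), a maximum reduction only changes the operator and the initial value:

def max_reduction_sketch(d_input):
    """Find the maximum of d_input using the same pattern as above (sketch)."""

    def max_op(a, b):
        return a if a > b else b

    # Start from the smallest int32 so any input element wins the comparison.
    h_init = np.array([np.iinfo(np.int32).min], dtype=np.int32)
    d_output = cp.empty(1, dtype=np.int32)

    reduce_into = parallel.reduce_into(d_input, d_output, max_op, h_init)
    temp_storage_size = reduce_into(None, d_input, d_output, len(d_input), h_init)
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
    reduce_into(d_temp_storage, d_input, d_output, len(d_input), h_init)
    return d_output[0]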

Iterators

Algorithms can be used not just on arrays, but also on iterators. Iterators provide a way to represent sequences of data without needing to allocate memory for them.

Here’s an example showing how to use reduction with a CountingIterator that generates a sequence of numbers starting from a specified value.

Counting iterator example. View complete source on GitHub
def counting_iterator_example():
    """Demonstrate reduction with counting iterator."""

    def add_op(a, b):
        return a + b

    first_item = 10
    num_items = 3

    first_it = parallel.CountingIterator(np.int32(first_item))  # Input sequence
    h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
    d_output = cp.empty(1, dtype=np.int32)  # Storage for output

    # Instantiate reduction, determine storage requirements, and allocate storage
    reduce_into = parallel.reduce_into(first_it, d_output, add_op, h_init)
    temp_storage_size = reduce_into(None, first_it, d_output, num_items, h_init)
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

    # Run reduction
    reduce_into(d_temp_storage, first_it, d_output, num_items, h_init)

    expected_output = functools.reduce(
        lambda a, b: a + b, range(first_item, first_item + num_items)
    )
    assert (d_output == expected_output).all()
    print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")
    return d_output[0]
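
For comparison, the same result can be obtained by first materializing the sequence as a device array; the iterator version avoids that allocation entirely. A minimal sketch, reusing add_op, h_init, d_output, first_item, and num_items from the example above:

# Equivalent reduction over a materialized array (allocates num_items elements).
d_input = cp.arange(first_item, first_item + num_items, dtype=np.int32)
reduce_into = parallel.reduce_into(d_input, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, d_input, d_output, num_items, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reduce_into(d_temp_storage, d_input, d_output, num_items, h_init)
assert d_output[0] == sum(range(first_item, first_item + num_items))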

Iterators also provide a way to compose operations. Here’s an example showing how to use reduce_into with a TransformIterator that applies a transformation to each element of a counting sequence before the reduction; in this case, even values are negated before being summed.

Transform iterator example. View complete source on GitHub
def transform_iterator_example():
    """Demonstrate reduction with transform iterator."""

    def add_op(a, b):
        return a + b

    def transform_op(a):
        return -a if a % 2 == 0 else a

    first_item = 10
    num_items = 100

    transform_it = parallel.TransformIterator(
        parallel.CountingIterator(np.int32(first_item)), transform_op
    )  # Input sequence
    h_init = np.array([0], dtype=np.int64)  # Initial value for the reduction
    d_output = cp.empty(1, dtype=np.int64)  # Storage for output

    # Instantiate reduction, determine storage requirements, and allocate storage
    reduce_into = parallel.reduce_into(transform_it, d_output, add_op, h_init)
    temp_storage_size = reduce_into(None, transform_it, d_output, num_items, h_init)
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

    # Run reduction
    reduce_into(d_temp_storage, transform_it, d_output, num_items, h_init)

    expected_output = functools.reduce(
        lambda a, b: a + b,
        [-a if a % 2 == 0 else a for a in range(first_item, first_item + num_items)],
    )

    print(f"Transform iterator result: {d_output[0]} (expected: {expected_output})")
    assert d_output[0] == expected_output
    return d_output[0]
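
Because the transform is an arbitrary Python function, the same composition can express other derived sequences. For instance, wrapping a CountingIterator in a TransformIterator with a squaring operation yields a sum of squares; a minimal sketch, assuming the same imports and add_op as above:

def square_op(a):
    return a * a

num_items = 10
squares_it = parallel.TransformIterator(
    parallel.CountingIterator(np.int32(1)), square_op
)  # 1, 4, 9, ...
h_init = np.array([0], dtype=np.int64)
d_output = cp.empty(1, dtype=np.int64)

reduce_into = parallel.reduce_into(squares_it, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, squares_it, d_output, num_items, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reduce_into(d_temp_storage, squares_it, d_output, num_items, h_init)

assert d_output[0] == 385  # 1 + 4 + 9 + ... + 100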

Custom Types

The parallel library supports custom data types defined with the gpu_struct decorator. Here’s an example showing how to define and use a custom type:

Custom type reduction example. View complete source on GitHub
def pixel_reduction_example():
    """Demonstrate reduction with custom Pixel struct to find maximum green value."""

    @parallel.gpu_struct
    class Pixel:
        r: np.int32
        g: np.int32
        b: np.int32

    def max_g_value(x, y):
        return x if x.g > y.g else y

    # Create random RGB data
    d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
    d_out = cp.empty(1, Pixel.dtype)

    h_init = Pixel(0, 0, 0)

    reduce_into = parallel.reduce_into(d_rgb, d_out, max_g_value, h_init)
    temp_storage_bytes = reduce_into(None, d_rgb, d_out, d_rgb.size, h_init)

    d_temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
    _ = reduce_into(d_temp_storage, d_rgb, d_out, d_rgb.size, h_init)

    # Verify result
    h_rgb = d_rgb.get()
    expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]

    assert expected["g"] == d_out.get()["g"]
    print(f"Maximum green value: {d_out.get()['g']}")
    return d_out.get()
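
The .view(Pixel.dtype) call above reinterprets a (10, 3) int32 array as ten Pixel records without copying, since Pixel.dtype behaves like a NumPy structured dtype with fields r, g, and b (the host-side verification code already relies on this). A small host-side sketch of the same reinterpretation, assuming the Pixel class defined above:

# Reinterpret an (N, 3) int32 array as N structured Pixel records (no copy).
h_rgb_flat = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)
h_pixels = h_rgb_flat.view(Pixel.dtype)  # shape (2, 1) with fields "r", "g", "b"
assert h_pixels["g"][0, 0] == 2
assert h_pixels["g"][1, 0] == 5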

Example Collections

For complete runnable examples and more advanced usage patterns, see our full collection of examples.

External API References