
Conversation

@maxymnaumchyk
Collaborator

Hello @shwina!

I've been trying to profile the function that I wrote. Unfortunately, it is currently running slower than the one awkward already has (which uses raw CUDA kernels). For example, here is the cccl_argmax function that I wrote, run on a randomly generated awkward array:
[screenshot]
It takes ~65ms per run.
Here is the same array processed by the awkward.argmax() function:
[screenshot]

If I try to profile cccl_argmax line by line, I see that most of the time is spent on the ak.local_index() function that I use to get the local indices of all the values in an array.
[screenshot]

Now, what is strange is that if I profile just ak.local_index() by itself, it runs much faster (1.84 ms compared to 10.7 ms inside cccl_argmax):

[screenshot]

I've been trying to profile these functions using NVIDIA Nsight, but I don't see anything there. Maybe you can spot something that could be slowing the whole thing down. Any suggestions (including other tools I could use for profiling) would be greatly appreciated, as I'm a bit stuck here now.

@github-actions

github-actions bot commented Dec 5, 2025

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3763

@ianna
Member

ianna commented Dec 5, 2025

@maxymnaumchyk - could you please post the profiling of ak.local_index itself, the one I think you have shown me? Thanks.

@maxymnaumchyk
Collaborator Author

Sure, here:
[screenshot]

Line 861 is where the localindex cuda_kernel is called.

@shwina
Contributor

shwina commented Dec 5, 2025

Thanks for the ping! Taking a look now and I'll get back to you soon.

@shwina
Contributor

shwina commented Dec 5, 2025

Thanks @maxymnaumchyk - you've run into a known issue with CUB's DeviceSegmentedReduce function - which is what segmented_reduce calls down to: NVIDIA/cccl#6171.

We're going to prioritize this right away and get back to you here. In the meantime, let me see if we can suggest workarounds.

@maxymnaumchyk
Collaborator Author

Thank you for the feedback @shwina! How did you pin down that the issue is related to the DeviceSegmentedReduce function?

@shwina
Contributor

shwina commented Dec 8, 2025

Thanks, @maxymnaumchyk - here are some steps you can use to hopefully reproduce the results I'm seeing.

First, I modified your script a bit for benchmarking:

https://gist.github.com/shwina/9ef807626377c3839ca8266a7de84720#file-awkward_argmax_reduction-_script-py

Notably:

  • I added some cp.cuda.Device().synchronize() calls to ensure that the asynchronous execution of CUDA kernels does not affect the benchmarking
  • I added an nvtx annotation around the section of code we are interested in studying (a sketch of both changes follows this list)
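
A minimal sketch of those two changes, assuming cccl_argmax and awkward_array are defined as in the script (the gist above is the authoritative version):

import cupy as cp
import nvtx

# Warm up once so first-call compilation overhead is not measured.
_ = cccl_argmax(awkward_array)
cp.cuda.Device().synchronize()

# Annotate the region of interest so it shows up as an NVTX range in Nsight Systems.
with nvtx.annotate("cccl_argmax"):
    got = cccl_argmax(awkward_array)
    # Kernel launches are asynchronous; synchronize so the measured region
    # covers the actual GPU work, not just the launches.
    cp.cuda.Device().synchronize()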

Then, I ran this script using the Nsight Systems profiler, and inspected the results using the Nsight Systems GUI. You can also use nsightful if you prefer to view the results on your browser. Here's a notebook that you can run to produce and view the profile with nsightful:

https://gist.github.com/shwina/9ef807626377c3839ca8266a7de84720#file-generate_profile-ipynb

Here's what the resulting profile looked like for me; you can see that most of the time is being spent in the DeviceSegmentedReduce kernel - and comparatively little time is being spent in the local_index kernel:

[screenshot]

Now, regarding some of your previous observations --

If I try to profile cccl_argmax line by line, I see that most of the time is spent on the ak.local_index() function that I use to get the local indices of all the values in an array.

My suspicion is that this is due to not having synchronize() after each kernel invocation within the cccl_argmax function. If you add that, do you see more consistent results?
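
For example, here is a rough sketch (not the code from the gist) of timing one step at a time with explicit synchronization; without the synchronize() calls, the cost of an earlier asynchronous kernel gets attributed to whichever later line happens to block:

import time

import cupy as cp

def timed(label, fn, *args, **kwargs):
    # Drain any previously launched kernels so their time is not charged to this step.
    cp.cuda.Device().synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    # Wait for this step's own kernels to finish before stopping the clock.
    cp.cuda.Device().synchronize()
    print(f"{label}: {time.perf_counter() - start:.4f} s")
    return out

# e.g. inside cccl_argmax:
# local_indicies = timed("ak.local_index", ak.local_index, awkward_array, axis=1)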

I've been trying to profile these functions using NVIDIA Nsight but I don't see anything there.

You may have to expand the "CUDA HW" section of the profile to see the call to cub::DeviceSegmentedReduceKernel. Here's what it looks like for me:

[screenshot]

Please let me know if you are still having trouble or not seeing results similar to mine! In the meantime, we're working to improve the performance of cub::DeviceSegmentedReduceKernel for small segment sizes.

@maxymnaumchyk
Collaborator Author

maxymnaumchyk commented Dec 8, 2025

Thanks a lot for the explanation @shwina! Yes, now after adding cp.cuda.Device().synchronize() calls, I get similar results to yours.

[screenshot]

An unrelated question: the cudaDeviceSynchronize process that I see on the host isn't actually doing anything, right? It just makes the host wait until the kernel has finished executing?

@shwina
Contributor

shwina commented Dec 8, 2025

An unrelated question: the cudaDeviceSynchronize process that I see on the host isn't actually doing anything, right? It just makes the host wait until the kernel has finished executing?

Yes -- precisely!
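
A small CuPy illustration of that behaviour (not from the thread; the array size is arbitrary):

import time

import cupy as cp

a = cp.random.random((4000, 4000))

start = time.perf_counter()
b = a @ a                        # launches the matmul kernel and returns immediately
launch_time = time.perf_counter() - start

cp.cuda.Device().synchronize()   # the host blocks here until the GPU work has finished
total_time = time.perf_counter() - start

# launch_time only covers the asynchronous launch;
# total_time also includes the actual GPU execution.
print(launch_time, total_time)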

@shwina
Contributor

shwina commented Dec 19, 2025

Hi @maxymnaumchyk - with the latest cuda-cccl (v0.4.4.4), you can define operations that have side effects. This allows using unary_transform to compute argmax, in the following way:

import cupy as cp
import numpy as np

from cuda.compute import unary_transform, CountingIterator


def cccl_argmax_new(awkward_array):
    input_data = awkward_array.layout.content.data

    # Prepare the start and end offsets
    offsets = awkward_array.layout.offsets.data
    start_o = offsets[:-1]
    end_o = offsets[1:]

    # Prepare the output array
    n_segments = start_o.size
    output = cp.empty([n_segments], dtype=np.int64)

    def segment_reduce_op(segment_id: np.int64) -> np.int64:
        start_idx = start_o[segment_id]
        end_idx = end_o[segment_id]
        segment = input_data[start_idx:end_idx]
        if len(segment) == 0:
            return -1
        return np.argmax(segment)

    segment_ids = CountingIterator(np.int64(0))
    unary_transform(segment_ids, output, segment_reduce_op, n_segments)

    return output

The reason this is much faster than segmented_reduce is that the latter currently uses 1 block per segment. For small segment sizes like you have here, that's not very performant. By contrast, the unary_transform-based approach uses 1 thread per segment, keeping the GPU much busier.

On my machine:

Time taken for ak.argmax: 0.0037847493775188925 seconds
Time taken for cccl_argmax: 0.028108258079737426 seconds
Time taken for cccl_argmax_new: 0.0004994929768145084 seconds
Full Script
import awkward as ak
import cupy as cp
import numpy as np
import nvtx
import timeit

from cuda.compute import unary_transform, CountingIterator, gpu_struct, ZipIterator, segmented_reduce


def cccl_argmax(awkward_array):
    @gpu_struct
    class ak_array:
        data: cp.float64
        local_index: cp.int64

    # compare the values of the arrays
    def max_op(a: ak_array, b: ak_array):
        return a if a.data > b.data else b

    input_data = awkward_array.layout.content.data
    # use an internal awkward function to get the local indices
    local_indicies = ak.local_index(awkward_array, axis=1)
    local_indicies = local_indicies.layout.content.data

    # Combine data and their indices into a single structure
    # input_struct = cp.stack((input_data, parents), axis=1).view(ak_array.dtype)
    input_struct = ZipIterator(input_data, local_indicies)

    # Prepare the start and end offsets
    offsets = awkward_array.layout.offsets.data
    start_o = offsets[:-1]
    end_o = offsets[1:]

    # Prepare the output array
    n_segments = start_o.size
    output = cp.zeros([n_segments], dtype=ak_array.dtype)

    # Initial value for the reduction
    h_init = ak_array(-1, -1)

    # Perform the segmented reduce
    segmented_reduce(
        input_struct, output, start_o, end_o, max_op, h_init, n_segments
    )

    return output


def cccl_argmax_new(awkward_array):
    input_data = awkward_array.layout.content.data

    # Prepare the start and end offsets
    offsets = awkward_array.layout.offsets.data
    start_o = offsets[:-1]
    end_o = offsets[1:]

    # Prepare the output array
    n_segments = start_o.size
    output = cp.empty([n_segments], dtype=np.int64)

    def segment_reduce_op(segment_id: np.int64) -> np.int64:
        start_idx = start_o[segment_id]
        end_idx = end_o[segment_id]
        segment = input_data[start_idx:end_idx]
        if len(segment) == 0:
            return -1
        return np.argmax(segment)

    segment_ids = CountingIterator(np.int64(0))
    unary_transform(segment_ids, output, segment_reduce_op, n_segments)

    return output


print("Loading the array...")
awkward_array = ak.to_backend(ak.from_parquet(
    "random_listoffset_small.parquet"), 'cuda')


# first time, ak.argmax:
_ = ak.argmax(awkward_array, axis=1)  # warmup
start_time = timeit.default_timer()
for i in range(10):
    expect = ak.argmax(awkward_array, axis=1)
    cp.cuda.Device().synchronize()
end_time = timeit.default_timer()
print(f"Time taken for ak.argmax: {(end_time - start_time) / 10} seconds")

# first time, cccl_argmax:
_ = cccl_argmax(awkward_array)  # warmup
start_time = timeit.default_timer()
for i in range(10):
    got = cccl_argmax(awkward_array)
    cp.cuda.Device().synchronize()
end_time = timeit.default_timer()
print(f"Time taken for cccl_argmax: {(end_time - start_time) / 10} seconds")

# check results
assert np.all(ak.to_numpy(ak.to_backend(ak.fill_none(expect, -1), "cpu"))
              == got.get()['local_index'])

# next, time cccl_argmax_new:
_ = cccl_argmax_new(awkward_array)  # warmup
start_time = timeit.default_timer()
for i in range(10):
    got = cccl_argmax_new(awkward_array)
    cp.cuda.Device().synchronize()
end_time = timeit.default_timer()
print(
    f"Time taken for cccl_argmax_new: {(end_time - start_time) / 10} seconds")

# check results
assert np.all(ak.to_numpy(ak.to_backend(ak.fill_none(expect, -1), "cpu"))
              == got.get())

@maxymnaumchyk
Collaborator Author

Thanks a lot @shwina! I'll check it out.
