docs: Add an example argmax reducer using cccl #3763
Conversation
---
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3763
---
@maxymnaumchyk - could you please post the profiling of the […]?
---
Line 861 is where the […]
---
Thanks for the ping! Taking a look now and I'll get back to you soon.
---
Thanks @maxymnaumchyk - you've run into a known issue with CUB's […]. We're going to prioritize this right away and get back to you here. In the meantime, let me see if we can suggest workarounds.
---
Thank you for the feedback @shwina! How did you pin down that the issue is related to […]?
---
Thanks, @maxymnaumchyk - here are some steps you can use to hopefully reproduce the results I'm seeing. First, I modified your script a bit for benchmarking. Notably: […]

Then, I ran this script under the Nsight Systems profiler and inspected the results in the Nsight Systems GUI. You can also use nsightful if you prefer to view the results in your browser. Here's a notebook that you can run to produce and view the profile: https://gist.github.com/shwina/9ef807626377c3839ca8266a7de84720#file-generate_profile-ipynb

Here's what the resulting profile looked like for me; you can see that most of the time is being spent in the […]

Now, regarding some of your previous observations:

- My suspicion is that this is due to not having […]
- You may have to expand the "CUDA HW" section of the profile to see the call to […]

Please let me know if you are still having trouble or not seeing similar results to mine! In the meantime, we're working to improve the performance of […].
---
Thanks a lot for the explanation @shwina! Yes, now after adding […]

An unrelated question: the […]?
---
Yes -- precisely!
---
Hi @maxymnaumchyk - with the latest […]:

```python
def cccl_argmax_new(awkward_array):
    input_data = awkward_array.layout.content.data
    # Prepare the start and end offsets
    offsets = awkward_array.layout.offsets.data
    start_o = offsets[:-1]
    end_o = offsets[1:]
    # Prepare the output array
    n_segments = start_o.size
    output = cp.empty([n_segments], dtype=np.int64)

    def segment_reduce_op(segment_id: np.int64) -> np.int64:
        start_idx = start_o[segment_id]
        end_idx = end_o[segment_id]
        segment = input_data[start_idx:end_idx]
        if len(segment) == 0:
            return -1
        return np.argmax(segment)

    segment_ids = CountingIterator(np.int64(0))
    unary_transform(segment_ids, output, segment_reduce_op, n_segments)
    return output
```

The reason this is much faster than […]. On my machine:

```
Time taken for ak.argmax: 0.0037847493775188925 seconds
Time taken for cccl_argmax: 0.028108258079737426 seconds
Time taken for cccl_argmax_new: 0.0004994929768145084 seconds
```

Full script:

```python
import awkward as ak
import cupy as cp
import numpy as np
import nvtx
import timeit
from cuda.compute import unary_transform, CountingIterator, gpu_struct, ZipIterator, segmented_reduce


def cccl_argmax(awkward_array):
    @gpu_struct
    class ak_array:
        data: cp.float64
        local_index: cp.int64

    # compare the values of the arrays
    def max_op(a: ak_array, b: ak_array):
        return a if a.data > b.data else b

    input_data = awkward_array.layout.content.data
    # use an internal awkward function to get the local indices
    local_indices = ak.local_index(awkward_array, axis=1)
    local_indices = local_indices.layout.content.data
    # Combine data and their indices into a single structure
    # input_struct = cp.stack((input_data, parents), axis=1).view(ak_array.dtype)
    input_struct = ZipIterator(input_data, local_indices)
    # Prepare the start and end offsets
    offsets = awkward_array.layout.offsets.data
    start_o = offsets[:-1]
    end_o = offsets[1:]
    # Prepare the output array
    n_segments = start_o.size
    output = cp.zeros([n_segments], dtype=ak_array.dtype)
    # Initial value for the reduction
    h_init = ak_array(-1, -1)
    # Perform the segmented reduce
    segmented_reduce(
        input_struct, output, start_o, end_o, max_op, h_init, n_segments
    )
    return output


def cccl_argmax_new(awkward_array):
    input_data = awkward_array.layout.content.data
    # Prepare the start and end offsets
    offsets = awkward_array.layout.offsets.data
    start_o = offsets[:-1]
    end_o = offsets[1:]
    # Prepare the output array
    n_segments = start_o.size
    output = cp.empty([n_segments], dtype=np.int64)

    def segment_reduce_op(segment_id: np.int64) -> np.int64:
        start_idx = start_o[segment_id]
        end_idx = end_o[segment_id]
        segment = input_data[start_idx:end_idx]
        if len(segment) == 0:
            return -1
        return np.argmax(segment)

    segment_ids = CountingIterator(np.int64(0))
    unary_transform(segment_ids, output, segment_reduce_op, n_segments)
    return output


print("Loading the array...")
awkward_array = ak.to_backend(ak.from_parquet(
    "random_listoffset_small.parquet"), 'cuda')

# first, time ak.argmax:
_ = ak.argmax(awkward_array, axis=1)  # warmup
start_time = timeit.default_timer()
for i in range(10):
    expect = ak.argmax(awkward_array, axis=1)
cp.cuda.Device().synchronize()
end_time = timeit.default_timer()
print(f"Time taken for ak.argmax: {(end_time - start_time) / 10} seconds")

# next, time cccl_argmax:
_ = cccl_argmax(awkward_array)  # warmup
start_time = timeit.default_timer()
for i in range(10):
    got = cccl_argmax(awkward_array)
cp.cuda.Device().synchronize()
end_time = timeit.default_timer()
print(f"Time taken for cccl_argmax: {(end_time - start_time) / 10} seconds")
# check results
assert np.all(ak.to_numpy(ak.to_backend(ak.fill_none(expect, -1), "cpu"))
              == got.get()['local_index'])

# next, time cccl_argmax_new:
_ = cccl_argmax_new(awkward_array)  # warmup
start_time = timeit.default_timer()
for i in range(10):
    got = cccl_argmax_new(awkward_array)
cp.cuda.Device().synchronize()
end_time = timeit.default_timer()
print(
    f"Time taken for cccl_argmax_new: {(end_time - start_time) / 10} seconds")
# check results
assert np.all(ak.to_numpy(ak.to_backend(ak.fill_none(expect, -1), "cpu"))
              == got.get())
```
---
Thanks a lot @shwina! I'll check it out.





Hello @shwina!
I've been trying to profile this function that I made. Unfortunately, it's now running slower than the one that awkward currently has (that is using raw cuda-kernels). For example here is me running


cccl_argmaxfunction that I wrote, on a randomly generated awkward array:It takes ~65ms per run.
Here is the same array processed by the
awkward.argmax()function:If I try to profile

cccl_argmaxline by line, I see that most of the time is spent on theak.local_index()function that I use to get the local indices of all the values in an array.Now what is strange, if I try to profile just the
ak.local_index()it runs much faster by itself. (1.84ms compared to 10.7ms insidecccl_argmax)I've been trying to profile these function using Nvidia Nsight but I don't see anything there. Maybe you can see something that could be slowing the whole thing down. Any suggestions (including other tools I could use to profile) will be greatly appreciated as I'm a bit stuck here now ~