feat: [cuda] performance improvement reducers for axis=None and lazy parents allocation #3806

Draft

ianna wants to merge 17 commits into scikit-hep:main from ianna:ianna/high_level_cupy_for_min_max_sum_reducers
Conversation

@ianna
Member

@ianna ianna commented Jan 18, 2026

  • Replaced manual CUDA kernel templates with optimized cupy.ufunc.at calls:
    • awkward_reduce_min
    • awkward_reduce_max
    • awkward_reduce_sum
    • awkward_reduce_prod
  • Added a dtype-promotion table that matches CuPy’s ufunc.at support
  • Implemented the above reducers for axis=None
    • $475\times$ performance improvement on GPU
  • Removed parents allocations before calling the kernels

This avoids allocating ak.index.Index64.zeros(layout.length) during the initial stages of reduce. For large arrays, this significantly reduces memory pressure and avoids the $O(N)$ initialization cost.

  • Added resolve_parents to handle the transition between the virtualized (None, length) representation and the materialized Index64 array.
  • Updated reduce to initialize parents using the optimized tuple.
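
A minimal sketch of the lazy-parents idea, using a NumPy stand-in for the Index64 buffer (names follow the PR description; the real helper lives inside Awkward’s reduce machinery):

```python
import numpy as np

def resolve_parents(parents):
    """Materialize parents only when a kernel actually consumes the array."""
    data, length = parents  # virtualized form: (None, length)
    if data is None:
        # the O(N) allocation + zero-initialization happens here, on demand
        # (stand-in for ak.index.Index64.zeros(length))
        return np.zeros(length, dtype=np.int64)
    return data  # already materialized

parents = (None, 1_000_000)       # reduce starts from the cheap virtual pair
index = resolve_parents(parents)  # the O(N) cost is paid only if required
```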

Note: "Refactor C++ kernels" is going to be a separate PR to make sure that the changes are well tested. The reason is that some kernels do not need parents, but their length.

@ianna ianna requested a review from maxymnaumchyk January 18, 2026 18:35
@ianna ianna changed the title feat: implement reducers using cupy.ufunc.at and atomic fallbacks feat: [cuda] implement reducers using cupy.ufunc.at and atomic fallbacks Jan 18, 2026
@codecov

codecov bot commented Jan 18, 2026

Codecov Report

❌ Patch coverage is 69.71429% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.56%. Comparing base (1b7e3d6) to head (417e89a).
⚠️ Report is 25 commits behind head on main.

Files with missing lines                | Patch % | Lines
src/awkward/_connect/cuda/_reducers.py  | 0.00%   | 30 Missing ⚠️
src/awkward/_nplikes/cupy.py            | 4.16%   | 23 Missing ⚠️
Additional details and impacted files
Files with missing lines                   | Coverage Δ
src/awkward/_connect/cuda/__init__.py      | 0.00% <ø> (ø)
src/awkward/_do.py                         | 84.79% <100.00%> (+0.64%) ⬆️
src/awkward/_nplikes/array_module.py       | 95.29% <100.00%> (+0.36%) ⬆️
src/awkward/_reducers.py                   | 98.20% <100.00%> (+0.09%) ⬆️
src/awkward/contents/bytemaskedarray.py    | 88.43% <100.00%> (+0.05%) ⬆️
src/awkward/contents/indexedoptionarray.py | 89.67% <100.00%> (+0.09%) ⬆️
src/awkward/contents/listoffsetarray.py    | 81.42% <100.00%> (+0.28%) ⬆️
src/awkward/contents/numpyarray.py         | 91.41% <100.00%> (+0.09%) ⬆️
src/awkward/contents/regulararray.py       | 87.08% <100.00%> (+0.09%) ⬆️
src/awkward/_nplikes/cupy.py               | 32.50% <4.16%> (-5.89%) ⬇️
... and 1 more

... and 66 files with indirect coverage changes


@github-actions

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3806

@ianna
Member Author

ianna commented Jan 20, 2026

also fixes #3807

from:

Time taken for ak.max on GPU: 0.5229 seconds
Time taken for ak.max on CPU: 0.0042 seconds

to:

Time taken for ak.max on GPU: 0.0011 seconds
Time taken for ak.max on CPU: 0.0015 seconds

@ianna ianna force-pushed the ianna/high_level_cupy_for_min_max_sum_reducers branch from 61d2aed to 2320042 on January 20, 2026 10:26
@ianna
Member Author

ianna commented Jan 20, 2026

performance check for the last commit:

>>> ak.max(gpu_arr)
... result = timeit.timeit(lambda: ak.max(gpu_arr),  number=10)
... print(f"Time taken for ak.max on GPU: {result / 10:.4f} seconds")
... result = timeit.timeit(lambda: ak.max(arr),  number=10)
... print(f"Time taken for ak.max on CPU: {result / 10:.4f} seconds")
... 
Time taken for ak.max on GPU: 0.0011 seconds
Time taken for ak.max on CPU: 0.0015 seconds
>>> ak.min(gpu_arr)
... result = timeit.timeit(lambda: ak.min(gpu_arr),  number=10)
... print(f"Time taken for ak.min on GPU: {result / 10:.4f} seconds")
... result = timeit.timeit(lambda: ak.min(arr),  number=10)
... print(f"Time taken for ak.min on CPU: {result / 10:.4f} seconds")
... 
Time taken for ak.min on GPU: 0.0010 seconds
Time taken for ak.min on CPU: 0.0014 seconds
>>> ak.min(gpu_arr, axis=-1)
... result = timeit.timeit(lambda: ak.min(gpu_arr, axis=-1),  number=10)
... print(f"Time taken for ak.min on GPU: {result / 10:.4f} seconds")
... result = timeit.timeit(lambda: ak.min(arr, axis=-1),  number=10)
... print(f"Time taken for ak.min on CPU: {result / 10:.4f} seconds")
... 
Time taken for ak.min on GPU: 0.0016 seconds
Time taken for ak.min on CPU: 0.0042 seconds
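
For context, the timings above assume a setup along these lines (array size and contents are illustrative, not the exact reproducer):

```python
import timeit
import numpy as np
import awkward as ak

# the same data on the CPU (numpy) and CUDA (cupy) backends
arr = ak.Array(np.random.default_rng(0).random(10_000_000))
gpu_arr = ak.to_backend(arr, "cuda")
```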

@ianna
Member Author

ianna left a comment

As discussed at the last meeting on Friday, we considered using CuPy ufuncs directly for these reducers. Unfortunately, CuPy does not provide atomic or ufunc.at support for int64 in a way that preserves the required semantics, which is why this PR relies on promotion to uint64 instead.

So, to make reducers like sum, prod, and the generic reducers work on GPU at all, I reinterpret int64 values as uint64, perform the operation in that domain, and then reinterpret back. This matches two’s-complement bit patterns but does not preserve ordering semantics for negative values.
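
A minimal NumPy sketch of the reinterpretation (the same bit-level behavior holds for device buffers): addition survives the int64 → uint64 round trip, but ordering does not.

```python
import numpy as np

values = np.array([-3, 7, -1], dtype=np.int64)
as_u64 = values.view(np.uint64)  # same bits, reinterpreted as unsigned

# sum: two's-complement addition equals unsigned addition modulo 2**64,
# so summing in the uint64 domain and viewing back is correct
summed = as_u64.sum(keepdims=True)  # keep it an array so .view() applies
print(summed.view(np.int64)[0])     # 3, matches values.sum()

# max: negative int64 values map to huge uint64 values, so ordering breaks
largest = as_u64.max(keepdims=True)
print(largest.view(np.int64)[0])    # -1, not 7
```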

As a consequence, reducers that depend on ordering or comparisons (min, max, argmin, block-boundary reducers, etc.) can produce incorrect results for int64 on CUDA. This is why we currently see failures such as:

  • test_block_boundary_max
  • test_block_boundary_min
  • test_block_boundary_negative_min
  • test_block_boundary_argmin
  • test_0115_generic_reducer_operation_highlevel_1

These failures are expected with the current approach and stem from the lack of native int64 support in CuPy’s atomic and ufunc.at implementations, not from a logic bug in Awkward itself.

At the moment, this PR prioritizes making GPU ufunc reducers available (even with weakened semantics) rather than raising NotImplementedError for large parts of the reducer API on CUDA.

@shwina - I would very much appreciate guidance on how we want to handle this long-term.

Comment on lines +10 to +23
CUPY_UFUNC_AT_PROMOTION = {
"bool": {"promoted": "uint32", "reinterpret": False},
"int8": {"promoted": "int32", "reinterpret": False},
"uint8": {"promoted": "uint32", "reinterpret": False},
"int16": {"promoted": "int32", "reinterpret": False},
"uint16": {"promoted": "uint32", "reinterpret": False},
"int32": {"promoted": "int32", "reinterpret": False},
"uint32": {"promoted": "uint32", "reinterpret": False},
"int64": {"promoted": "uint64", "reinterpret": True},
"uint64": {"promoted": "uint64", "reinterpret": False},
"float16": {"promoted": "float32", "reinterpret": False},
"float32": {"promoted": "float32", "reinterpret": False},
"float64": {"promoted": "float64", "reinterpret": False},
}
@ianna
Member Author

This PR currently relies on an unsafe promotion from int64 → uint64 in the CUDA backend.

This is intentional and not an oversight.
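
For illustration, a hedged sketch of how a reducer might consult this table around a ufunc.at call (segment_sum and its signature are made up here, and it assumes cupy.add.at is available, as this PR relies on; this is not the PR’s actual code):

```python
import cupy as cp

def segment_sum(data, parents, outlength):
    """Illustrative segmented sum driven by CUPY_UFUNC_AT_PROMOTION."""
    rule = CUPY_UFUNC_AT_PROMOTION[data.dtype.name]
    promoted = cp.dtype(rule["promoted"])
    # reinterpret keeps the bits (int64 -> uint64); astype widens safely
    work = data.view(promoted) if rule["reinterpret"] else data.astype(promoted)
    out = cp.zeros(outlength, dtype=promoted)
    cp.add.at(out, parents, work)  # scatter-add each value into its parent slot
    # undo the promotion on the way out
    return out.view(data.dtype) if rule["reinterpret"] else out.astype(data.dtype)
```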

@ianna
Member Author

ianna commented Jan 21, 2026

> As discussed at the last meeting on Friday, we considered using CuPy ufuncs directly for these reducers. […]
>
> @shwina - I would very much appreciate guidance on how we want to handle this long-term.

To answer my own question: Awkward simply cannot use CuPy ufuncs, because we support a wide variety of dtypes that CuPy does not currently support. CCCL, on the other hand, already allows us to define functions that accept any dtype supported by Awkward.

@ikrommyd
Collaborator

Regarding the axis=None reducers part of this PR: this can be done identically to #3653 for ALL other reducers on ALL backends.
Regarding the nplike changes here: we shouldn't introduce new functionality only on the cupy nplike (like the initial kwarg or similar); we should introduce it to all nplikes in a similar and self-consistent manner.
In general, I don't think we should be introducing cupy-specific changes to the cupy nplike. All nplikes inherit from the general array_module.py nplike, so whatever is common should be common across all nplikes.

cc @pfackeldey since you implemented the original axis none specialization.
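
A sketch of the axis=None shortcut described above, assuming the reduction can simply operate on the fully flattened leaves (illustrative, not the implementation in #3653):

```python
import numpy as np
import awkward as ak

def max_axis_none(array):
    """axis=None needs no parents/starts/shifts/outlength: reduce flat leaves."""
    flat = ak.flatten(array, axis=None)  # collapse all nesting into one buffer
    # hand one contiguous buffer to the backend's reduction (numpy here;
    # the cupy backend would call cp.max on the device buffer instead)
    return np.max(ak.to_numpy(flat))

print(max_axis_none(ak.Array([[1.0, 2.0], [], [7.0, 3.0]])))  # 7.0
```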

@ianna ianna marked this pull request as draft January 22, 2026 15:52
@ikrommyd
Collaborator

On top of what I said above, these reducer specializations don't need parents, starts, shifts, or outlength, so it would be best (as it currently is on the list) not to allocate those at all.

@ianna
Copy link
Member Author

ianna commented Jan 25, 2026

> On top of what I said above, these reducer specializations don't need parents, starts, shifts, or outlength, so it would be best (as it currently is on the list) not to allocate those at all.

Agree.

@ianna ianna force-pushed the ianna/high_level_cupy_for_min_max_sum_reducers branch from 31dd26f to 8a8900a on February 5, 2026 15:10
@ianna ianna force-pushed the ianna/high_level_cupy_for_min_max_sum_reducers branch from c0a287a to 417e89a on February 6, 2026 19:05
@ianna ianna changed the title feat: [cuda] implement reducers using cupy.ufunc.at and atomic fallbacks feat: [cuda] performance improvement reducers for axis=None and lazy parents allocation Feb 6, 2026