feat: [cuda] performance improvement reducers for axis=None and lazy parents allocation #3806

Draft

ianna wants to merge 17 commits into scikit-hep:main from ianna:ianna/high_level_cupy_for_min_max_sum_reducers
Conversation

@ianna
Member

@ianna ianna commented Jan 18, 2026

  • Replaced manual CUDA kernel templates with optimized cupy.ufunc.at calls:
    • awkward_reduce_min
    • awkward_reduce_max
    • awkward_reduce_sum
    • awkward_reduce_prod
  • Added a dtype-promotion table that matches CuPy’s ufunc.at support
  • Implemented the above reducers for axis=None
    • $475\times$ performance improvement on GPU
  • Removed parents allocations before calling the kernels

This avoids allocating ak.index.Index64.zeros(layout.length) during the initial stages of reduce. For large arrays, this significantly reduces memory pressure and avoids the $O(N)$ initialization cost.

  • Added resolve_parents to handle the transition between the virtualized (None, length) representation and the materialized Index64 array.
  • Updated reduce to initialize parents using the optimized tuple.
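
A minimal sketch of the lazy-parents idea, using a NumPy stand-in for the Index64 buffer (names follow the PR description; the real helper lives inside Awkward’s reduce machinery):

```python
import numpy as np

def resolve_parents(parents):
    """Materialize parents only when a kernel actually consumes the array."""
    data, length = parents  # virtualized form: (None, length)
    if data is None:
        # the O(N) allocation + zero-initialization happens here, on demand
        # (stand-in for ak.index.Index64.zeros(length))
        return np.zeros(length, dtype=np.int64)
    return data  # already materialized

parents = (None, 1_000_000)       # reduce starts from the cheap virtual pair
index = resolve_parents(parents)  # the O(N) cost is paid only if required
```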

Note: "Refactor C++ kernels" is going to be a separate PR to make sure that the changes are well tested. The reason is that some kernels do not need parents, but their length.

@ianna ianna requested a review from maxymnaumchyk January 18, 2026 18:35
@ianna ianna changed the title feat: implement reducers using cupy.ufunc.at and atomic fallbacks feat: [cuda] implement reducers using cupy.ufunc.at and atomic fallbacks Jan 18, 2026
@codecov

codecov bot commented Jan 18, 2026

Codecov Report

❌ Patch coverage is 69.71429% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.56%. Comparing base (1b7e3d6) to head (417e89a).
⚠️ Report is 25 commits behind head on main.

Files with missing lines                | Patch % | Lines
src/awkward/_connect/cuda/_reducers.py  | 0.00%   | 30 Missing ⚠️
src/awkward/_nplikes/cupy.py            | 4.16%   | 23 Missing ⚠️
Additional details and impacted files
Files with missing lines                   | Coverage Δ
src/awkward/_connect/cuda/__init__.py      | 0.00% <ø> (ø)
src/awkward/_do.py                         | 84.79% <100.00%> (+0.64%) ⬆️
src/awkward/_nplikes/array_module.py       | 95.29% <100.00%> (+0.36%) ⬆️
src/awkward/_reducers.py                   | 98.20% <100.00%> (+0.09%) ⬆️
src/awkward/contents/bytemaskedarray.py    | 88.43% <100.00%> (+0.05%) ⬆️
src/awkward/contents/indexedoptionarray.py | 89.67% <100.00%> (+0.09%) ⬆️
src/awkward/contents/listoffsetarray.py    | 81.42% <100.00%> (+0.28%) ⬆️
src/awkward/contents/numpyarray.py         | 91.41% <100.00%> (+0.09%) ⬆️
src/awkward/contents/regulararray.py       | 87.08% <100.00%> (+0.09%) ⬆️
src/awkward/_nplikes/cupy.py               | 32.50% <4.16%> (-5.89%) ⬇️
... and 1 more

... and 66 files with indirect coverage changes


@github-actions

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3806

@ianna
Member Author

ianna commented Jan 20, 2026

also fixes #3807

from:

Time taken for ak.max on GPU: 0.5229 seconds
Time taken for ak.max on CPU: 0.0042 seconds

to:

Time taken for ak.max on GPU: 0.0011 seconds
Time taken for ak.max on CPU: 0.0015 seconds

@ianna ianna force-pushed the ianna/high_level_cupy_for_min_max_sum_reducers branch from 61d2aed to 2320042 on January 20, 2026 10:26
@ianna
Member Author

ianna commented Jan 20, 2026

performance check for the last commit:

>>> ak.max(gpu_arr)
... result = timeit.timeit(lambda: ak.max(gpu_arr),  number=10)
... print(f"Time taken for ak.max on GPU: {result / 10:.4f} seconds")
... result = timeit.timeit(lambda: ak.max(arr),  number=10)
... print(f"Time taken for ak.max on CPU: {result / 10:.4f} seconds")
... 
Time taken for ak.max on GPU: 0.0011 seconds
Time taken for ak.max on CPU: 0.0015 seconds
>>> ak.min(gpu_arr)
... result = timeit.timeit(lambda: ak.min(gpu_arr),  number=10)
... print(f"Time taken for ak.min on GPU: {result / 10:.4f} seconds")
... result = timeit.timeit(lambda: ak.min(arr),  number=10)
... print(f"Time taken for ak.min on CPU: {result / 10:.4f} seconds")
... 
Time taken for ak.min on GPU: 0.0010 seconds
Time taken for ak.min on CPU: 0.0014 seconds
>>> ak.min(gpu_arr, axis=-1)
... result = timeit.timeit(lambda: ak.min(gpu_arr, axis=-1),  number=10)
... print(f"Time taken for ak.min on GPU: {result / 10:.4f} seconds")
... result = timeit.timeit(lambda: ak.min(arr, axis=-1),  number=10)
... print(f"Time taken for ak.min on CPU: {result / 10:.4f} seconds")
... 
Time taken for ak.min on GPU: 0.0016 seconds
Time taken for ak.min on CPU: 0.0042 seconds
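
For context, the timings above assume a setup along these lines (array size and contents are illustrative, not the exact reproducer):

```python
import timeit
import numpy as np
import awkward as ak

# the same data on the CPU (numpy) and CUDA (cupy) backends
arr = ak.Array(np.random.default_rng(0).random(10_000_000))
gpu_arr = ak.to_backend(arr, "cuda")
```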

@ianna
Member Author

ianna left a comment

As discussed at the last meeting on Friday, we considered using CuPy ufuncs directly for these reducers. Unfortunately, CuPy does not provide atomic or ufunc.at support for int64 in a way that preserves the required semantics, which is why this PR relies on promotion to uint64 instead.

So, to make reducers like sum, prod, and the generic reducers work on GPU at all, I reinterpret int64 values as uint64, perform the operation in that domain, and then reinterpret back. This matches two’s-complement bit patterns but does not preserve ordering semantics for negative values.
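
A minimal NumPy sketch of the reinterpretation (the same bit-level behavior holds for device buffers): addition survives the int64 → uint64 round trip, but ordering does not.

```python
import numpy as np

values = np.array([-3, 7, -1], dtype=np.int64)
as_u64 = values.view(np.uint64)  # same bits, reinterpreted as unsigned

# sum: two's-complement addition equals unsigned addition modulo 2**64,
# so summing in the uint64 domain and viewing back is correct
summed = as_u64.sum(keepdims=True)  # keep it an array so .view() applies
print(summed.view(np.int64)[0])     # 3, matches values.sum()

# max: negative int64 values map to huge uint64 values, so ordering breaks
largest = as_u64.max(keepdims=True)
print(largest.view(np.int64)[0])    # -1, not 7
```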

As a consequence, reducers that depend on ordering or comparisons (min, max, argmin, block-boundary reducers, etc.) can produce incorrect results for int64 on CUDA. This is why we currently see failures such as:

  • test_block_boundary_max
  • test_block_boundary_min
  • test_block_boundary_negative_min
  • test_block_boundary_argmin
  • test_0115_generic_reducer_operation_highlevel_1

These failures are expected with the current approach and stem from the lack of native int64 support in CuPy’s atomic and ufunc.at implementations, not from a logic bug in Awkward itself.

At the moment, this PR prioritizes making GPU ufunc reducers available (even with weakened semantics) rather than raising NotImplementedError for large parts of the reducer API on CUDA.

@shwina - I would very much appreciate guidance on how we want to handle this long-term.

Comment on lines +10 to +23
CUPY_UFUNC_AT_PROMOTION = {
"bool": {"promoted": "uint32", "reinterpret": False},
"int8": {"promoted": "int32", "reinterpret": False},
"uint8": {"promoted": "uint32", "reinterpret": False},
"int16": {"promoted": "int32", "reinterpret": False},
"uint16": {"promoted": "uint32", "reinterpret": False},
"int32": {"promoted": "int32", "reinterpret": False},
"uint32": {"promoted": "uint32", "reinterpret": False},
"int64": {"promoted": "uint64", "reinterpret": True},
"uint64": {"promoted": "uint64", "reinterpret": False},
"float16": {"promoted": "float32", "reinterpret": False},
"float32": {"promoted": "float32", "reinterpret": False},
"float64": {"promoted": "float64", "reinterpret": False},
}
@ianna
Member Author

This PR currently relies on an unsafe promotion from int64 → uint64 in the CUDA backend.

This is intentional and not an oversight.
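
For illustration, a hedged sketch of how a reducer might consult this table around a ufunc.at call (segment_sum and its signature are made up here, and it assumes cupy.add.at is available, as this PR relies on; this is not the PR’s actual code):

```python
import cupy as cp

def segment_sum(data, parents, outlength):
    """Illustrative segmented sum driven by CUPY_UFUNC_AT_PROMOTION."""
    rule = CUPY_UFUNC_AT_PROMOTION[data.dtype.name]
    promoted = cp.dtype(rule["promoted"])
    # reinterpret keeps the bits (int64 -> uint64); astype widens safely
    work = data.view(promoted) if rule["reinterpret"] else data.astype(promoted)
    out = cp.zeros(outlength, dtype=promoted)
    cp.add.at(out, parents, work)  # scatter-add each value into its parent slot
    # undo the promotion on the way out
    return out.view(data.dtype) if rule["reinterpret"] else out.astype(data.dtype)
```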

@ianna
Member Author

ianna commented Jan 21, 2026

> As discussed at the last meeting on Friday, we considered using CuPy ufuncs directly for these reducers. […]
>
> @shwina - I would very much appreciate guidance on how we want to handle this long-term.

To answer my own question: Awkward simply cannot use CuPy ufuncs, because we support a wide variety of dtypes that CuPy does not currently support. CCCL, on the other hand, already allows us to define functions that accept any dtype supported by Awkward.

@ikrommyd
Collaborator

Regarding the axis=None reducers part of this PR: this can be done identically to #3653 for ALL other reducers on ALL backends.
Regarding the nplike changes here: we shouldn't introduce new functionality only on the cupy nplike (like the initial kwarg or similar); we should introduce it to all nplikes in a similar and self-consistent manner.
In general, I don't think we should be introducing cupy-specific changes to the cupy nplike. All nplikes inherit from the general array_module.py nplike, so whatever is common should be common across all nplikes.

cc @pfackeldey since you implemented the original axis none specialization.
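
A sketch of the axis=None shortcut described above, assuming the reduction can simply operate on the fully flattened leaves (illustrative, not the implementation in #3653):

```python
import numpy as np
import awkward as ak

def max_axis_none(array):
    """axis=None needs no parents/starts/shifts/outlength: reduce flat leaves."""
    flat = ak.flatten(array, axis=None)  # collapse all nesting into one buffer
    # hand one contiguous buffer to the backend's reduction (numpy here;
    # the cupy backend would call cp.max on the device buffer instead)
    return np.max(ak.to_numpy(flat))

print(max_axis_none(ak.Array([[1.0, 2.0], [], [7.0, 3.0]])))  # 7.0
```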

@ianna ianna marked this pull request as draft January 22, 2026 15:52
@ikrommyd
Collaborator

On top of what I said above, these reducer specializations don't need parents, starts, shifts, or outlength, so it would be best (as it currently is on the list) not to allocate those at all.

@ianna
Copy link
Member Author

ianna commented Jan 25, 2026

> On top of what I said above, these reducer specializations don't need parents, starts, shifts, or outlength, so it would be best (as it currently is on the list) not to allocate those at all.

Agree.

@ianna ianna force-pushed the ianna/high_level_cupy_for_min_max_sum_reducers branch from 31dd26f to 8a8900a on February 5, 2026 15:10
@ianna ianna force-pushed the ianna/high_level_cupy_for_min_max_sum_reducers branch from c0a287a to 417e89a on February 6, 2026 19:05
@ianna ianna changed the title feat: [cuda] implement reducers using cupy.ufunc.at and atomic fallbacks feat: [cuda] performance improvement reducers for axis=None and lazy parents allocation Feb 6, 2026