feat: [cuda] performance improvement reducers for axis=None and lazy parents allocation #3806
ianna wants to merge 17 commits into scikit-hep:main
Conversation
Codecov Report: ❌ Patch coverage is
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3806
also fixes #3807
force-pushed from 61d2aed to 2320042
performance check for the last commit:
ianna left a comment:
As discussed at the last meeting on Friday, we considered using CuPy ufuncs directly for these reducers. Unfortunately, CuPy does not provide atomic or ufunc.at support for int64 in a way that preserves the required semantics, which is why this PR relies on promotion to uint64 instead.
So in order to make reducers like sum / prod / generic reducers work on GPU at all, I reinterpret int64 values as uint64 and perform the operation in that domain, then reinterpret back. This matches two’s-complement bit patterns but does not preserve ordering semantics for negative values.
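A minimal NumPy sketch (as a CPU stand-in for the CuPy behavior described above) of why the bit-reinterpretation trick works for sums but breaks ordering:

```python
import numpy as np

# Reinterpreting int64 as uint64 preserves two's-complement bit patterns,
# so modular (wraparound) sums survive the round trip unchanged.
a = np.array([-3, 5, -7], dtype=np.int64)
u = a.view(np.uint64)

# Sum in the uint64 domain, then reinterpret the bits back to int64:
# wraparound arithmetic reproduces the signed sum exactly.
s = np.array([u.sum()], dtype=np.uint64).view(np.int64)[0]
assert s == a.sum() == -5

# Ordering is NOT preserved: negatives map to huge unsigned values,
# so a min taken in the uint64 domain picks the wrong element.
assert int(u.min()) == 5   # unsigned min is 5, but the signed min is -7
assert int(a.min()) == -7
```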
As a consequence, reducers that depend on ordering or comparisons (min, max, argmin, block-boundary reducers, etc.) can produce incorrect results for int64 on CUDA. This is why we currently see failures such as:
test_block_boundary_max
test_block_boundary_min
test_block_boundary_negative_min
test_block_boundary_argmin
test_0115_generic_reducer_operation_highlevel_1
These failures are expected with the current approach and stem from the lack of native int64 support in CuPy’s atomic and ufunc.at implementations, not from a logic bug in Awkward itself.
At the moment, this PR prioritizes making GPU ufunc reducers available (even with weakened semantics) rather than raising NotImplementedError for large parts of the reducer API on CUDA.
@shwina - I would very much appreciate guidance on how we want to handle this long-term.
CUPY_UFUNC_AT_PROMOTION = {
    "bool": {"promoted": "uint32", "reinterpret": False},
    "int8": {"promoted": "int32", "reinterpret": False},
    "uint8": {"promoted": "uint32", "reinterpret": False},
    "int16": {"promoted": "int32", "reinterpret": False},
    "uint16": {"promoted": "uint32", "reinterpret": False},
    "int32": {"promoted": "int32", "reinterpret": False},
    "uint32": {"promoted": "uint32", "reinterpret": False},
    "int64": {"promoted": "uint64", "reinterpret": True},
    "uint64": {"promoted": "uint64", "reinterpret": False},
    "float16": {"promoted": "float32", "reinterpret": False},
    "float32": {"promoted": "float32", "reinterpret": False},
    "float64": {"promoted": "float64", "reinterpret": False},
}
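As a rough illustration (the helper name and exact API here are hypothetical, not the PR's actual code), a table like this could be consulted before a `ufunc.at`-style scatter, with NumPy used so the sketch runs on CPU:

```python
import numpy as np

# Abbreviated copy of the promotion table; widening entries cast values,
# the int64 entry re-views the same bits as uint64.
PROMOTION = {
    "int8": {"promoted": "int32", "reinterpret": False},   # widen, keep values
    "int64": {"promoted": "uint64", "reinterpret": True},  # same width, re-view bits
    # ... remaining entries as in the table above
}

def promote_for_ufunc_at(arr):
    entry = PROMOTION[arr.dtype.name]
    if entry["reinterpret"]:
        # Same bit width, different signedness: reinterpret in place (no copy).
        return arr.view(entry["promoted"]), True
    # Widening promotion: a value-preserving cast.
    return arr.astype(entry["promoted"]), False

x, reinterpreted = promote_for_ufunc_at(np.array([-1, 2], dtype=np.int64))
assert x.dtype == np.uint64 and reinterpreted

y, widened = promote_for_ufunc_at(np.array([1, 2], dtype=np.int8))
assert y.dtype == np.int32 and not widened
```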
This PR currently relies on an unsafe promotion from int64 → uint64 in the CUDA backend.
This is intentional and not an oversight.
To answer my own question: awkward simply cannot use CuPy ufuncs, because we support a wide variety of dtypes that CuPy does not currently support. By contrast, CCCL already allows us to define functions that accept any dtype supported by awkward.
Regarding the axis=None reducers part of this PR, this can be done identically to #3653 for ALL other reducers for ALL backends. cc @pfackeldey since you implemented the original axis=None specialization.
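A conceptual sketch (the function name is hypothetical) of why the axis=None case is simpler than segmented reduction: every element belongs to one logical list, so `parents` would be all zeros and the reduction can run directly over the flattened buffer.

```python
import numpy as np

def reduce_axis_none(flat_buffer, ufunc=np.add):
    # No parents array, no segmented scatter: one reduce over the whole buffer.
    return ufunc.reduce(flat_buffer)

buf = np.arange(1, 5, dtype=np.int64)   # flattened leaf data: [1, 2, 3, 4]
assert reduce_axis_none(buf) == 10
assert reduce_axis_none(buf, np.multiply) == 24
```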
On top of what I said above, these reducer specializations don't need
Agree.
force-pushed from 31dd26f to 8a8900a
force-pushed from c0a287a to 417e89a
changed the title from "cupy.ufunc.at and atomic fallbacks" to "axis=None and lazy parents allocation"
This PR adds:

- cupy.ufunc.at calls for awkward_reduce_min, awkward_reduce_max, awkward_reduce_sum, and awkward_reduce_prod
- a dtype-promotion table that matches CuPy's ufunc.at support
- an axis=None specialization
- lazy parents allocation before calling the kernels: avoids allocating ak.index.Index64.zeros(layout.length) during the initial stages of reduce. For large arrays, this significantly reduces memory pressure and avoids $O(N)$ initialization costs.
- resolve_parents to handle the transition between the virtualized (None, length) representation and the materialized Index64 array
- reduce updated to initialize parents using the optimized tuple

Note: "Refactor C++ kernels" is going to be a separate PR to make sure that the changes are well tested. The reason is that some kernels do not need parents, only their length.
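A minimal sketch of the lazy-parents idea (the helper and tuple shape here are illustrative, assuming the virtualized (None, length) representation described above): the trivial all-zeros parents stay as a tuple, and the Index64-like array is materialized only when a kernel actually needs it.

```python
import numpy as np

def resolve_parents(parents):
    data, length = parents
    if data is None:
        # Deferred O(N) allocation happens here, and only if required.
        return np.zeros(length, dtype=np.int64)
    return data

lazy = (None, 5)                  # virtualized: no buffer allocated yet
parents = resolve_parents(lazy)   # materialized on demand
assert parents.shape == (5,) and parents.dtype == np.int64
assert resolve_parents((parents, 5)) is parents  # already materialized: no copy
```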