Releases: nv-legate/cupynumeric
v25.03.02
This is a beta release of cuPyNumeric.
Linux x86 and ARM builds for Python 3.10 - 3.12 are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, and as conda packages at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.03/.
New features
PIP install support
With this release, Linux x86 and ARM builds of cuPyNumeric are available for Python 3.10 - 3.12 as Python wheels on PyPI in addition to conda.
- cuPyNumeric can be installed with `pip install nvidia-cupynumeric`.
See https://docs.nvidia.com/cupynumeric/25.03/installation.html#installing-pypi-packages for further instructions.
- These wheels support multi-node execution through UCX.
See https://docs.nvidia.com/legate/25.03/networking-wheels.html for more details.
v25.03.00
This is a beta release of cuPyNumeric.
Linux x86 and ARM conda packages are available for this release at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.03/.
New features
Licensing
- With this release the Legate framework, on which cuPyNumeric is based, becomes open-source, under the Apache-2.0 license. This makes the entire cuPyNumeric stack (anything above the CUDA library level) open-source.
Added functionality
- Matrix exponential: `cupynumeric.linalg.expm`
- Batched eigendecomposition: `cupynumeric.linalg.eigvals` and `cupynumeric.linalg.eig`
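As a rough illustration of the batched eigendecomposition entry, the sketch below assumes that `cupynumeric.linalg.eigvals` accepts a stack of square matrices the same way `numpy.linalg.eigvals` does. The fallback import is an addition for this example only, so the snippet also runs where cuPyNumeric is not installed.

```python
# Sketch only: falls back to NumPy when cuPyNumeric is not installed.
try:
    import cupynumeric as np
except ImportError:
    import numpy as np

# A batch of three 2x2 diagonal matrices, shape (3, 2, 2).
batch = np.array([
    [[2.0, 0.0], [0.0, 3.0]],
    [[1.0, 0.0], [0.0, 4.0]],
    [[5.0, 0.0], [0.0, 6.0]],
])

# A single call computes the eigenvalues of every matrix in the batch.
eigenvalues = np.linalg.eigvals(batch)

# One row of eigenvalues per input matrix.
print(eigenvalues.shape)
```

For diagonal inputs like these, the eigenvalues are simply the diagonal entries, which makes the result easy to check by eye.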
Performance improvements
- Matrix multiplication no longer performs unnecessary streaming when running on a single processor/GPU.
UX improvements
- Add the `legate.core.ProfileRange` Python context manager, to annotate sub-spans within a larger task span on the profiler visualization.
- Add the `local_task_array` helper function, which can be used in Python tasks to create a view over a Store/Array argument, using a NumPy or CuPy array as appropriate based on the type of memory where the data is located.
Documentation improvements
- Add a user guide chapter on accelerating multi-GPU HDF5 workloads.
Known issues
- We are aware of possible performance regressions when using UCX 1.18. We are temporarily restricting our packages to UCX <= 1.17 while we investigate this.
v25.01.00
This is a beta release of cuPyNumeric.
Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.01/.
New features
Added functionality
- Add the `method` parameter to `cupynumeric.convolve`.
- Increase the maximum array dimension from 4 to 6.
- Experimental support for NumPy 2.0 (not reflected in the package constraints yet).
Memory management enhancements
- Updates to take advantage of the deferred-eager pool unification in Legate. This change has the potential to increase the effective available memory capacity by up to 100% for many use cases. It also removes the need for the user to adjust the `--eager-alloc-percentage` flag.
- Add the `offload_to()` API, which allows a user to offload an array to a particular memory kind, such that any copies in other memories are discarded. This can be useful e.g. to evict an array from GPU memory onto system memory, freeing up space for subsequent GPU tasks.
I/O improvements
- Use cuFile to accelerate HDF5 reads on the GPU.
- Add support for reading "binary" HDF5 datasets (in particular useful for reading boolean-type datasets).
UX Improvements
- Consider NUMA node topology when allocating CPU cores and memory during automatic machine configuration.
- Add environment variable `LEGATE_LIMIT_STDOUT`, to only print out the output from one of the copies of the top-level program in a multi-process execution.
- Remove an extraneous warning about `__buffer__` being unimplemented.
Deprecations
- Drop support for the Maxwell GPU architecture. cuPyNumeric now requires at least Pascal (`sm_60`).
v24.11.02
This is a patch release of cuPyNumeric.
Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/24.11/.
Packaging Changes
- Update for Legate v24.11.01
v24.11.01
This is a patch release of cuPyNumeric.
Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/24.11/.
Bug Fixes
- Explicit fallback to `__array__()` on `__buffer__`
v24.11.00
This is a beta release of cuPyNumeric.
Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/24.11/.
New features
Improved API coverage
- Implement `np.unravel_index`
- Implement `np.angle`
- Implement `np.median`
- Implement `np.ix_`
- Implement `np.meshgrid`
- Implement `np.expand_dims`
- Implement `np.rot90`
- Implement `np.round`
- Implement `np.fft.fftshift` and `np.fft.ifftshift`
- Implement `np.roll`
- Support `full_matrices` parameter of `np.linalg.svd`
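A few of the newly covered functions can be exercised as in the sketch below. It assumes the cuPyNumeric versions follow the NumPy signatures and return the same values. The fallback import is an addition for this example only, so the snippet also runs where cuPyNumeric is not installed.

```python
# Sketch only: falls back to NumPy when cuPyNumeric is not installed.
try:
    import cupynumeric as np
except ImportError:
    import numpy as np

a = np.arange(12).reshape(3, 4)

# Convert flat index 7 into (row, column) coordinates for a 3x4 array: (1, 3).
row, col = np.unravel_index(7, a.shape)

# Shift elements two positions to the right, wrapping around: [3, 4, 0, 1, 2].
rolled = np.roll(np.arange(5), 2)

# Swap the two halves of a 1-D array, as done for FFT output: [2, 3, 0, 1].
shifted = np.fft.fftshift(np.arange(4))

# Median of three values: 2.0.
med = np.median(np.array([3.0, 1.0, 2.0]))

# Coordinate grids for a 3-wide, 2-tall domain; both have shape (2, 3).
xx, yy = np.meshgrid(np.arange(3), np.arange(2))
```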
Memory management enhancements
- Memory efficient implementation of matrix multiplication - this implementation batches over the reduction dimension, achieving constant memory overhead regardless of array sizes.
- Memory efficiency for stencil computation - add `np.ndarray.stencil_hint` method, which instructs cuPyNumeric to pre-allocate the necessary space for ghost elements when an array is to be used in a stencil computation, reducing intermediate memory use.
- Memory allocation report - report the object-memory mapping when a computation runs out of memory, to help users debug and optimize memory usage.
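The matrix multiplication entry above mentions batching over the reduction dimension. The sketch below illustrates that general idea in plain NumPy. It is not cuPyNumeric's implementation, and `blocked_matmul` and `k_block` are hypothetical names chosen for this example.

```python
import numpy as np


def blocked_matmul(a, b, k_block=256):
    """Compute a @ b by accumulating partial products over slices of the
    shared (reduction) dimension. Illustrative only."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")

    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for start in range(0, k, k_block):
        stop = min(start + k_block, k)
        # Each step reads only a k_block-wide slice of each input,
        # so the input data consumed per step is bounded by k_block.
        out += a[:, start:stop] @ b[start:stop, :]
    return out


rng = np.random.default_rng(0)
a = rng.standard_normal((8, 1000))
b = rng.standard_normal((1000, 5))
result = blocked_matmul(a, b, k_block=128)
```

The accumulated result should match `a @ b` up to floating-point rounding, since the sum over the reduction dimension is only reordered into blocks.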
Enhanced infrastructure support
- GH200 Grace Hopper Superchip support - allows users to leverage GH200-based cloud instances and supercomputers.
- GASNet support - support GASNet as an alternative networking backend to UCX, using a GASNet wrapper, MPI wrapper, and custom build utilities.
- Initial HDF5 support - distributed read/write of HDF5 files using a POSIX backend.
- Automatic resource configuration at run time - automatically discover and use all the available compute resources including CPU, GPU, system memory, and framebuffer memory.
- More enhancements from Legate 24.11
Other
- Re-implement the RNG module on top of the C++ STL random library, removing the need to have cuRand in CPU-only installations.
Known Issues
cuPyNumeric will emit a false-positive warning like the following:
RuntimeWarning: cuPyNumeric has not implemented numpy.ndarray.__buffer__ and is falling back to canonical NumPy. You may notice significantly decreased performance for this function call.
in cases such as when an arithmetic operation is performed on a scalar array, e.g. `cupynumeric.array(42) * 2`. There is no actual performance degradation occurring in this case. We are working on a patch that will suppress this warning.
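Until a patched release is installed, one possible workaround is to filter the message on the user side. The sketch below assumes the warning is issued through Python's standard `warnings` machinery as a `RuntimeWarning`, as the message text suggests. The helper name `suppress_buffer_fallback_warning` is hypothetical.

```python
import warnings


def suppress_buffer_fallback_warning():
    """Ignore only the false-positive __buffer__ fallback warning.

    The pattern is matched against the start of the warning message,
    so other RuntimeWarnings are still reported.
    """
    warnings.filterwarnings(
        "ignore",
        message=r".*has not implemented numpy\.ndarray\.__buffer__.*",
        category=RuntimeWarning,
    )


# Call once, early in the program, before the affected operations run.
suppress_buffer_fallback_warning()
```

Filtering on the message text rather than on the whole `RuntimeWarning` category keeps genuine fallback warnings for other functions visible.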
v24.06.01
This is a patch release, and includes the following fixes:
- Fix for nv-legate/legate#947
- Fix package dependencies (cuda and openblas)
x86 conda packages with multi-node support (based on UCX) are available at https://anaconda.org/legate/cunumeric.
Documentation for this release can be found at https://docs.nvidia.com/cunumeric/24.06/.
v24.06.00
This release ports cuNumeric to the C++-based Legate-Core. Additionally, it includes the following new features:
- `np.linalg.qr`, `np.linalg.svd` (single-GPU support only)
- "where" argument for unary operations
- `np.select`
- `np.flipud`, `np.fliplr`
- `np.cov`
- `np.load` (initial, unoptimized implementation)
- `np.average`
- `np.logical_and/or.reduce`
- `np.digitize`
- `np.diff`
- `np.linalg.cholesky`, `np.linalg.solve` (multi-GPU support, based on cuSolverMp -- not included in conda packages, requires a manual build)
- C++-based `ndarray` class (experimental support)
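Several of the functions listed above can be tried as in the sketch below. It assumes the cuNumeric versions follow the NumPy signatures and return the same values. The fallback import is an addition for this example only, so the snippet also runs where cuNumeric is not installed.

```python
# Sketch only: falls back to NumPy when cuNumeric is not installed.
try:
    import cunumeric as np
except ImportError:
    import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0])

# Differences between consecutive elements: [3, 5, 7].
steps = np.diff(x)

# Index of the bin each value falls into, for edges 0, 5, 10: [1, 1, 2, 3].
bins = np.digitize(x, np.array([0.0, 5.0, 10.0]))

# Pick per-element values from the first matching condition: [0, 0, 1, 2].
labels = np.select(
    [x < 5, x < 10],
    [np.zeros_like(x), np.ones_like(x)],
    default=2.0,
)

# "where" argument on a unary operation: only entries with x > 4 are
# computed, the rest keep the value already stored in `out`: [0, 0, 3, 4].
roots = np.zeros_like(x)
np.sqrt(x, out=roots, where=x > 4)

# Reverse the order of the rows: [[3, 4], [1, 2]].
flipped = np.flipud(np.array([[1, 2], [3, 4]]))
```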
x86 conda packages with multi-node support (based on UCX) are available at https://anaconda.org/legate/cunumeric.
Documentation for this release can be found at https://docs.nvidia.com/cunumeric/24.06/.
Known issues
Including the `nvidia` conda channel in an environment with `cunumeric` may end up pulling `cutensor` 2.0, even though the `cunumeric` packages explicitly request `cutensor` 1.7. This can cause error messages like this:
OSError: libcutensor.so.1: cannot open shared object file: No such file or directory
This is not an issue with cuNumeric, but with incorrect constraints on the `cutensor` packages on the `nvidia` channel. Please avoid including the `nvidia` conda channel in any conda environment including `cunumeric`.
v23.11.00
This release contains performance improvements to the variance operation, and a multi-dimensional Cholesky implementation.
Conda packages for this release are available at https://anaconda.org/legate/cunumeric.
What's Changed
🚀 New Features
- Added variance as a unary reduction by @jjwilke in #593
- Add batched cholesky implementation and tests by @jjwilke in #1029
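The two features above can be sketched as follows, assuming the cuNumeric functions follow the NumPy signatures, including stacked inputs for `np.linalg.cholesky`. The fallback import is an addition for this example only, so the snippet also runs where cuNumeric is not installed.

```python
# Sketch only: falls back to NumPy when cuNumeric is not installed.
try:
    import cunumeric as np
except ImportError:
    import numpy as np

# Variance of a small sample (population variance by default): 1.25.
variance = np.var(np.array([1.0, 2.0, 3.0, 4.0]))

# A batch of two symmetric positive-definite 2x2 matrices, shape (2, 2, 2).
spd = np.array([
    [[4.0, 2.0], [2.0, 3.0]],
    [[9.0, 0.0], [0.0, 1.0]],
])

# One call factorizes every matrix in the batch into a lower-triangular L
# such that L @ L.T reproduces the corresponding input matrix.
lower = np.linalg.cholesky(spd)
```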
🐛 Bug Fixes
- Replacing set with OrderedSet to avoid control-replication violations by @ipdemes in #1054
- Inline boolean operators in NumPy are bitwise, not logical by @manopapad in #1057
- Fix #1065 ("where" fails with IndexError) by @manopapad in #1067
- Fixes #1069, #1070 (minor einsum bugs) by @manopapad in #1072
📖 Documentation
- Suggest using mamba over conda by @manopapad in #1068
Full Changelog: v23.09.00...v23.11.00
v23.09.00
This release adds support for the `quantile` API, and includes some performance and documentation improvements (notably a "Best Practices" guide).
Conda packages for this release are available at https://anaconda.org/legate/cunumeric.
What's Changed
🚀 New Features
- Quantile Implementation by @aschaffer in #664
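A minimal usage sketch of the quantile API, assuming it follows the `numpy.quantile` signature and default linear interpolation. The fallback import is an addition for this example only, so the snippet also runs where cuNumeric is not installed.

```python
# Sketch only: falls back to NumPy when cuNumeric is not installed.
try:
    import cunumeric as np
except ImportError:
    import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Quartiles of the sample; with linear interpolation these are [2, 3, 4].
quartiles = np.quantile(data, [0.25, 0.5, 0.75])
```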
🛠️ Improvements
- Add missing openmp variants to BitGenerator and UniqueReduce by @rohany in #1010
- Histogram refactor by @aschaffer in #1003
📖 Documentation
🐛 Bug Fixes
- Missing alignment on histogram call by @manopapad in #999
- Fix for control replication violation in test by @ipdemes in #1005
- Fix build instructions link by @bryevdv in #1014
- Add back None as an accepted value for axis on some type sigs by @manopapad in #1017
- If a scalar ufunc arg is cn.ndarray use its type directly by @manopapad in #1011
- Skip the docstrings for functions pulled from cloned modules by @manopapad in #1024
- Fix random test failures in CPU-only runs by @manopapad in #1025
- Don't cast histogram to int64 when density=True by @manopapad in #1042
- Explicitly cast result of shift binary operators by @manopapad in #1046
- Remove use of deprecated np.find_common_type by @manopapad in #1045
New Contributors
- @ajschmidt8 made their first contribution in #1035
Full Changelog: v23.07.00...v23.09.00