Releases: nv-legate/cupynumeric

v25.10.00

30 Oct 21:23
66d872d

This is a beta release of cuPyNumeric.

Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.10/.

Highlights

Added functionality

  • Implement cupynumeric.in1d.
  • Add DLPack import/export support to cuPyNumeric ndarrays.
  • Allow batched input for cupynumeric.linalg.solve.
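
As a rough illustration of the batched cupynumeric.linalg.solve entry above, the sketch below solves a stack of small systems in one call. It assumes cuPyNumeric follows NumPy's stacking convention (leading dimensions are batch dimensions) and falls back to NumPy when cuPyNumeric is not installed. The matrix values are made up for illustration.

```python
try:
    import cupynumeric as np  # assumed to mirror NumPy for these calls
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

# Three independent 2x2 systems, stacked along the leading (batch) axis.
a = np.array([
    [[2.0, 0.0], [0.0, 4.0]],
    [[1.0, 1.0], [0.0, 1.0]],
    [[3.0, 1.0], [1.0, 2.0]],
])
# One right-hand-side column per system, shape (3, 2, 1).
b = np.array([
    [[2.0], [4.0]],
    [[3.0], [1.0]],
    [[5.0], [5.0]],
])

x = np.linalg.solve(a, b)  # one solution column per system

# Check every system at once: a[i] @ x[i] should reproduce b[i].
residual_ok = bool(np.allclose(np.matmul(a, x), b))
print(x.shape, residual_ok)
```

Keeping the right-hand side three-dimensional avoids any ambiguity about whether a 2-D b is read as a stack of vectors or as a single matrix.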

Performance improvements

  • Optimized implementation for the special axis= case of cupynumeric.take.
  • Improve heuristics for choosing between batched and unbatched matrix multiplication.
  • Improved implementation of cupynumeric.nonzero that uses no additional scratch space.
  • Identify special cases of advanced indexing that can be executed faster using cupynumeric.einsum.

Documentation / profiling

  • Add a tutorial on using Legate Tasks to extend cuPyNumeric.
  • Add a user warning when an operation (e.g. printing to the console) causes a sharded array to be gathered onto a single memory.
  • Add sub-boxes to the Legate profiler, showing how long the Python interpreter spends inside cuPyNumeric API calls.

Breaking changes

  • Move nightly conda packages to a dedicated channel, -c legate-nightly.

Known issues

  • We are aware of hangs occurring on certain platforms and UCC configurations when using cuSolverMp-backed multi-GPU operations (Cholesky factorization and linear solve). We expect these to be fixed by the 25.11 release, which updates to cuSolverMp 0.7.

Full Changelog: v25.08.00...v25.10.00

v25.08.00

05 Sep 07:38
7146e78

This is a beta release of cuPyNumeric.

Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.08/.

New features

Added functionality

  • Multi-node multi-GPU capable SVD, specialized for tall-skinny matrices
  • cupynumeric.cross
  • cupynumeric.insert
  • cupynumeric.logspace
  • cupynumeric.real_if_close
  • cupynumeric.roots
  • cupynumeric.ravel_multi_index
  • cupynumeric.copyto
  • cupynumeric.diagflat
  • cupynumeric.delete
  • cupynumeric.nan_to_num
  • Support multi-axis reductions

Performance Improvements

  • Improve robustness & speed of cupynumeric.sort, by combining allocations where possible, and adding synchronization barriers around NCCL collectives.
  • Remove some extraneous blocking that was only necessary to match the behavior of NumPy 1.x.
  • Improve performance of NumPy fallback, in particular removing extraneous array copies, and adding special cases for quick fallback to functions such as cupynumeric.concatenate.

Miscellaneous

  • Unify all environment variables that control cuPyNumeric's NumPy fallback heuristics into a single one, CUPYNUMERIC_MAX_EAGER_VOLUME.
  • Allow any available BLAS implementation to be used in a source build.

Full Changelog: v25.07.00...v25.08.00

v25.07.00

09 Jul 18:36
6132d84

This is a beta release of cuPyNumeric.

Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.07/.

New features

Added functionality

  • Multi-node multi-GPU capable cupynumeric.linalg.solve and cupynumeric.linalg.cholesky, backed by cuSolverMp.
  • Single-GPU cupynumeric.linalg.eigh/eigvalsh, backed by cuSolver.
  • cupynumeric.round
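
The linear algebra entries above can be exercised with a small symmetric positive-definite matrix. This is a minimal sketch that assumes the cuPyNumeric functions share NumPy's signatures and return conventions; it falls back to NumPy when cuPyNumeric is not installed, and it says nothing about which backend (cuSolverMp, cuSolver) is actually selected at run time.

```python
try:
    import cupynumeric as np  # assumed to mirror NumPy for these calls
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

# A small symmetric positive-definite matrix (illustrative values).
a = np.array([[4.0, 2.0], [2.0, 3.0]])

# Cholesky factor: lower-triangular matrix with lower @ lower.T == a.
lower = np.linalg.cholesky(a)

# Eigendecomposition of a symmetric matrix, and eigenvalues only.
w, v = np.linalg.eigh(a)
w_only = np.linalg.eigvalsh(a)

# Rebuild a from its eigenpairs: V diag(w) V^T.
rebuilt = np.matmul(v * w, v.T)

# Round to one decimal place.
rounded = np.round(np.array([1.234, 5.678]), 1)

print(lower, w, rounded)
```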

Support matrix changes

  • macOS wheels are now available on PyPI.
  • Add support for Blackwell CUDA architecture and MNNVL.
  • Drop support for Python 3.10 and add support for Python 3.13.
  • Remove NumPy 1.X restriction from packages (now compatible with NumPy 2.X).

Tuning

Documentation

Full Changelog: v25.03.02...v25.07.00

Known issues

  • Multi-node runs can occasionally segfault at exit. This issue is under investigation. Preliminary investigation suggests that the issue depends on the ordering between cuPyNumeric and OpenBLAS teardown. There is no impact on the correctness of the computation or on subsequent GPU usage.
  • If the user explicitly forces multi-GPU execution of a sorting operation on very small arrays (about as many elements as the number of GPUs), this can result in CUDA errors. Under normal conditions cuPyNumeric would not GPU-accelerate operations of this size. A fix for this issue is in development and will be made available in an upcoming nightly build.

v25.03.02

09 Apr 19:08
1fa4560

This is a beta release of cuPyNumeric.

Linux x86 and ARM builds for Python 3.10 - 3.12 are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, and as conda packages at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.03/.

New features

PIP install support

With this release, Linux x86 and ARM builds of cuPyNumeric are available for Python 3.10 - 3.12 as Python wheels on PyPI in addition to conda.

v25.03.00

17 Mar 23:04
e6be689

This is a beta release of cuPyNumeric.

Linux x86 and ARM conda packages are available for this release at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.03/.

New features

Licensing

  • With this release the Legate framework, on which cuPyNumeric is based, becomes open-source, under the Apache-2.0 license. This makes the entire cuPyNumeric stack (anything above the CUDA library level) open-source.

Added functionality

  • Matrix exponential: cupynumeric.linalg.expm
  • Batched eigendecomposition: cupynumeric.linalg.eigvals & cupynumeric.linalg.eig
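
A minimal sketch of batched eigenvalue computation, assuming cupynumeric.linalg.eigvals accepts stacked matrices the same way numpy.linalg.eigvals does. It falls back to NumPy when cuPyNumeric is not installed. Triangular matrices are used so the expected eigenvalues are simply the diagonal entries; cupynumeric.linalg.expm is not shown because NumPy has no equivalent to fall back to.

```python
try:
    import cupynumeric as np  # assumed to mirror NumPy for these calls
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

# Two triangular 2x2 matrices stacked along the leading (batch) axis.
batch = np.array([
    [[2.0, 1.0], [0.0, 3.0]],
    [[5.0, 0.0], [4.0, 7.0]],
])

# One row of eigenvalues per matrix in the batch.
w = np.linalg.eigvals(batch)

# Eigenvalue order is not guaranteed, so sort the real parts per matrix.
w_sorted = np.sort(w.real, axis=-1)
print(w_sorted)
```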

Performance improvements

  • Avoid unnecessary streaming when running matrix multiplication on a single processor/GPU.

UX improvements

  • Add the legate.core.ProfileRange Python context manager, to annotate sub-spans within a larger task span on the profiler visualization.
  • Add the local_task_array helper function, which can be used in Python tasks to create a view over a Store/Array argument, using a NumPy or CuPy array as appropriate based on the type of memory where the data is located.

Documentation improvements

Known issues

  • We are aware of possible performance regressions when using UCX 1.18. We are temporarily restricting our packages to UCX <= 1.17 while we investigate this.

v25.01.00

08 Feb 06:20
0464776

This is a beta release of cuPyNumeric.

Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.01/.

New features

Added functionality

  • Add the method parameter to cupynumeric.convolve.
  • Increase the maximum array dimension from 4 to 6.
  • Experimental support for NumPy 2.0 (not reflected in the package constraints yet).

Memory management enhancements

  • Updates to take advantage of the deferred-eager pool unification in Legate. This change has the potential to increase the effective available memory capacity by up to 100% for many use cases. It also removes the need for the user to adjust --eager-alloc-percentage.
  • Add the offload_to() API, which allows a user to offload an array to a particular memory kind, such that any copies in other memories are discarded. This can be useful e.g. to evict an array from GPU memory onto system memory, freeing up space for subsequent GPU tasks.

I/O improvements

  • Use cuFile to accelerate HDF5 reads on the GPU.
  • Add support for reading "binary" HDF5 datasets (in particular useful for reading boolean-type datasets).

UX Improvements

  • Consider NUMA node topology when allocating CPU cores and memory during automatic machine configuration.
  • Add environment variable LEGATE_LIMIT_STDOUT, to only print out the output from one of the copies of the top-level program in a multi-process execution.
  • Remove an extraneous warning about __buffer__ being unimplemented.

Deprecations

  • Drop support for the Maxwell GPU architecture. cuPyNumeric now requires at least Pascal (sm_60).

v24.11.02

07 Dec 06:44
0bc7ba6

This is a patch release of cuPyNumeric.

Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/24.11/.

Packaging Changes

  • Update for Legate v24.11.01

v24.11.01

07 Dec 06:42
1207434

This is a patch release of cuPyNumeric.

Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/24.11/.

Bug Fixes

  • Explicit fallback to __array__() on __buffer__

v24.11.00

17 Nov 00:51
eedb7e1

This is a beta release of cuPyNumeric.

Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/24.11/.

New features

Improved API coverage

  • Implement np.unravel_index
  • Implement np.angle
  • Implement np.median
  • Implement np.ix_
  • Implement np.meshgrid
  • Implement np.expand_dims
  • Implement np.rot90
  • Implement np.round
  • Implement np.fft.fftshift and np.fft.ifftshift
  • Implement np.roll
  • Support full_matrices parameter of np.linalg.svd
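
A quick tour of several of the functions listed above. This sketch assumes the cuPyNumeric implementations match NumPy's semantics and falls back to NumPy when cuPyNumeric is not installed. The inputs are illustrative only.

```python
try:
    import cupynumeric as np  # assumed to mirror NumPy for these calls
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

# Convert flat index 5 into (row, column) coordinates of a 2x3 grid.
idx = np.unravel_index(5, (2, 3))

# Shift elements two positions to the right, wrapping around.
rolled = np.roll(np.arange(5), 2)

# Rotate a 2x2 matrix by 90 degrees counterclockwise.
rotated = np.rot90(np.array([[1, 2], [3, 4]]))

# Add a leading axis of length 1.
row = np.expand_dims(np.arange(3), 0)

# Median of an even-length array is the mean of the two middle values.
mid = np.median(np.array([1.0, 3.0, 2.0, 4.0]))

# Move the zero-frequency position to the center of the array.
centered = np.fft.fftshift(np.arange(4))

# Coordinate grids for a 2-row, 3-column domain.
xx, yy = np.meshgrid(np.arange(3), np.arange(2))

print(idx, rolled, rotated, row.shape, mid, centered, xx.shape)
```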

Memory management enhancements

  • Memory efficient implementation of matrix multiplication - this implementation batches over the reduction dimension, achieving constant memory overhead regardless of array sizes.
  • Memory efficiency for stencil computation - add np.ndarray.stencil_hint method, that instructs cuPyNumeric to pre-allocate the necessary space for ghost elements when an array is to be used in a stencil computation, reducing intermediate memory use.
  • Memory allocation report - report the object-memory mapping when a computation runs out of memory, to help users debug and optimize memory usage.

Enhanced infrastructure support

  • GH200 Grace Hopper Superchip support - allows users to leverage GH200-based cloud instances and supercomputers.
  • GASNet support - support GASNet as an alternative networking backend to UCX, using a GASNet wrapper, MPI wrapper, and custom build utilities.
  • Initial HDF5 support - distributed read/write of HDF5 files using a POSIX backend.
  • Automatic resource configuration at run time - automatically discover and use all the available compute resources including CPU, GPU, system memory, and framebuffer memory.
  • More enhancements from Legate 24.11

Other

  • Re-implement the RNG module on top of the C++ STL random library, removing the need to have cuRand in CPU-only installations.

Known Issues

cuPyNumeric will emit a false-positive warning like the following:

RuntimeWarning: cuPyNumeric has not implemented numpy.ndarray.__buffer__ and is falling back to canonical NumPy. You may notice significantly decreased performance for this function call.

in cases such as when an arithmetic operation is performed on a scalar array, e.g. cupynumeric.array(42) * 2. There is no actual performance degradation occurring in this case. We are working on a patch that will suppress this warning.
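
Until a patch is available, one possible workaround is to filter the warning with Python's standard warnings module. This is a sketch, not an official recommendation: the message pattern is an assumption derived from the warning text quoted above, and the import falls back to NumPy (which emits no such warning) when cuPyNumeric is not installed.

```python
import warnings

try:
    import cupynumeric as np  # the library that emits the warning
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

with warnings.catch_warnings():
    # Ignore only RuntimeWarnings whose text mentions __buffer__
    # (assumed pattern, based on the message quoted in the release notes).
    warnings.filterwarnings(
        "ignore",
        message=r".*__buffer__",
        category=RuntimeWarning,
    )
    result = np.array(42) * 2

print(int(result))
```

Scoping the filter with catch_warnings keeps other RuntimeWarnings visible outside the block.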

v24.06.01

11 Sep 20:36
370f766

This is a patch release, and includes the following fixes:

x86 conda packages with multi-node support (based on UCX) are available at https://anaconda.org/legate/cunumeric.

Documentation for this release can be found at https://docs.nvidia.com/cunumeric/24.06/.