v2.5.0

Latest

mattmartineau released this 21 Dec 15:42

cc1cebd

Summary of Key Changes in AMGX (v2.4.0 → v2.5.0)

CUDA Upgrades

Blackwell (B200/GB200/RTX Pro 6000) support
CUDA 13 support
Minimum CUDA version raised from 10.0 to 12.0, tested up to 13.0
Dropped support for older GPU architectures: SM20, SM35, SM52, SM60
New minimum: Volta (SM70)+, with support for SM75, SM80, SM86, SM89, SM90, SM100, SM120
- Tested on Hopper, Blackwell (incl. RTX Pro 6000)
Consolidated and removed repetitive / redundant architecture-specific code
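
The architecture list above can be turned into a simple lookup. The following is a minimal sketch, not code from AMGX: the helper name sm_is_listed and the table kSupportedSm are hypothetical, and the table only mirrors the compute capabilities named in these notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Compute capabilities named as supported in the v2.5.0 notes,
 * encoded as major * 10 + minor (for example 8.6 -> 86). */
static const int kSupportedSm[] = {70, 75, 80, 86, 89, 90, 100, 120};

/* Hypothetical helper (not part of AMGX): returns true when the given
 * compute capability appears in the table above. */
bool sm_is_listed(int major, int minor)
{
    assert(major >= 0 && minor >= 0 && minor < 10);

    const int sm = major * 10 + minor;
    const size_t count = sizeof(kSupportedSm) / sizeof(kSupportedSm[0]);

    for (size_t i = 0; i < count; ++i) {
        if (kSupportedSm[i] == sm) {
            return true;
        }
    }
    return false;
}
```

In a real program the major/minor pair would typically come from the CUDA runtime's device properties; here it is passed in directly so the sketch builds without the CUDA toolkit.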

Build System Changes

Deprecated CUDA_ARCH in favor of standard CMAKE_CUDA_ARCHITECTURES
Removed Thrust submodule dependency - now uses system/CUDA-bundled Thrust
Removed OpenMP dependency entirely
Removed NVTX linking in favor of NVTX3

cuSPARSE API Updates

Removed mixed-precision support, including DISABLE_MIXED_PRECISION
Consolidated to use only the generic cuSPARSE SpGEMM interfaces and added several related flags
- use_cusparse_spgemm, cusparse_spgemm_alg, etc.
Removed legacy cusparseCsrgemm2 wrapper implementations

Error Handling Improvements

New AMGX_CHECK_API_ERROR_NORSRC macro for resource-independent error checks
Improved error handling throughout

Memory Management

Added runtime detection for cudaMallocAsync support via cudaDevAttrMemoryPoolsSupported
Added fallback behavior for devices that do not support async memory pools

Performance Optimizations

Optimized hash_set insertion
Fixed performance bug in fill_A_kernel_1x1

Bug Fixes

MPI communicator duplication (comm dup) bug
Convergence check for absolute testing against relative
Block size handling in distributed_arranger resize of A
Block size handling in renumbering and reordering components
Bug in scaled norm factor calculation
Output performance of the matrix writer
Integer overflow in dense LU
Handling of latency hiding to use global row count
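
The notes do not say where the dense LU overflow occurred, so the following is only a sketch of the general class of bug, not the actual AMGX fix. It assumes a row-major dense matrix addressed by a flat offset; the helper name dense_offset is hypothetical. With 32-bit arithmetic, row * n exceeds INT32_MAX once the matrix has more than about 46,340 rows, so the operands are widened before multiplying.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not AMGX code): flat offset of element (row, col)
 * in a row-major dense n x n matrix.
 * The operands are widened to 64 bits BEFORE the multiplication; doing
 * row * n in 32-bit int would overflow for large n. */
int64_t dense_offset(int32_t row, int32_t col, int32_t n)
{
    assert(n > 0 && row >= 0 && col >= 0 && row < n && col < n);
    return (int64_t)row * (int64_t)n + (int64_t)col;
}
```

Casting only the result, as in (int64_t)(row * n), would not help, because the overflow happens inside the 32-bit multiplication before the cast is applied.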

MPI Example Enhancements (e.g., amgx_mpi_capi.c)

Added -cd flag for diagonal dominance checking
Added -om flag for matrix output/writing
Added -r flag for performance repeat runs
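
The notes do not spell out which criterion the -cd flag applies. As an illustration only, the sketch below checks weak row diagonal dominance, |a_ii| >= sum of |a_ij| over j != i, on a scalar CSR matrix. The function name csr_is_diag_dominant and its signature are hypothetical and are not taken from amgx_mpi_capi.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Absolute value without pulling in libm. */
static double abs_val(double x)
{
    return x < 0.0 ? -x : x;
}

/* Hypothetical helper (not the AMGX implementation): returns true when
 * every row i of an n x n CSR matrix satisfies
 *     |a_ii| >= sum over j != i of |a_ij|.
 * Duplicate diagonal entries are summed; a missing diagonal counts as 0. */
bool csr_is_diag_dominant(int n, const int *row_ptr, const int *col_idx,
                          const double *values)
{
    assert(n >= 0 && row_ptr != NULL && col_idx != NULL && values != NULL);

    for (int i = 0; i < n; ++i) {
        double diag = 0.0;
        double off_diag_sum = 0.0;

        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            if (col_idx[k] == i) {
                diag += values[k];
            } else {
                off_diag_sum += abs_val(values[k]);
            }
        }
        if (abs_val(diag) < off_diag_sum) {
            return false;   /* row i is not diagonally dominant */
        }
    }
    return true;
}
```

A distributed check would also need to account for off-process columns; this sketch covers a single local matrix only.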
