v0.29.0
Highlights
- Support for `mxfp4` quantization (Metal, CPU)
- More performance improvements, bug fixes, features in CUDA backend
- `mx.distributed` supports NCCL back-end for CUDA
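
For readers unfamiliar with the format behind the first highlight, here is a minimal pure-Python sketch of block-scaled FP4 quantization in the style of the OCP Microscaling (MX) formats: 32 elements share one power-of-two scale, and each element is stored as a 4-bit E2M1 value. This is not MLX's implementation. The function names `quantize_block` and `dequantize_block` are hypothetical, ties round toward the smaller magnitude for simplicity, and the values are kept as Python floats instead of being packed into bits.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # number of elements sharing one scale
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def quantize_block(block):
    """Return (shared_exponent, signed E2M1 values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E2M1_EMAX
    # Keep the exponent inside the range an 8-bit E8M0 scale can express.
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        # Clamp to the largest E2M1 magnitude, then snap to the nearest grid value.
        target = min(abs(x) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Invert quantize_block up to rounding and clamping error."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]
```

Because the scale is a pure power of two, dequantization is a single exponent shift per element; the trade-off is that values just below the next power of two (for example 7.0 in a block whose maximum is 7.0) are clamped to the top of the E2M1 range.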
What's Changed
- [CUDA] Optimize set_mm_device_pointers for small ndim by @zcbenz in #2473
- Fix logsumexp/softmax not fused for some cases by @zcbenz in #2474
- Use CMake <4.1 to avoid the nvpl error by @angeloskath in #2489
- Fix incorrect interpretation of unsigned dtypes in reduce ops by @abeleinin in #2477
- make code blocks copyable by @Dan-Yeh in #2480
- Rename cu::Matmul to CublasGemm by @zcbenz in #2488
- Faster general unary op by @awni in #2472
- The naive_conv_2d is no longer used by @zcbenz in #2496
- Remove the hack around SmallVector in cpu compile by @zcbenz in #2494
- Clean up code handling both std::vector and SmallVector by @zcbenz in #2493
- [CUDA] Fix conv grads with groups by @zcbenz in #2495
- Update cuDNN Frontend to v1.14 by @zcbenz in #2505
- Ensure small sort doesn't use indices if not argsort by @angeloskath in #2506
- Ensure no oob read in gemv_masked by @angeloskath in #2508
- fix custom kernel test by @awni in #2510
- No segfault with uninitialized array.at by @awni in #2514
- Fix lapack svd by @awni in #2515
- Split cuDNN helpers into a separate header by @zcbenz in #2491
- [CUDA] Add GEMM-based fallback convolution kernels by @zcbenz in #2511
- Fix docs by @russellizadi in #2518
- Fix overflow in large filter small channels by @angeloskath in #2520
- [CUDA] Fix stride of singleton dims before passing to cuDNN by @zcbenz in #2521
- Custom cuda kernel by @angeloskath in #2517
- Fix docs omission by @angeloskath in #2524
- Fix power by @awni in #2523
- NCCL backend by @nastya236 in #2476
- [CUDA] Nccl pypi dep + default for cuda by @awni in #2526
- Fix warning 186-D from nvcc by @zcbenz in #2527
- [CUDA] Update calls to `cudaMemAdvise` and `cudaGraphAddDependencies` for CUDA 13 by @andportnoy in #2525
- nccl default for backend=any by @awni in #2528
- Fix allocation bug in NCCL by @awni in #2530
- Enable COMPILE_WARNING_AS_ERROR for linux builds in CI by @zcbenz in #2534
- [CUDA] Remove thrust in arange by @zcbenz in #2535
- Use nccl header only when nccl is not present by @awni in #2539
- Allow pathlib.Path to save/load functions by @awni in #2541
- Remove nccl install in release by @awni in #2542
- [CUDA] Implement DynamicSlice/DynamicSliceUpdate by @zcbenz in #2533
- Remove stream from average grads so it uses default by @awni in #2532
- Enable cuda graph toggle by @awni in #2545
- Tests for save/load with `Path` by @awni in #2543
- Run CPP tests for CUDA build in CI by @zcbenz in #2544
- Separate cpu compilation cache by versions by @zcbenz in #2548
- [CUDA] Link with nccl by @awni in #2546
- [CUDA] Use ConcurrentContext in concatenate_gpu by @zcbenz in #2549
- [CUDA] fix sort by @awni in #2550
- Add mode parameter for quantization by @awni in #2499
- Bump xcode in circle by @awni in #2551
- Fix METAL quantization in JIT + fix release build by @awni in #2553
- Faster contiguous gather for indices in the first axis by @awni in #2552
- version bump by @awni in #2554
- Fix quantized vjp for mxfp4 by @awni in #2555
New Contributors
- @Dan-Yeh made their first contribution in #2480
- @russellizadi made their first contribution in #2518
- @andportnoy made their first contribution in #2525
Full Changelog: v0.28.0...v0.29.0