Release v0.29.0 · ml-explore/mlx

Highlights

Support for mxfp4 quantization (Metal, CPU)
More performance improvements, bug fixes, and features in the CUDA backend
mx.distributed supports an NCCL backend for CUDA
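
For readers unfamiliar with mxfp4, the following is a minimal pure-Python sketch of the idea, assuming that mxfp4 here refers to the OCP Microscaling MXFP4 layout: blocks of 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 float. The helper names (`quantize_block`, `dequantize_block`) are hypothetical and this is not MLX's kernel code; it only illustrates the arithmetic.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements sharing one scale (assumed MX block size)
E2M1_EMAX = 2    # exponent of the largest E2M1 magnitude: 6 = 1.5 * 2**2


def quantize_block(block):
    """Return (shared_exp, codes) for one block of floats (hypothetical helper)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2(max_abs)) = e - 1.
    _, e = math.frexp(max_abs)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for v in block:
        scaled = min(abs(v) / scale, FP4_VALUES[-1])  # saturate at the FP4 maximum
        # Nearest representable magnitude; ties go to the smaller value here,
        # which is a simplification of proper round-to-nearest-even.
        nearest = min(FP4_VALUES, key=lambda q: abs(q - scaled))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct approximate floats from the shared exponent and FP4 codes."""
    return [c * 2.0 ** shared_exp for c in codes]


weights = [i / 8 for i in range(-16, 16)]  # 32 sample values in [-2, 2)
shared_exp, codes = quantize_block(weights)
restored = dequantize_block(shared_exp, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Under these assumptions each block costs 32 × 4 bits for the elements plus 8 bits for the shared scale, or 4.25 bits per weight. In MLX itself the format is presumably selected through the quantization mode parameter added in #2499.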

What's Changed

[CUDA] Optimize set_mm_device_pointers for small ndim by @zcbenz in #2473
Fix logsumexp/softmax not fused for some cases by @zcbenz in #2474
Use CMake <4.1 to avoid the nvpl error by @angeloskath in #2489
Fix incorrect interpretation of unsigned dtypes in reduce ops by @abeleinin in #2477
make code blocks copyable by @Dan-Yeh in #2480
Rename cu::Matmul to CublasGemm by @zcbenz in #2488
Faster general unary op by @awni in #2472
The naive_conv_2d is no longer used by @zcbenz in #2496
Remove the hack around SmallVector in cpu compile by @zcbenz in #2494
Clean up code handling both std::vector and SmallVector by @zcbenz in #2493
[CUDA] Fix conv grads with groups by @zcbenz in #2495
Update cuDNN Frontend to v1.14 by @zcbenz in #2505
Ensure small sort doesn't use indices if not argsort by @angeloskath in #2506
Ensure no oob read in gemv_masked by @angeloskath in #2508
fix custom kernel test by @awni in #2510
No segfault with uninitialized array.at by @awni in #2514
Fix lapack svd by @awni in #2515
Split cuDNN helpers into a separate header by @zcbenz in #2491
[CUDA] Add GEMM-based fallback convolution kernels by @zcbenz in #2511
Fix docs by @russellizadi in #2518
Fix overflow in large filter small channels by @angeloskath in #2520
[CUDA] Fix stride of singleton dims before passing to cuDNN by @zcbenz in #2521
Custom cuda kernel by @angeloskath in #2517
Fix docs omission by @angeloskath in #2524
Fix power by @awni in #2523
NCCL backend by @nastya236 in #2476
[CUDA] Nccl pypi dep + default for cuda by @awni in #2526
Fix warning 186-D from nvcc by @zcbenz in #2527
[CUDA] Update calls to cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 by @andportnoy in #2525
nccl default for backend=any by @awni in #2528
Fix allocation bug in NCCL by @awni in #2530
Enable COMPILE_WARNING_AS_ERROR for linux builds in CI by @zcbenz in #2534
[CUDA] Remove thrust in arange by @zcbenz in #2535
Use nccl header only when nccl is not present by @awni in #2539
Allow pathlib.Path to save/load functions by @awni in #2541
Remove nccl install in release by @awni in #2542
[CUDA] Implement DynamicSlice/DynamicSliceUpdate by @zcbenz in #2533
Remove stream from average grads so it uses default by @awni in #2532
Enable cuda graph toggle by @awni in #2545
Tests for save/load with Path by @awni in #2543
Run CPP tests for CUDA build in CI by @zcbenz in #2544
Separate cpu compilation cache by versions by @zcbenz in #2548
[CUDA] Link with nccl by @awni in #2546
[CUDA] Use ConcurrentContext in concatenate_gpu by @zcbenz in #2549
[CUDA] fix sort by @awni in #2550
Add mode parameter for quantization by @awni in #2499
Bump xcode in circle by @awni in #2551
Fix METAL quantization in JIT + fix release build by @awni in #2553
Faster contiguous gather for indices in the first axis by @awni in #2552
version bump by @awni in #2554
Fix quantized vjp for mxfp4 by @awni in #2555

New Contributors

@Dan-Yeh made their first contribution in #2480
@russellizadi made their first contribution in #2518
@andportnoy made their first contribution in #2525

Full Changelog: v0.28.0...v0.29.0
