Releases: ml-explore/mlx

v0.29.3

17 Oct 19:11
4bce5f9

โญ๏ธ

v0.29.2

26 Sep 22:51
7a6adda

⬆️

v0.29.1

12 Sep 00:12
ee18e1c

🚀

v0.29.0

29 Aug 17:08
8ce49cd

Highlights

  • Support for mxfp4 quantization (Metal, CPU)
  • More performance improvements, bug fixes, features in CUDA backend
  • mx.distributed supports NCCL back-end for CUDA
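
As a rough illustration of what the mxfp4 format in the first highlight involves, the sketch below quantizes one block in pure Python. It assumes the OCP Microscaling convention (one shared power-of-two scale per block, elements stored as 4-bit E2M1 values). The helper names and the nearest-value rounding are assumptions made for illustration and do not reflect MLX's kernels or API.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Exponent of the largest E2M1 binade (values 4 and 6 live in 2**2).
FP4_MAX_EXPONENT = 2


def quantize_block_mxfp4(block):
    """Return (shared_exponent, fp4_values) for one block of floats.

    Hypothetical helper: real MX blocks hold 32 elements, which is not
    enforced here.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Shared scale is a power of two chosen so the largest element fits.
    shared_exp = math.floor(math.log2(max_abs)) - FP4_MAX_EXPONENT
    scale = 2.0 ** shared_exp
    values = []
    for x in block:
        # Clamp to the largest representable magnitude, then round to nearest.
        magnitude = min(abs(x) / scale, FP4_MAGNITUDES[-1])
        nearest = min(FP4_MAGNITUDES, key=lambda v: abs(v - magnitude))
        values.append(math.copysign(nearest, x))
    return shared_exp, values


def dequantize_block_mxfp4(shared_exp, values):
    """Reconstruct approximate floats from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [v * scale for v in values]
```

Because the scale is a power of two, dequantization is a plain exponent shift, which is part of what makes block formats like this cheap to decode.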

What's Changed

New Contributors

Full Changelog: v0.28.0...v0.29.0

v0.28.0

07 Aug 07:50
56be773

Highlights

  • First version of fused sdpa vector for CUDA
  • Convolutions in CUDA
  • Speed improvements in CUDA normalization layers, softmax, compiled kernels, overheads and more

What's Changed

New Contributors

Full Changelog: v0.27.1...v0.28.0

v0.27.1

25 Jul 22:48
4ad5341

Highlights

  • Initial PyPI release of the CUDA back-end.
  • CUDA back-end works well with mlx-lm:
    • Reasonably fast for LLM inference
    • Supports single-machine training and LoRA fine-tuning

What's Changed

v0.26.5

18 Jul 22:20
84b4d96

🚀

v0.26.3

08 Jul 21:26
fb4e8b8

🚀

v0.26.2

01 Jul 22:08
58f3860

🚀

v0.26.0

02 Jun 23:24
0408ba0

Highlights

  • 5-bit quantization
  • Significant progress on CUDA back-end by @zcbenz
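
To make the 5-bit quantization highlight concrete, here is a minimal pure-Python sketch of affine quantization of one group of weights to 5-bit codes (32 levels) with a per-group scale and bias. The helper names are hypothetical, and the group size and bit packing that MLX actually uses are not modeled.

```python
def quantize_affine(group, bits=5):
    """Map floats to integer codes in [0, 2**bits - 1] with a scale and bias.

    Hypothetical helper for illustration only.
    """
    levels = (1 << bits) - 1  # 31 for 5 bits
    lo, hi = min(group), max(group)
    # Fall back to a unit scale when every value in the group is identical.
    scale = (hi - lo) / levels or 1.0
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in group]
    return codes, scale, lo


def dequantize_affine(codes, scale, bias):
    """Reconstruct approximate floats as code * scale + bias."""
    return [c * scale + bias for c in codes]
```

With round-to-nearest, each reconstructed value lies within half a quantization step (scale / 2) of the original, so an extra bit roughly halves the worst-case error for the same group.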

Core

Features

  • 5-bit quantization
  • Allow per-target Metal debug flags
  • Add complex eigh
  • reduce vjp for mx.all and mx.any
  • real and imag properties
  • Non-symmetric mx.linalg.eig and mx.linalg.eigh
  • convolution vmap
  • Add more complex unary ops (sqrt, square, ...)
  • Complex scan
  • Add mx.broadcast_shapes
  • Added output_padding parameters in conv_transpose
  • Add random normal distribution for complex numbers
  • Add mx.fft.fftshift and mx.fft.ifftshift helpers
  • Enable vjp for quantized scale and bias
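
For readers unfamiliar with the shape and FFT helpers named in the list above, the following pure-Python sketch shows the behavior such helpers conventionally have. It assumes NumPy-style semantics for broadcasting and for the 1-D shift. It does not call MLX, and the functions are illustrative stand-ins rather than the library's implementation.

```python
from itertools import zip_longest


def broadcast_shapes(*shapes):
    """Return the shape that results from broadcasting the given shapes."""
    result = []
    # Align shapes on their trailing dimensions; missing dims count as 1.
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        non_one = {d for d in dims if d != 1}
        if len(non_one) > 1:
            raise ValueError(f"Shapes {shapes} cannot be broadcast together")
        result.append(non_one.pop() if non_one else 1)
    return tuple(reversed(result))


def fftshift(xs):
    """Move the zero-frequency entry of a 1-D list to the center."""
    shift = len(xs) // 2
    return xs[-shift:] + xs[:-shift]


def ifftshift(xs):
    """Undo fftshift, including for odd-length inputs."""
    shift = len(xs) // 2
    return xs[shift:] + xs[:shift]
```

For example, `broadcast_shapes((6, 7), (5, 6, 1), (7,))` gives `(5, 6, 7)`, and `fftshift([0, 1, 2, -2, -1])` gives `[-2, -1, 0, 1, 2]`.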

Performance

  • Optimizing Complex Matrix Multiplication using Karatsuba's Algorithm
  • Much faster 1D conv
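
The Karatsuba idea mentioned above can be sketched in a few lines: a complex matrix product is computed with three real matrix products instead of four. The code below is a pure-Python illustration on nested lists with hypothetical helper names. It shows the arithmetic identity only and makes no claim about how MLX implements it.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Elementwise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    """Elementwise difference of two matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def complex_matmul_3m(ar, ai, br, bi):
    """Multiply (ar + i*ai) by (br + i*bi) using three real matmuls.

    real = ar@br - ai@bi
    imag = (ar + ai)@(br + bi) - ar@br - ai@bi
    """
    p1 = matmul(ar, br)
    p2 = matmul(ai, bi)
    p3 = matmul(add(ar, ai), add(br, bi))
    real = sub(p1, p2)
    imag = sub(sub(p3, p1), p2)
    return real, imag
```

The trade is one fewer matrix multiplication for a few extra elementwise additions, which pays off because the multiplications dominate the cost for large matrices.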

CUDA

  • Generalize gpu backend
  • Use fallbacks in fast primitives when eval_gpu is not implemented
  • Add memory cache to CUDA backend
  • Do not check event.is_signaled() in eval_impl
  • Build for compute capability 70 instead of 75 in CUDA backend
  • CUDA backend: backbone

Bug Fixes

  • Fix out-of-bounds default value in logsumexp/softmax
  • include mlx::core::version() symbols in the mlx static library
  • Fix Nearest upsample
  • Fix large arg reduce
  • fix conv grad
  • Fix some complex vjps
  • Fix typo in row_reduce_small
  • Fix put_along_axis for empty arrays
  • Close a couple edge case bugs: hadamard and addmm on empty inputs
  • Fix fft for integer overflow with large batches
  • fix: conv_general differences between gpu, cpu
  • Fix batched vector sdpa
  • GPU Hadamard for large N
  • Improve bandwidth for elementwise ops
  • Fix compile merging
  • Fix shapeless export to throw on dim mismatch
  • Fix mx.linalg.pinv for singular matrices
  • Fixed shift operations
  • Fix integer overflow in qmm

Contributors

Thanks to some awesome contributors!

@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1