v0.26.0
Highlights
- 5-bit quantization (example below)
- Significant progress on the CUDA backend by @zcbenz
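A minimal sketch of the new 5-bit mode using the existing quantization API (the group size, shapes, and round-trip check below are illustrative, not from the release notes):

```python
import mlx.core as mx

# Quantize a weight matrix to 5 bits per element (new in this release).
# The last dimension must be divisible by group_size.
w = mx.random.normal((512, 1024))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=5)

# Round-trip to inspect the quantization error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=5)
print(mx.abs(w - w_hat).max())

# Quantized matmul takes the same group_size/bits parameters.
x = mx.random.normal((2, 1024))
y = mx.quantized_matmul(x, w_q, scales, biases, transpose=True, group_size=64, bits=5)
print(y.shape)  # (2, 512)
```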
Core
Features
- 5-bit quants
- Allow per-target Metal debug flags
- Add complex eigh
- Reduce vjp for `mx.all` and `mx.any`
- `real` and `imag` properties
- Non-symmetric `mx.linalg.eig` and `mx.linalg.eigh`
- Convolution vmap
- Add more complex unary ops (`sqrt`, `square`, ...)
- Complex scan
- Add `mx.broadcast_shapes`
- Added `output_padding` parameters in `conv_transpose`
- Add random normal distribution for complex numbers
- Add `mx.fft.fftshift` and `mx.fft.ifftshift` helpers (example below)
- Enable vjp for quantized scale and bias
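A few of the new user-facing ops in one place (a minimal sketch; the values are illustrative):

```python
import mlx.core as mx

# real/imag properties and the new complex unary ops.
z = mx.array([1 + 2j, -3 + 0.5j])
print(z.real, z.imag)
print(mx.sqrt(z), mx.square(z))

# Compute a broadcast shape without materializing arrays
# (assumed to mirror numpy.broadcast_shapes).
print(mx.broadcast_shapes((8, 1, 64), (1, 32, 1)))  # (8, 32, 64)

# Center the zero-frequency component of a spectrum and undo it.
x = mx.random.normal((128,))
spectrum = mx.fft.fftshift(mx.fft.fft(x))
x_back = mx.fft.ifft(mx.fft.ifftshift(spectrum))
print(mx.allclose(x, x_back.real, atol=1e-5))
```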
Performance
- Optimize complex matrix multiplication using Karatsuba's algorithm (sketch below)
- Much faster 1D conv
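For context, the Karatsuba trick evaluates a complex product with three real multiplications instead of four: with A = Ar + i Ai and B = Br + i Bi, it forms t1 = Ar Br, t2 = Ai Bi, t3 = (Ar + Ai)(Br + Bi) and returns (t1 - t2) + i (t3 - t1 - t2). Below is a minimal sketch of the algebra, not MLX's internal kernel, and it assumes the new complex normal sampling is exposed via `dtype=mx.complex64`:

```python
import mlx.core as mx

def complex_matmul_karatsuba(a: mx.array, b: mx.array):
    # Three real matmuls instead of four.
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    t1 = ar @ br
    t2 = ai @ bi
    t3 = (ar + ai) @ (br + bi)
    return t1 - t2, t3 - t1 - t2  # (real part, imaginary part)

# dtype=mx.complex64 is an assumed way to request the new complex normal.
a = mx.random.normal((16, 16), dtype=mx.complex64)
b = mx.random.normal((16, 16), dtype=mx.complex64)
re, im = complex_matmul_karatsuba(a, b)
c = a @ b
print(mx.allclose(re, c.real, atol=1e-3), mx.allclose(im, c.imag, atol=1e-3))
```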
CUDA
- Generalize gpu backend
- Use fallbacks in fast primitives when `eval_gpu` is not implemented
- Add memory cache to CUDA backend
- Do not check `event.is_signaled()` in `eval_impl`
- Build for compute capability 70 instead of 75 in CUDA backend
- CUDA backend: backbone
Bug Fixes
- Fix out-of-bounds default value in logsumexp/softmax
- Include `mlx::core::version()` symbols in the mlx static library
- Fix nearest upsample
- Fix large arg reduce
- Fix conv grad
- Fix some complex vjps
- Fix typo in row_reduce_small
- Fix `put_along_axis` for empty arrays
- Close a couple of edge case bugs: `hadamard` and `addmm` on empty inputs
- Fix fft for integer overflow with large batches
- Fix `conv_general` differences between GPU and CPU
- Fix batched vector sdpa
- GPU Hadamard for large N
- Improve bandwidth for elementwise ops
- Fix compile merging
- Fix shapeless export to throw on dim mismatch
- Fix `mx.linalg.pinv` for singular matrices
- Fix shift operations
- Fix integer overflow in qmm
Contributors
Thanks to some awesome contributors!
@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1