v0.26.0
Highlights
- 5-bit quantization (example below)
- Significant progress on the CUDA backend by @zcbenz
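A minimal sketch of the new 5-bit mode using the existing quantization API (the group size, shapes, and round-trip check below are illustrative, not from the release notes):

```python
import mlx.core as mx

# Quantize a weight matrix to 5 bits per element (new in this release).
# The last dimension must be divisible by group_size.
w = mx.random.normal((512, 1024))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=5)

# Round-trip to inspect the quantization error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=5)
print(mx.abs(w - w_hat).max())

# Quantized matmul takes the same group_size/bits parameters.
x = mx.random.normal((2, 1024))
y = mx.quantized_matmul(x, w_q, scales, biases, transpose=True, group_size=64, bits=5)
print(y.shape)  # (2, 512)
```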
Core
Features
- 5-bit quants
- Allow per-target Metal debug flags
- Add complex eigh
- Reduce vjp for `mx.all` and `mx.any`
- `real` and `imag` properties
- Non-symmetric `mx.linalg.eig` and `mx.linalg.eigh`
- Convolution vmap
- Add more complex unary ops (`sqrt`, `square`, ...)
- Complex scan
- Add `mx.broadcast_shapes`
- Added `output_padding` parameters in `conv_transpose`
- Add random normal distribution for complex numbers
- Add `mx.fft.fftshift` and `mx.fft.ifftshift` helpers (example below)
- Enable vjp for quantized scale and bias
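A few of the new user-facing ops in one place (a minimal sketch; the values are illustrative):

```python
import mlx.core as mx

# real/imag properties and the new complex unary ops.
z = mx.array([1 + 2j, -3 + 0.5j])
print(z.real, z.imag)
print(mx.sqrt(z), mx.square(z))

# Compute a broadcast shape without materializing arrays
# (assumed to mirror numpy.broadcast_shapes).
print(mx.broadcast_shapes((8, 1, 64), (1, 32, 1)))  # (8, 32, 64)

# Center the zero-frequency component of a spectrum and undo it.
x = mx.random.normal((128,))
spectrum = mx.fft.fftshift(mx.fft.fft(x))
x_back = mx.fft.ifft(mx.fft.ifftshift(spectrum))
print(mx.allclose(x, x_back.real, atol=1e-5))
```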
Performance
- Optimize complex matrix multiplication using Karatsuba's algorithm (sketch below)
- Much faster 1D conv
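For context, the Karatsuba trick evaluates a complex product with three real multiplications instead of four: with A = Ar + i Ai and B = Br + i Bi, it forms t1 = Ar Br, t2 = Ai Bi, t3 = (Ar + Ai)(Br + Bi) and returns (t1 - t2) + i (t3 - t1 - t2). Below is a minimal sketch of the algebra, not MLX's internal kernel, and it assumes the new complex normal sampling is exposed via `dtype=mx.complex64`:

```python
import mlx.core as mx

def complex_matmul_karatsuba(a: mx.array, b: mx.array):
    # Three real matmuls instead of four.
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    t1 = ar @ br
    t2 = ai @ bi
    t3 = (ar + ai) @ (br + bi)
    return t1 - t2, t3 - t1 - t2  # (real part, imaginary part)

# dtype=mx.complex64 is an assumed way to request the new complex normal.
a = mx.random.normal((16, 16), dtype=mx.complex64)
b = mx.random.normal((16, 16), dtype=mx.complex64)
re, im = complex_matmul_karatsuba(a, b)
c = a @ b
print(mx.allclose(re, c.real, atol=1e-3), mx.allclose(im, c.imag, atol=1e-3))
```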
CUDA
- Generalize gpu backend
- Use fallbacks in fast primitives when `eval_gpu` is not implemented
- Add memory cache to CUDA backend
- Do not check `event.is_signaled()` in `eval_impl`
- Build for compute capability 70 instead of 75 in CUDA backend
- CUDA backend: backbone
Bug Fixes
- Fix out-of-bounds default value in logsumexp/softmax
- Include `mlx::core::version()` symbols in the mlx static library
- Fix nearest upsample
- Fix large arg reduce
- Fix conv grad
- Fix some complex vjps
- Fix typo in row_reduce_small
- Fix `put_along_axis` for empty arrays
- Close a couple of edge case bugs: `hadamard` and `addmm` on empty inputs
- Fix fft for integer overflow with large batches
- Fix `conv_general` differences between GPU and CPU
- Fix batched vector sdpa
- GPU Hadamard for large N
- Improve bandwidth for elementwise ops
- Fix compile merging
- Fix shapeless export to throw on dim mismatch
- Fix `mx.linalg.pinv` for singular matrices
- Fix shift operations
- Fix integer overflow in qmm
Contributors
Thanks to some awesome contributors!
@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1