Releases · ml-explore/mlx
v0.15.1
v0.15.0
Highlights
- Fast Metal GPU FFTs
  - On average ~30x faster than CPU
  - More benchmarks
- `mx.distributed` with `all_sum` and `all_gather`
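
A minimal sketch of the new distributed primitives, assuming an MPI launch (e.g. `mpirun -np 2 python script.py`):

```python
import mlx.core as mx

world = mx.distributed.init()            # uses MPI when available
x = mx.ones((4,)) * world.rank()

summed = mx.distributed.all_sum(x)       # element-wise sum across processes
gathered = mx.distributed.all_gather(x)  # concatenate x from every process

print(world.rank(), summed, gathered.shape)
```
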
Core
- Added the dlpack device `__dlpack_device__`
- Fast GPU FFTs benchmarks
- Added docs for `mx.distributed`
- Added the `mx.view` op
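
A quick sketch exercising two of the items above: a GPU FFT and the new `mx.view` op, which reinterprets an array's bytes as another dtype:

```python
import mlx.core as mx

x = mx.random.normal((1024,))
spectrum = mx.fft.rfft(x)  # FFTs now run on the GPU (the default device)
mx.eval(spectrum)

# Reinterpret the bits of a float32 as uint32 without copying
bits = mx.view(mx.array([1.0], dtype=mx.float32), mx.uint32)
print(spectrum.shape, bits)
```
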
NN
- `softmin`, `hardshrink`, and `hardtanh` activations
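
A small sketch of the new activations, assuming they are exposed as functions under `mlx.nn` with the names listed above:

```python
import mlx.core as mx
import mlx.nn as nn

x = mx.array([-2.0, -0.3, 0.0, 0.3, 2.0])

print(nn.softmin(x))     # equivalent to softmax of -x
print(nn.hardshrink(x))  # zeroes entries with |x| <= 0.5 (default lambda)
print(nn.hardtanh(x))    # clips to [-1, 1] by default
```
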
Bugfixes
- Fix broadcast bug in bitwise ops
- Allow more buffers for JIT compilation
- Fix matvec vector stride bug
- Fix multi-block sort stride management
- Stable cumprod grad at 0
- Bug fix for a race condition in scan
v0.14.1
v0.14.0
Highlights
- Small-size build that JIT compiles kernels and omits the CPU backend, resulting in a binary under 4 MB
- `mx.gather_qmm`: a quantized equivalent of `mx.gather_mm` that speeds up MoE inference by ~2x
- Grouped 2D convolutions (see the sketch below)
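
A hedged sketch of the grouped-convolution highlight via `mx.conv2d`; the `groups` keyword and the channels-last layout follow MLX's conv conventions:

```python
import mlx.core as mx

# Channels-last: input (N, H, W, C_in), weight (C_out, kH, kW, C_in // groups)
x = mx.random.normal((1, 8, 8, 16))
w = mx.random.normal((32, 3, 3, 4))  # 4 groups: 16 // 4 = 4 channels each
y = mx.conv2d(x, w, groups=4)
print(y.shape)  # (1, 6, 6, 32)
```
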
Core
- `mx.conjugate`
- `mx.conv3d` and `nn.Conv3d`
- List-based indexing
- Started `mx.distributed`, which uses MPI (if installed) for communication across machines:
  - `mx.distributed.init`
  - `mx.distributed.all_gather`
  - `mx.distributed.all_reduce_sum`
- Support conversion to and from dlpack
- `mx.linalg.cholesky` on CPU
- `mx.quantized_matmul` sped up for vector-matrix products
- `mx.trace`
- `mx.block_masked_mm` now supports floating point masks!
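
A short sketch of the dlpack interchange and the CPU Cholesky; NumPy >= 1.22 (for `np.from_dlpack`) is assumed:

```python
import numpy as np
import mlx.core as mx

a = mx.arange(6.0).reshape(2, 3)
na = np.from_dlpack(a)  # MLX -> NumPy via the dlpack protocol
back = mx.array(na)     # and back to MLX

m = mx.array([[4.0, 2.0], [2.0, 3.0]])
l = mx.linalg.cholesky(m, stream=mx.cpu)  # CPU only in this release
print(na.shape, l)
```
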
Fixes
- Error messaging in eval
- Add some missing docs
- Scatter index bug
- The extensions example now compiles and runs
- CPU copy bug with many dimensions
v0.13.1
v0.13.0
Highlights
- Block sparse matrix multiply speeds up MoEs by >2x
- Improved quantization algorithm should work well for all networks
- Improved GPU command submission speeds up training and inference
Core
- Bitwise ops added: `mx.bitwise_[or|and|xor]`, `mx.[left|right]_shift`, and operator overloads (see the sketch after this list)
- Groups added to `Conv1d`
- Added `mx.metal.device_info` to get better-informed memory limits
- Added resettable memory stats
- `mlx.optimizers.clip_grad_norm` and `mlx.utils.tree_reduce` added
- Added `mx.arctan2`
- Unary ops now accept array-like inputs, i.e. one can do `mx.sqrt(2)`
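
A minimal sketch of the bitwise ops, `mx.arctan2`, and the array-like unary inputs noted above:

```python
import mlx.core as mx

a = mx.array([0b1100, 0b1010])
b = mx.array([0b0110, 0b0101])

print(mx.bitwise_and(a, b), a & b)  # the op and its operator overload
print(mx.left_shift(a, 1), a << 1)

print(mx.arctan2(mx.array(1.0), mx.array(1.0)))  # pi / 4
print(mx.sqrt(2))  # unary ops now accept plain scalars
```
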
Bugfixes
- Fixed shape for slice update
- Bugfix in quantize that used slightly wrong scales/biases
- Fixed memory leak for multi-output primitives encountered with gradient checkpointing
- Fixed conversion from other frameworks for all datatypes
- Fixed index overflow for matmul with large batch size
- Fixed initialization ordering that occasionally caused segfaults
v0.12.2
v0.12.0
Highlights
- Faster quantized matmul
  - Up to 40% faster QLoRA or prompt processing, some numbers
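
A hedged sketch of the quantized matmul path; the default `group_size=64`, `bits=4` parameters are assumed:

```python
import mlx.core as mx

w = mx.random.normal((256, 256))
w_q, scales, biases = mx.quantize(w)  # 4-bit, group size 64 by default

x = mx.random.normal((1, 256))
y = mx.quantized_matmul(x, w_q, scales, biases, transpose=True)
print(y.shape)  # (1, 256)
```
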
Core
- `mx.synchronize` to wait for computation dispatched with `mx.async_eval`
- `mx.radians` and `mx.degrees`
- `mx.metal.clear_cache` to return to the OS the memory held by MLX as a cache for future allocations
- Change quantization to always represent 0 exactly (relevant issue)
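
A minimal sketch of how these fit together: `mx.async_eval` dispatches, `mx.synchronize` waits, and `mx.metal.clear_cache` releases the allocator cache:

```python
import mlx.core as mx

x = mx.random.normal((2048, 2048))
y = (x @ x).sum()

mx.async_eval(y)        # dispatch the computation without blocking
# ... overlap other Python-side work here ...
mx.synchronize()        # wait for everything dispatched so far

mx.metal.clear_cache()  # return MLX's cached buffers to the OS
print(y.item())
```
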
Bugfixes
- Fixed quantization of a block with all 0s that produced NaNs
- Fixed the `len` field in the buffer protocol implementation
v0.11.0
v0.10.0
Highlights
- Improvements for LLM generation
  - Reshapeless quant matmul/matvec
  - `mx.async_eval`
  - Async command encoding
Core
- Slightly faster reshapeless quantized gemms
- Option for precise softmax (see the sketch after this list)
- `mx.metal.start_capture` and `mx.metal.stop_capture` for GPU debug/profile
- `mx.expm1`
- `mx.std`
- `mx.meshgrid`
- `mx.random.multivariate_normal` (CPU only)
- `mx.cumsum` (and other scans) for `bfloat`
- Async command encoder with explicit barriers / dependency management
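
A hedged sketch of a GPU capture wrapped around a precise softmax; capturing is assumed to require `MTL_CAPTURE_ENABLED=1` in the environment and a trace path that does not already exist:

```python
import mlx.core as mx

mx.metal.start_capture("mlx_trace.gputrace")  # open a Metal capture file

x = mx.random.normal((512, 512))
y = mx.softmax(x, axis=-1, precise=True)  # the new precise-softmax option
mx.eval(y)

mx.metal.stop_capture()  # finish the trace for inspection in Xcode
```
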
NN
- `nn.Upsample` supports bicubic interpolation
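
A small sketch of bicubic upsampling; the mode name `"cubic"` is an assumption:

```python
import mlx.core as mx
import mlx.nn as nn

x = mx.random.normal((1, 4, 4, 3))              # channels-last: (N, H, W, C)
up = nn.Upsample(scale_factor=2, mode="cubic")  # mode name assumed
print(up(x).shape)                              # (1, 8, 8, 3)
```
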
Misc
- Updated MLX Extension to work with nanobind
Bugfixes
- Fix buffer donation in softmax and fast ops
- Bug in layer norm vjp
- Bug initializing from lists with scalar
- Bug in indexing
- CPU compilation bug
- Multi-output compilation bug
- Fix stack overflow issues in eval and array destruction