Releases: ml-explore/mlx
v0.18.1
v0.18.0
Highlights
- Speed improvements:
  - Up to 2x faster I/O
  - Faster transposed copies, unary, and binary ops
- Transposed convolutions
- Improvements to mx.distributed (send/recv/average_gradients); see the sketch below
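A minimal point-to-point sketch of what the send/recv improvements touch. The exact argument orders of mx.distributed.send and mx.distributed.recv_like are assumptions here; check the distributed docs before relying on them.

```python
import mlx.core as mx

# Hedged sketch: assumes mx.distributed.send(x, dst) and
# mx.distributed.recv_like(x, src), and that the results must be
# evaluated for the communication to actually happen.
group = mx.distributed.init()
x = mx.ones((4,))

if group.rank() == 0:
    mx.eval(mx.distributed.send(x, 1))    # ship x to rank 1
elif group.rank() == 1:
    y = mx.distributed.recv_like(x, 0)    # receive an array shaped like x from rank 0
    mx.eval(y)
```

With the MPI backend this would typically be launched via something like `mpirun -np 2 python example.py`.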
Core
- New features:
  - mx.conv_transpose{1,2,3}d
  - Allow mx.take to work with an integer index
  - Add std as a method on mx.array
  - mx.put_along_axis (example after this section)
  - mx.cross_product
  - int() and float() work on scalar mx.array
  - Add optional headers to mx.fast.metal_kernel
  - mx.distributed.send and mx.distributed.recv
  - mx.linalg.pinv
- Performance:
  - Up to 2x faster I/O
  - Much faster CPU convolutions
  - Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
  - Put reduction ops in the default stream with async for faster comms
  - Overhead reductions in mx.fast.metal_kernel
  - Improve donation heuristics to reduce memory use
- Misc:
  - Support Xcode 16
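A quick, hedged illustration of a few of the new core additions above. The put_along_axis signature is assumed to mirror NumPy's (array, indices, values, axis) and to return a new array rather than mutating in place.

```python
import mlx.core as mx

a = mx.zeros((3, 4))

# Scatter ones into column 2 of every row (NumPy-style signature assumed;
# MLX is functional, so a new array is returned).
idx = mx.full((3, 1), 2, dtype=mx.int32)
b = mx.put_along_axis(a, idx, mx.ones((3, 1)), axis=1)

# mx.take now also accepts a plain integer index.
row = mx.take(b, 1, axis=0)

# std is available as an array method.
print(b.std(), row)
```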
NN
- Faster RNN layers
- nn.ConvTranspose{1,2,3}d (see the sketch below)
- mlx.nn.average_gradients data-parallel helper for distributed training
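A hedged sketch of the new NN additions. It assumes nn.ConvTranspose2d follows nn.Conv2d's channels-last (NHWC) layout and constructor arguments, and that average_gradients takes a gradient tree and averages it across the distributed group.

```python
import mlx.core as mx
import mlx.nn as nn

# Assumes nn.ConvTranspose2d mirrors nn.Conv2d: channels-last input
# (N, H, W, C) and (in_channels, out_channels, kernel_size, ...) arguments.
up = nn.ConvTranspose2d(in_channels=8, out_channels=4, kernel_size=3, stride=2)
x = mx.random.normal((1, 16, 16, 8))
y = up(x)
print(y.shape)

# Illustrative data-parallel step: average gradients across the
# distributed group before the optimizer update.
# grads = nn.average_gradients(grads)
```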
Bug Fixes
- Fix boolean all reduce bug
- Fix extension metal library finding
- Fix ternary for large arrays
- Make eval just wait if all arrays are scheduled
- Fix CPU softmax by removing redundant coefficient in neon_fast_exp
- Fix JIT reductions
- Fix overflow in quantize/dequantize
- Fix compile with byte sized constants
- Fix copy in the sort primitive
- Fix reduce edge case
- Fix slice data size
- Throw for certain cases of non captured inputs in compile
- Fix copying scalars by adding fill_gpu
- Fix bug in module attribute set, reset, set
- Ensure io/comm streams are active before eval
- Fix mx.clip
- Override class function in Repr so mx.array is not confused with array.array
- Avoid using find_library to make the install truly portable
- Remove fmt dependencies from MLX install
- Fix for partition VJP
- Avoid command buffer timeout for IO on large arrays
v0.17.3
v0.17.1
v0.17.0
Highlights
- mx.einsum
- Big speedups in reductions
- 2x faster model loading
- mx.fast.metal_kernel for custom GPU kernels (see the sketch below)
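For reference, mx.einsum follows the NumPy convention, e.g. mx.einsum("ij,jk->ik", a, b). The custom-kernel API looks roughly like the sketch below, adapted from the pattern in the MLX docs; double-check keyword names against the documentation.

```python
import mlx.core as mx

# Only the kernel body is written; MLX generates the [[kernel]] signature
# from the declared input and output names.
source = """
    uint elem = thread_position_in_grid.x;
    T tmp = inp[elem];
    out[elem] = metal::exp(tmp);
"""

kernel = mx.fast.metal_kernel(
    name="myexp",
    input_names=["inp"],
    output_names=["out"],
    source=source,
)

a = mx.random.normal((4096,))
outputs = kernel(
    inputs=[a],
    template=[("T", mx.float32)],
    grid=(a.size, 1, 1),
    threadgroup=(256, 1, 1),
    output_shapes=[a.shape],
    output_dtypes=[a.dtype],
)
out = outputs[0]
```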
Core
- Faster program exits
- Laplace sampling
- mx.nan_to_num (example after this list)
- nn.tanh gelu approximation
- Fused GPU quantization ops
- Faster group norm
- bf16 winograd conv
- vmap support for mx.scatter
- mx.pad "edge" padding
- More numerically stable mx.var
- mx.linalg.cholesky_inv / mx.linalg.tri_inv
- mx.isfinite
- Complex mx.sign now mirrors NumPy 2.0 behaviour
- More flexible mx.fast.rope
- Update to nanobind 2.1
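Two of the smaller additions, sketched with NumPy-style signatures assumed for mx.nan_to_num and for the mode argument of mx.pad.

```python
import mlx.core as mx

x = mx.array([1.0, float("nan"), float("inf")])
print(mx.nan_to_num(x, nan=0.0))   # NaN replaced; infinities clamped to finite values

# "edge" mode repeats the border values instead of padding with a constant.
y = mx.arange(4).reshape(2, 2)
print(mx.pad(y, 1, mode="edge"))
```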
Bug Fixes
- gguf zero initialization
- expm1f overflow handling
- bfloat16 hadamard
- large arrays for various ops
- rope fix
- bf16 array creation
- preserve dtype in nn.Dropout
- nn.TransformerEncoder with norm_first=False
- excess copies from contiguity bug
v0.16.3
v0.16.2
v0.16.1
v0.16.0
Highlights
- @mx.custom_function for custom vjp/jvp/vmap transforms (see the sketch below)
- Up to 2x faster Metal GEMV and fast masked GEMV
- Fast hadamard_transform
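A minimal sketch of the custom transform hook, following the pattern in the MLX docs: decorate the function, then register a hand-written VJP instead of letting MLX differentiate through the forward pass.

```python
import mlx.core as mx

@mx.custom_function
def scaled_sin(x, scale):
    return scale * mx.sin(x)

# One cotangent is returned per primal input.
@scaled_sin.vjp
def scaled_sin_vjp(primals, cotangent, output):
    x, scale = primals
    return cotangent * scale * mx.cos(x), (cotangent * mx.sin(x)).sum()

g = mx.grad(scaled_sin)(mx.array(0.5), mx.array(2.0))
print(g)
```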
Core
- Metal 3.2 support
- Reduced CPU binary size
- Added quantized GPU ops to JIT
- Faster GPU compilation
- Added grads for bitwise ops and indexing (see the sketch below)
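A tiny illustration of differentiating through indexing; this is a sketch of the kind of case the note presumably covers.

```python
import mlx.core as mx

def f(x):
    # Take every other element before reducing; the gradient flows back
    # only to the indexed positions.
    return (x[::2] ** 2).sum()

g = mx.grad(f)(mx.arange(6, dtype=mx.float32))
print(g)  # 2*x at even positions, 0 elsewhere
```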
Bug Fixes
- 1D scatter bug
- Strided sort bug
- Reshape copy bug
- Seg fault in mx.compile
- Donation condition in compilation
- Compilation of Accelerate on iOS