Release v0.31.2 · ml-explore/mlx

Highlights

Wider support for CUDA quantized matmuls (#3352, #3268, #3321, #3417, #3255)
MLX can be used by multiple threads for independent computations (#3405, #3348, #3281, #3423)
Added CUDA FFT support
JACCL is now a standalone lib (#3412)
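
The multi-threading highlight rests on changes such as giving each thread its own default stream (#3281) and making the command encoder thread local (#3348). The sketch below illustrates only the general idea of a per-thread default, using Python's standard `threading.local`. It is not MLX code: the `Stream` class and the `default_stream` function are hypothetical stand-ins invented for this illustration.

```python
import threading


class Stream:
    """Hypothetical stand-in for a compute stream; not an MLX type."""

    _next_index = 0
    _lock = threading.Lock()

    def __init__(self):
        # Hand out a unique index per stream, guarded by a lock.
        with Stream._lock:
            self.index = Stream._next_index
            Stream._next_index += 1


# Attributes set on this object are visible only to the thread that set them.
_local = threading.local()


def default_stream():
    """Return the calling thread's default stream, creating it on first use."""
    stream = getattr(_local, "stream", None)
    if stream is None:
        stream = Stream()
        _local.stream = stream
    return stream


def worker(results, slot):
    # Two lookups in the same thread yield the same stream object.
    first = default_stream()
    second = default_stream()
    results[slot] = (first.index, first is second)


results = [None] * 4
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# One distinct stream index per thread.
stream_indices = [index for index, _ in results]
```

Because no thread ever sees another thread's default, independent computations do not contend for a shared stream object, which is the property the highlight describes.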

What's Changed

Bump by @angeloskath in #3244
win: re-enable and fix cuDNN performance by @dhiltgen in #3242
Fix crashes in multi-threaded process teardown by @louen in #3167
[CUDA] Add FFT support by @lucasnewman in #3243
[CUDA] Implement MaskedScatter by @Lyxot in #3151
docs: fix PyTorch to MLX conversion example by @LxYuan0420 in #3265
update requirements for Macbook Neo by @tosh in #3257
fix comparison op JVP returning bool tangents instead of input dtype by @mm65x in #3253
fix nn.GRU skipping bhn bias when hidden is None by @mm65x in #3252
[CUDA] Pipelined QMM by @zcbenz in #3255
tests: harden memory leak check in test_siblings_without_eval by @booxter in #3088
Slice update with operation by @angeloskath in #3266
Nax Refactor by @jagrit06 in #3271
Fix building with CUDA toolkit 13.2 by @zcbenz in #3273
[CUDA] fp and int4 quants for qmm_sm80 by @zcbenz in #3268
Fix repr of conv layers by @angeloskath in #3275
Merge DeviceStream into CommandEncoder by @zcbenz in #3264
[CUDA] Search system-installed CUDA toolkit for headers by @zcbenz in #3277
Create default random key lazily by @zcbenz in #3278
Support indexing with any type which implements __index__ by @aisk in #3210
Fix sort NaN handling for float16 and bfloat16 by @Lyxot in #3269
Use thread local storage for frontend compile cache by @zcbenz in #3280
[Metal][Performance]: Add split-K for quantized matmul (small M) by @Ziqiao-git in #3120
[Metal] Fix depthwise conv 1D kernel name for large variant by @Brooooooklyn in #3289
Fix stale transform copy-chain leaks by @Brooooooklyn in #3290
Implement Pad::vmap to replace NYI stub by @Aristide021 in #3304
logo files by @andresy in #3308
Fix vmap + floor_divide: preserve integer dtype by @robert-johansson in #3292
Fix moved-from shape bug in broadcast_arrays causing vmap bus error by @Aristide021 in #3310
Use nb::ndarray for checking arrays by @zcbenz in #3283
Add output_shapes for AddMM by @pHequals7 in #3262
Manage Metal objects with smart pointers by @zcbenz in #3282
[CUDA] support sorting complex numbers by @Lyxot in #3286
Add norm parameter to FFT transforms (backward/ortho/forward) by @Aristide021 in #3287
Make each thread have its own default stream by @zcbenz in #3281
[CUDA] Implement BlockMaskedMM by @Lyxot in #3299
Fix np bfloat16 misinterpreted as complex by @kellen-sun in #3146
Remove no longer needed const_cast by @zcbenz in #3325
Bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #3334
Fix use after move by @angeloskath in #3343
Decouple CommandEncoder from Device by @zcbenz in #3316
Add vmap for BroadcastAxes by @angeloskath in #3344
Add fftfreq, rfftfreq and scalar axes for fftshift/ifftshift by @declanhealy2 in #3298
[Metal] Support sorting complex numbers by @Lyxot in #3314
[CUDA] Fallback QMM by @zcbenz in #3315
Make CommandEncoder thread local by @zcbenz in #3348
[CUDA] 3/5/6-bit quants for qmm_naive by @zcbenz in #3352
Fix regression in array creation by @angeloskath in #3353
Use metal as the front-end for the metal linker by @louen in #3354
Add printoptions by @ChristophePRAT in #3333
Add a convenience for making local streams in python by @angeloskath in #3355
Fix CMake finding wrong Python during pip install by @fijimunkii in #3375
[CUDA] Add GatherQMM for quantized gather matmul by @Lyxot in #3321
fix: fail build when Metal compiler header resolution fails by @dogukanveziroglu in #3332
Fix: Correct cross-attention query routing in Post-LN TransformerDecoderLayer by @suryawanshishantanu6 in #3382
[CUDA] Thread safety by @zcbenz in #3367
Fix test "test get streams" missing initialization by @dseredkin in #3376
Conjugate VJP and JVP support by @CameronChurchwell in #3386
Fix int16 overflow in SDPA NAX mask indexing for KV sequences > 32K by @Clydingus in #3361
Avoid joining threads on exit by @zcbenz in #3388
Add clear_streams API for cleanup before exit by @zcbenz in #3395
Update nanobind version to v2.12.0 by @jrp2014 in #3396
Jaccl refactor by @angeloskath in #3412
Fixes for CUDA CI by @zcbenz in #3413
Validate safetensors data offsets by @MillaFleurs in #3364
Validate safetensors data offsets against file boundaries by @matinsaurralde in #3410
Document sort stability and NaN handling by @NeuralNoble in #3400
ThreadLocalStream in C++ by @zcbenz in #3405
Fix jaccl init bug by @angeloskath in #3418
Segmented mm nax kernel by @angeloskath in #3419
[CUDA] gather_mm by @zcbenz in #3414
[CUDA] GatherQMM matrix-matrix sm80/naive path by @Lyxot in #3417
[CUDA] Handle residue k in qmm_naive by @zcbenz in #3379
Speed up NAX split-K by better tuning and routing and fix NAX addmm by @angeloskath in #3422
Make Scheduler::enqueue thread safe by @zcbenz in #3423
Fix flaky TestVmap.test_vmap_masked_scatter by @zcbenz in #3421
Fix synchronize for ThreadLocalStream by @angeloskath in #3429
Fix bytes_per_key truncation in random kernels (Metal + CUDA) by @dogukanveziroglu in #3432
Throw meaningful error when Metal device is not found by @dogukanveziroglu in #3428
Fix kernel cache collision in Compiled constructor by @dogukanveziroglu in #3427
Fix mx.prod vjp for complex types by @CameronChurchwell in #3433
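
Two of the entries above concern FFT conventions: the `norm` parameter with the values backward, ortho and forward (#3287), and `fftfreq` (#3298). The sketch below shows what these names mean under the NumPy conventions, which this example assumes MLX follows. It is a naive O(n²) DFT written in pure Python for illustration only; the `dft` and `fftfreq` functions here are local helpers, not the MLX API.

```python
import cmath
import math


def dft(x, inverse=False, norm="backward"):
    """Naive discrete Fourier transform with NumPy-style normalization."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]
    if norm == "backward":
        # Unscaled forward transform, 1/n on the inverse.
        scale = 1 / n if inverse else 1.0
    elif norm == "ortho":
        # 1/sqrt(n) in both directions.
        scale = 1 / math.sqrt(n)
    elif norm == "forward":
        # 1/n on the forward transform, unscaled inverse.
        scale = 1.0 if inverse else 1 / n
    else:
        raise ValueError(f"unknown norm: {norm!r}")
    return [value * scale for value in out]


def fftfreq(n, d=1.0):
    """Sample frequencies for an n-point transform with sample spacing d."""
    positive = list(range((n - 1) // 2 + 1))
    negative = list(range(-(n // 2), 0))
    return [k / (n * d) for k in positive + negative]


signal = [1.0, 2.0, 0.0, -1.0]

# The zero-frequency term is the sum of the samples times the forward scale.
dc_terms = {norm: dft(signal, norm=norm)[0] for norm in ("backward", "ortho", "forward")}

# Applying the inverse with the same norm recovers the input.
round_trips = {
    norm: dft(dft(signal, norm=norm), inverse=True, norm=norm)
    for norm in ("backward", "ortho", "forward")
}
```

Whichever mode is chosen, the forward and inverse scale factors multiply to 1/n, so a round trip with a matching `norm` returns the original signal.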

New Contributors

@LxYuan0420 made their first contribution in #3265
@tosh made their first contribution in #3257
@mm65x made their first contribution in #3253
@booxter made their first contribution in #3088
@Ziqiao-git made their first contribution in #3120
@Brooooooklyn made their first contribution in #3289
@Aristide021 made their first contribution in #3304
@pHequals7 made their first contribution in #3262
@declanhealy2 made their first contribution in #3298
@fijimunkii made their first contribution in #3375
@dogukanveziroglu made their first contribution in #3332
@suryawanshishantanu6 made their first contribution in #3382
@dseredkin made their first contribution in #3376
@CameronChurchwell made their first contribution in #3386
@Clydingus made their first contribution in #3361
@jrp2014 made their first contribution in #3396
@matinsaurralde made their first contribution in #3410
@NeuralNoble made their first contribution in #3400

Full Changelog: v0.31.1...v0.31.2
