Highlights
- Wider support for CUDA quantized matmuls (#3352, #3268, #3321, #3417, #3255)
- MLX can be used by multiple threads for independent computations (#3405, #3348, #3281, #3423)
- Added CUDA FFT support
- JACCL is now a standalone lib (#3412)
What's Changed
- Bump by @angeloskath in #3244
- win: re-enable and fix cuDNN performance by @dhiltgen in #3242
- Fix crashes in multi-threaded process teardown by @louen in #3167
- [CUDA] Add FFT support by @lucasnewman in #3243
- [CUDA] Implement MaskedScatter by @Lyxot in #3151
- docs: fix PyTorch to MLX conversion example by @LxYuan0420 in #3265
- update requirements for Macbook Neo by @tosh in #3257
- fix comparison op JVP returning bool tangents instead of input dtype by @mm65x in #3253
- fix nn.GRU skipping bhn bias when hidden is None by @mm65x in #3252
- [CUDA] Pipelined QMM by @zcbenz in #3255
- tests: harden memory leak check in test_siblings_without_eval by @booxter in #3088
- Slice update with operation by @angeloskath in #3266
- Nax Refactor by @jagrit06 in #3271
- Fix building with CUDA toolkit 13.2 by @zcbenz in #3273
- [CUDA] fp and int4 quants for qmm_sm80 by @zcbenz in #3268
- Fix repr of conv layers by @angeloskath in #3275
- Merge DeviceStream into CommandEncoder by @zcbenz in #3264
- [CUDA] Search system-installed CUDA toolkit for headers by @zcbenz in #3277
- Create default random key lazily by @zcbenz in #3278
- Support indexing with any type that implements `__index__` by @aisk in #3210
- Fix sort NaN handling for float16 and bfloat16 by @Lyxot in #3269
- Use thread local storage for frontend compile cache by @zcbenz in #3280
- [Metal][Performance]: Add split-K for quantized matmul (small M) by @Ziqiao-git in #3120
- [Metal] Fix depthwise conv 1D kernel name for large variant by @Brooooooklyn in #3289
- Fix stale transform copy-chain leaks by @Brooooooklyn in #3290
- Implement Pad::vmap to replace NYI stub by @Aristide021 in #3304
- logo files by @andresy in #3308
- Fix vmap + floor_divide: preserve integer dtype by @robert-johansson in #3292
- Fix moved-from shape bug in broadcast_arrays causing vmap bus error by @Aristide021 in #3310
- Use nb::ndarray for checking arrays by @zcbenz in #3283
- Add output_shapes for AddMM by @pHequals7 in #3262
- Manage Metal objects with smart pointers by @zcbenz in #3282
- [CUDA] support sorting complex numbers by @Lyxot in #3286
- Add norm parameter to FFT transforms (backward/ortho/forward) by @Aristide021 in #3287
- Make each thread have its own default stream by @zcbenz in #3281
- [CUDA] Implement BlockMaskedMM by @Lyxot in #3299
- Fix np bfloat16 misinterpreted as complex by @kellen-sun in #3146
- Remove no longer needed const_cast by @zcbenz in #3325
- Bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #3334
- Fix use after move by @angeloskath in #3343
- Decouple CommandEncoder from Device by @zcbenz in #3316
- Add vmap for BroadcastAxes by @angeloskath in #3344
- Add fftfreq, rfftfreq and scalar axes for fftshift/ifftshift by @declanhealy2 in #3298
- [Metal] Support sorting complex numbers by @Lyxot in #3314
- [CUDA] Fallback QMM by @zcbenz in #3315
- Make CommandEncoder thread local by @zcbenz in #3348
- [CUDA] 3/5/6-bit quants for qmm_naive by @zcbenz in #3352
- Fix regression in array creation by @angeloskath in #3353
- Use `metal` as the front-end for the metal linker by @louen in #3354
- Add printoptions by @ChristophePRAT in #3333
- Add a convenience for making local streams in python by @angeloskath in #3355
- Fix CMake finding wrong Python during pip install by @fijimunkii in #3375
- [CUDA] Add GatherQMM for quantized gather matmul by @Lyxot in #3321
- fix: fail build when Metal compiler header resolution fails by @dogukanveziroglu in #3332
- Fix: Correct cross-attention query routing in Post-LN TransformerDecoderLayer by @suryawanshishantanu6 in #3382
- [CUDA] Thread safety by @zcbenz in #3367
- Fix test "test get streams" missing initialization by @dseredkin in #3376
- Conjugate VJP and JVP support by @CameronChurchwell in #3386
- Fix int16 overflow in SDPA NAX mask indexing for KV sequences > 32K by @Clydingus in #3361
- Avoid joining threads on exit by @zcbenz in #3388
- Add clear_streams API for cleanup before exit by @zcbenz in #3395
- Update nanobind version to v2.12.0 by @jrp2014 in #3396
- Jaccl refactor by @angeloskath in #3412
- Fixes for CUDA CI by @zcbenz in #3413
- Validate safetensors data offsets by @MillaFleurs in #3364
- Validate safetensors data offsets against file boundaries by @matinsaurralde in #3410
- Document sort stability and NaN handling by @NeuralNoble in #3400
- ThreadLocalStream in C++ by @zcbenz in #3405
- Fix jaccl init bug by @angeloskath in #3418
- Segmented mm nax kernel by @angeloskath in #3419
- [CUDA] gather_mm by @zcbenz in #3414
- [CUDA] GatherQMM matrix-matrix sm80/naive path by @Lyxot in #3417
- [CUDA] Handle residue k in qmm_naive by @zcbenz in #3379
- Speed up NAX split-K by better tuning and routing and fix NAX addmm by @angeloskath in #3422
- Make Scheduler::enqueue thread safe by @zcbenz in #3423
- Fix flaky TestVmap.test_vmap_masked_scatter by @zcbenz in #3421
- Fix synchronize for ThreadLocalStream by @angeloskath in #3429
- Fix bytes_per_key truncation in random kernels (Metal + CUDA) by @dogukanveziroglu in #3432
- Throw meaningful error when Metal device is not found by @dogukanveziroglu in #3428
- Fix kernel cache collision in Compiled constructor by @dogukanveziroglu in #3427
- Fix mx.prod vjp for complex types by @CameronChurchwell in #3433
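
The indexing entry above (#3210) refers to Python's standard `__index__` protocol. As a minimal sketch of what that protocol means, the example below uses a plain Python list rather than an MLX array, because these notes do not spell out the exact MLX behavior. `LayerId` is a hypothetical class made up for illustration and is not part of MLX.

```python
import operator


class LayerId:
    """Hypothetical integer-like wrapper (illustration only, not an MLX type)."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        # Python calls this whenever the object is used where an exact
        # integer index is required (subscripts, slice bounds, etc.).
        return self._value


rows = ["r0", "r1", "r2", "r3"]

# Built-in sequences accept any object that defines __index__.
picked = rows[LayerId(2)]

# Slice bounds go through the same protocol.
window = rows[LayerId(1):LayerId(3)]

# operator.index() is the explicit way to request the conversion.
as_int = operator.index(LayerId(3))
```

Assuming the change works as its title describes, an object like `LayerId(2)` should be usable as an index into an `mx.array` in the same way it is used with the list here.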
New Contributors
- @LxYuan0420 made their first contribution in #3265
- @tosh made their first contribution in #3257
- @mm65x made their first contribution in #3253
- @booxter made their first contribution in #3088
- @Ziqiao-git made their first contribution in #3120
- @Brooooooklyn made their first contribution in #3289
- @Aristide021 made their first contribution in #3304
- @pHequals7 made their first contribution in #3262
- @declanhealy2 made their first contribution in #3298
- @fijimunkii made their first contribution in #3375
- @dogukanveziroglu made their first contribution in #3332
- @suryawanshishantanu6 made their first contribution in #3382
- @dseredkin made their first contribution in #3376
- @CameronChurchwell made their first contribution in #3386
- @Clydingus made their first contribution in #3361
- @jrp2014 made their first contribution in #3396
- @matinsaurralde made their first contribution in #3410
- @NeuralNoble made their first contribution in #3400
Full Changelog: v0.31.1...v0.31.2