Highlights
- Support for Neural Accelerators on M5 (macOS >= 26.2)
What's Changed
- Fix AdamW weight_decay default value in docstring by @goingreen in #2557
- Fix dequantize python sig by @wrmsr in #2562
- fix copies in sdpa by @awni in #2563
- chore: Update Docs With Slice Copy Example by @krishi-saripalli in #2559
- Fixed several type annotations in the MLX stubs which degraded to Unknown/Any by @Maalvi14 in #2560
- typing: add type hints to mlx.core.array, linalg, and random by @XXXXRT666 in #2565
- Set ccache size before building by @zcbenz in #2570
- Faster fully depthwise-separable 1D conv by @awni in #2567
- Fix a few ccache cache miss by @zcbenz in #2573
- Some tweaks in cmake files by @zcbenz in #2574
- Add batch offsets for mx.fast.rope by @awni in #2564
- [CUDA] Use GEMM with epilogue instead of AddMM by @zcbenz in #2569
- [CUDA] Fix alpha not respected when using bias epilogue by @zcbenz in #2578
- Fix flaky addmm tests by @zcbenz in #2581
- Adding Relu2 by @Goekdeniz-Guelmez in #2582
- Add sdpa with sinks by @awni in #2558
- [CUDA] Set bias as input when using bias epilogue by @zcbenz in #2584
- [CUDA] Fix NCCL stub for release build by @awni in #2587
- patch bump by @awni in #2588
- Refactor code examples to use 'gelu' by @umbertomig in #2592
- Fix metal scan by @awni in #2591
- Fix typo in average_gradients function call by @umbertomig in #2594
- No copy batch rope by @awni in #2595
- Update export function example for array input by @umbertomig in #2598
- Expose mx.depends to Python by @awni in #2606
- fix: library loading for swift dynamic frameworks by @bilousoleksandr in #2568
- Detect cache thrashing in LRUCache by @zcbenz in #2600
- Lower sorted QMM gather threshold by @awni in #2609
- implement Convolution::output_shape by @josharian in #2601
- Avoid producing NaN in attention by @awni in #2608
- [CUDA] Recycle CUDA events by @zcbenz in #2604
- [CUDA] fix cudaGraphLaunch by @CC-Yeh in #2613
- Support pickling array for bfloat16 by @CC-Yeh in #2586
- New tuning for small K gemv by @jagrit06 in #2620
- Allow None input to compiled functions by @awni in #2621
- Compiled should not end in broadcast by @angeloskath in #2622
- Bump the version by @angeloskath in #2627
- [CUDA] Make CudaEvent work with multi-device by @zcbenz in #2614
- Fix incorrect path and typos by @aisk in #2630
- Fix for max block dim by @awni in #2631
- Compile now can attach arbitrary data to an entry by @angeloskath in #2634
- [CUDA] Wait for tasks in cuda by @awni in #2636
- Fix status message by @angeloskath in #2638
- fix cross entropy axis param by @awni in #2641
- Faster triu, tril, where with scalar by @awni in #2644
- [CUDA] Add a small column specialization to reduce by @angeloskath in #2642
- [CUDA] Fix flaky test by @awni in #2646
- Configure CMake to export compile_commands.json by @andportnoy in #2645
- Faster complex matmul by @CC-Yeh in #2571
- Fix compile when outputs change by @awni in #2648
- Speed up compile for node with many parents by @awni in #2649
- Fix and refactor row-reduce by @angeloskath in #2650
- [CUDA] Fix jit file cache for large kernel names by @angeloskath in #2656
- Fix all_gather vjp by @awni in #2654
- Fix fast synch when fence is waited before a command buffer is created by @awni in #2657
- Fix cumulative operations when axis=None by @aisk in #2653
- Export with callback by @awni in #2612
- bump patch by @awni in #2658
- Enable addmm low-precision cpu by @awni in #2661
- Precise sigmoid by @awni in #2659
- Debug cuda conv by @awni in #2662
- Speed up scalars part 2 by @awni in #2669
- Normalize README bullet formatting and other Markdown small fixes by @Mistobaan in #2671
- Modified sort behavior when running CPU or Metal to match NumPy/JAX by @Maalvi14 in #2667
- remove unused unary file by @awni in #2672
- Nccl timeout by @nastya236 in #2673
- suppress gcc 10.1 warnings by @awni in #2679
- patch bump by @awni in #2680
- Improved mx.split() docs by @Maalvi14 in #2689
- fix warnings showing up with -Wall by @andresy in #2692
- Einsum error msg improvement by @Maalvi14 in #2690
- optionally load metallib from framework by @davidkoski in #2702
- Fix addmm cpu for beta != 1.0 by @awni in #2699
- Add mx.median op by @awni in #2705
- bump python by @awni in #2694
- Fp8 conversion by @awni in #2686
- fix: linux-{fedora}x86_64-build by @incertum in #2707
- Add quantize/dequantize for mxfp8 and nvfp4 by @awni in #2688
- Migrate CircleCI to GitHub Actions by @madrob in #2716
- Fix KeyError for missing domain_uuid_key in Thunderbolt setup by @thechriswebb in #2682
- fix memory count bug by @awni in #2717
- Fix the order of hosts in the ring by @angeloskath in #2718
- Fix docs path by @madrob in #2719
- Use faster dequant for fp4 by @awni in #2720
- update: add linux fedora container CI - CPP build test only by @incertum in #2722
- add null check -- the bundleIdentifier is optional by @davidkoski in #2709
- Fix compile multi capture by @awni in #2678
- Set up publishing to PyPI and Test-PyPI by @madrob in #2721
- Check isnan in maximum / minimum with CPU backend by @aisk in #2652
- Fix addmm with empty matrices and beta != 1.0 by @harsh-sutariya in #2715
- skip self-hosted runners on forks by @madrob in #2730
- only build for macos 14 and up by @awni in #2731
- don't test when doing release by @awni in #2734
- Make cpu binary_op easily accessible by @angeloskath in #2733
- fix property name by @madrob in #2736
- Nccl reduce scatter, all gather by @nastya236 in #2727
- [CUDA] Reduce use of managed memory by @awni in #2725
- Shapeless support for zeros/ones_like by @CC-Yeh in #2726
- Compatibility with pip-installed openmpi by @pcuenca in #2741
- Fix release builds by @awni in #2746
- patch bump by @awni in #2750
- Fix dequantize python sig (dtype default) by @wrmsr in #2752
- remove circle by @awni in #2753
- Fix irregular_strides benchmark shape type by @wrmsr in #2754
- Linux on arm by @awni in #2751
- minor debugging for publishing by @madrob in #2739
- Export custom kernel by @awni in #2756
- Fix slice with negative strides by @awni in #2758
- [CUDA] Check CUDA error in synchronize by @zcbenz in #2757
- fix release by @awni in #2759
- [CUDA] cuDNN forward attention by @zcbenz in #2743
- Fix exporting with constants by @awni in #2769
- Separate test-linux from build-linux/cuda in GitHub Actions by @zcbenz in #2765
- [CUDA] Use arch specific targets when possible by @awni in #2771
- Fix MPI distributed tests with CUDA backend by @zcbenz in #2775
- Fix warnings with cmake 4.1 by @zcbenz in #2774
- Use ccache in GitHub Actions by @zcbenz in #2773
- [CUDA] Tune ops per buffer based on device by @awni in #2761
- fix release 2 by @awni in #2767
- Run CI for pushes by @zcbenz in #2777
- Remove pip cache in GitHub Actions by @zcbenz in #2776
- Build and test with multiple CUDA versions by @zcbenz in #2780
- Use std::optional for mask_arr arg by @zcbenz in #2763
- Do not run CPU tests in CUDA builds by @zcbenz in #2784
- Test every commit in main branch by @zcbenz in #2781
- Fix nightly build by @zcbenz in #2785
- Remove unneeded tests in nightly build by @zcbenz in #2786
- Fix building with CUDA < 12.8 by @zcbenz in #2782
- Avoid duplicate CI runs when starting a PR from upstream branch by @zcbenz in #2788
- build docs on linux by @awni in #2787
- [CUDA] cuDNN backward attention by @zcbenz in #2762
- more accurate rope fallback by @awni in #2792
- Fix version tag by @awni in #2790
- version by @awni in #2797
- Add Masked Scatter by @CC-Yeh in #2663
- Add Neural Accelerator Support by @jagrit06 in #2772
New Contributors
- @goingreen made their first contribution in #2557
- @krishi-saripalli made their first contribution in #2559
- @Maalvi14 made their first contribution in #2560
- @XXXXRT666 made their first contribution in #2565
- @umbertomig made their first contribution in #2592
- @bilousoleksandr made their first contribution in #2568
- @josharian made their first contribution in #2601
- @aisk made their first contribution in #2630
- @Mistobaan made their first contribution in #2671
- @incertum made their first contribution in #2707
- @thechriswebb made their first contribution in #2682
- @harsh-sutariya made their first contribution in #2715
- @pcuenca made their first contribution in #2741
Full Changelog: v0.29.0...v0.30.0