Releases: ml-explore/mlx
v0.30.0
Highlights
- Support for Neural Accelerators on M5 (macOS >= 26.2)
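A minimal sketch of what this looks like from user code, assuming the Neural Accelerator path is picked up transparently by existing ops (such as matmul) on supported hardware rather than through a new API; mx.metal.device_info is only used here to inspect the detected device.

```python
import mlx.core as mx

# Hedged sketch: assumes the M5 Neural Accelerators are used transparently
# by ordinary ops on supported hardware (M5, macOS >= 26.2).
print(mx.metal.device_info())  # inspect the detected GPU / architecture

a = mx.random.normal((2048, 2048), dtype=mx.bfloat16)
b = mx.random.normal((2048, 2048), dtype=mx.bfloat16)
c = a @ b
mx.eval(c)  # MLX is lazy; force the computation to run
```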
What's Changed
- Fix AdamW weight_decay default value in docstring by @goingreen in #2557
- Fix dequantize python sig by @wrmsr in #2562
- fix copies in sdpa by @awni in #2563
- chore: Update Docs With Slice Copy Example by @krishi-saripalli in #2559
- Fixed several type annotations in the MLX stubs which degraded to Unknown/Any by @Maalvi14 in #2560
- typing: add type hints to mlx.core.array, linalg, and random by @XXXXRT666 in #2565
- Set ccache size before building by @zcbenz in #2570
- Faster fully depthwise-separable 1D conv by @awni in #2567
- Fix a few ccache cache misses by @zcbenz in #2573
- Some tweaks in cmake files by @zcbenz in #2574
- Add batch offsets for mx.fast.rope by @awni in #2564
- [CUDA] Use GEMM with epilogue instead of AddMM by @zcbenz in #2569
- [CUDA] Fix alpha not respected when using bias epilogue by @zcbenz in #2578
- Fix flaky addmm tests by @zcbenz in #2581
- Adding Relu2 by @Goekdeniz-Guelmez in #2582
- Add sdpa with sinks by @awni in #2558
- [CUDA] Set bias as input when using bias epilogue by @zcbenz in #2584
- [CUDA] Fix NCCL stub for release build by @awni in #2587
- patch bump by @awni in #2588
- Refactor code examples to use 'gelu' by @umbertomig in #2592
- Fix metal scan by @awni in #2591
- Fix typo in average_gradients function call by @umbertomig in #2594
- No copy batch rope by @awni in #2595
- Update export function example for array input by @umbertomig in #2598
- Expose mx.depends to Python by @awni in #2606
- fix: library loading for swift dynamic frameworks by @bilousoleksandr in #2568
- Detect cache thrashing in LRUCache by @zcbenz in #2600
- Lower sorted QMM gather threshold by @awni in #2609
- implement Convolution::output_shape by @josharian in #2601
- Avoid producing NaN in attention by @awni in #2608
- [CUDA] Recycle CUDA events by @zcbenz in #2604
- [CUDA] fix cudaGraphLaunch by @CC-Yeh in #2613
- Support pickling array for bfloat16 by @CC-Yeh in #2586
- New tuning for small K gemv by @jagrit06 in #2620
- Allow None input to compiled functions by @awni in #2621
- Compiled should not end in broadcast by @angeloskath in #2622
- Bump the version by @angeloskath in #2627
- [CUDA] Make CudaEvent work with multi-device by @zcbenz in #2614
- Fix incorrect path and typos by @aisk in #2630
- Fix for max block dim by @awni in #2631
- Compile now can attach arbitrary data to an entry by @angeloskath in #2634
- [CUDA] Wait for tasks in cuda by @awni in #2636
- Fix status message by @angeloskath in #2638
- fix cross entropy axis param by @awni in #2641
- Faster triu, tril, where with scalar by @awni in #2644
- [CUDA] Add a small column specialization to reduce by @angeloskath in #2642
- [CUDA] Fix flaky test by @awni in #2646
- Configure CMake to export compile_commands.json by @andportnoy in #2645
- Faster complex matmul by @CC-Yeh in #2571
- Fix compile when outputs change by @awni in #2648
- Speed up compile for node with many parents by @awni in #2649
- Fix and refactor row-reduce by @angeloskath in #2650
- [CUDA] Fix jit file cache for large kernel names by @angeloskath in #2656
- Fix all_gather vjp by @awni in #2654
- Fix fast synch when fence is waited before a command buffer is created by @awni in #2657
- Fix cumulative operations when axis=None by @aisk in #2653
- Export with callback by @awni in #2612
- bump patch by @awni in #2658
- Enable addmm low-precision cpu by @awni in #2661
- Precise sigmoid by @awni in #2659
- Debug cuda conv by @awni in #2662
- Speed up scalars part 2 by @awni in #2669
- Normalize README bullet formatting and other Markdown small fixes by @Mistobaan in #2671
- Modified sort behavior when running CPU or Metal to match NumPy/JAX by @Maalvi14 in #2667
- remove unused unary file by @awni in #2672
- Nccl timeout by @nastya236 in #2673
- suppress gcc 10.1 warnings by @awni in #2679
- patch bump by @awni in #2680
- Improved mx.split() docs by @Maalvi14 in #2689
- fix warnings showing up with -Wall by @andresy in #2692
- Einsum error msg improvement by @Maalvi14 in #2690
- optionally load metallib from framework by @davidkoski in #2702
- Fix addmm cpu for beta != 1.0 by @awni in #2699
- Add mx.median op by @awni in #2705
- bump python by @awni in #2694
- Fp8 conversion by @awni in #2686
- fix: linux-{fedora}x86_64-build by @incertum in #2707
- Add quantize/dequantize for mxfp8 and nvfp4 by @awni in #2688
- Migrate CircleCI to GitHub Actions by @madrob in #2716
- Fix KeyError for missing domain_uuid_key in Thunderbolt setup by @thechriswebb in #2682
- fix memory count bug by @awni in #2717
- Fix the order of hosts in the ring by @angeloskath in #2718
- Fix docs path by @madrob in #2719
- Use faster dequant for fp4 by @awni in #2720
- update: add linux fedora container CI - CPP build test only by @incertum in #2722
- add null check -- the bundleIdentifier is optional by @davidkoski in #2709
- Fix compile multi capture by @awni in #2678
- Set up publishing to PyPI and Test-PyPI by @madrob in #2721
- Check isnan in maximum / minimum with CPU backend by @aisk in #2652
- Fix addmm with empty matrices and beta != 1.0 by @harsh-sutariya in #2715
- skip self-hosted runners on forks by @madrob in #2730
- only build for macos 14 and up by @awni in #2731
- don't test when doing release by @awni in #2734
- Make cpu binary_op easily accessible by @angeloskath in #2733
- fix property name by @madrob in #2736
- Nccl reduce scatter, all gather by @nastya236 in #2727
- [CUDA] Reduce use of managed memory by @awni in #2725
- Shapeless support for zeros/ones_like by @CC-Yeh in #2726
- Compatibility with pip-installed openmpi by @pcuenca in #2741
- Fix release builds by @awni in #2746
- patch bump by @awni in #2750
- Fix dequantize python sig (dtype default) by @wrmsr in #2752
- remove circle by @awni in #2753
- Fix irregular_strides benchmark shape type by @wrmsr in #2754
- Linux on arm by @awni ...
v0.29.4
v0.29.3
v0.29.2
v0.29.1
v0.29.0
Highlights
- Support for mxfp4 quantization (Metal, CPU)
- More performance improvements, bug fixes, and features in the CUDA backend
- mx.distributed supports the NCCL back-end for CUDA
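A hedged sketch of the new quantization mode follows; it assumes the mode keyword added in #2499 accepts "mxfp4" and that mxfp4 uses a group size of 32, and simply round-trips a weight matrix through mx.quantize / mx.dequantize.

```python
import mlx.core as mx

# Hedged sketch: assumes mx.quantize / mx.dequantize accept mode="mxfp4"
# (added in #2499) and that mxfp4 uses group_size=32. The quantized outputs
# are forwarded to mx.dequantize unchanged, so the sketch does not depend
# on their exact layout.
w = mx.random.normal((256, 256))

packed = mx.quantize(w, group_size=32, bits=4, mode="mxfp4")
w_hat = mx.dequantize(*packed, group_size=32, bits=4, mode="mxfp4")

print(mx.abs(w - w_hat).max())  # rough measure of quantization error
```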
What's Changed
- [CUDA] Optimize set_mm_device_pointers for small ndim by @zcbenz in #2473
- Fix logsumexp/softmax not fused for some cases by @zcbenz in #2474
- Use CMake <4.1 to avoid the nvpl error by @angeloskath in #2489
- Fix incorrect interpretation of unsigned dtypes in reduce ops by @abeleinin in #2477
- make code blocks copyable by @Dan-Yeh in #2480
- Rename cu::Matmul to CublasGemm by @zcbenz in #2488
- Faster general unary op by @awni in #2472
- The naive_conv_2d is no longer used by @zcbenz in #2496
- Remove the hack around SmallVector in cpu compile by @zcbenz in #2494
- Clean up code handling both std::vector and SmallVector by @zcbenz in #2493
- [CUDA] Fix conv grads with groups by @zcbenz in #2495
- Update cuDNN Frontend to v1.14 by @zcbenz in #2505
- Ensure small sort doesn't use indices if not argsort by @angeloskath in #2506
- Ensure no oob read in gemv_masked by @angeloskath in #2508
- fix custom kernel test by @awni in #2510
- No segfault with uninitialized array.at by @awni in #2514
- Fix lapack svd by @awni in #2515
- Split cuDNN helpers into a separate header by @zcbenz in #2491
- [CUDA] Add GEMM-based fallback convolution kernels by @zcbenz in #2511
- Fix docs by @russellizadi in #2518
- Fix overflow in large filter small channels by @angeloskath in #2520
- [CUDA] Fix stride of singleton dims before passing to cuDNN by @zcbenz in #2521
- Custom cuda kernel by @angeloskath in #2517
- Fix docs omission by @angeloskath in #2524
- Fix power by @awni in #2523
- NCCL backend by @nastya236 in #2476
- [CUDA] Nccl pypi dep + default for cuda by @awni in #2526
- Fix warning 186-D from nvcc by @zcbenz in #2527
- [CUDA] Update calls to cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 by @andportnoy in #2525
- nccl default for backend=any by @awni in #2528
- Fix allocation bug in NCCL by @awni in #2530
- Enable COMPILE_WARNING_AS_ERROR for linux builds in CI by @zcbenz in #2534
- [CUDA] Remove thrust in arange by @zcbenz in #2535
- Use nccl header only when nccl is not present by @awni in #2539
- Allow pathlib.Path to save/load functions by @awni in #2541
- Remove nccl install in release by @awni in #2542
- [CUDA] Implement DynamicSlice/DynamicSliceUpdate by @zcbenz in #2533
- Remove stream from average grads so it uses default by @awni in #2532
- Enable cuda graph toggle by @awni in #2545
- Tests for save/load with Path by @awni in #2543
- Run CPP tests for CUDA build in CI by @zcbenz in #2544
- Separate cpu compilation cache by versions by @zcbenz in #2548
- [CUDA] Link with nccl by @awni in #2546
- [CUDA] Use ConcurrentContext in concatenate_gpu by @zcbenz in #2549
- [CUDA] fix sort by @awni in #2550
- Add mode parameter for quantization by @awni in #2499
- Bump xcode in circle by @awni in #2551
- Fix METAL quantization in JIT + fix release build by @awni in #2553
- Faster contiguous gather for indices in the first axis by @awni in #2552
- version bump by @awni in #2554
- Fix quantized vjp for mxfp4 by @awni in #2555
New Contributors
- @Dan-Yeh made their first contribution in #2480
- @russellizadi made their first contribution in #2518
- @andportnoy made their first contribution in #2525
Full Changelog: v0.28.0...v0.29.0
v0.28.0
Highlights
- First version of fused sdpa vector for CUDA
- Convolutions in CUDA
- Speed improvements in CUDA normalization layers, softmax, compiled kernels, overheads and more
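A hedged sketch of exercising the fused SDPA op that the CUDA backend now covers with a vector (single-query, decode-time) kernel; the shapes are illustrative only.

```python
import mlx.core as mx

# Hedged sketch: shapes are (batch, heads, sequence, head_dim) and purely
# illustrative; a single query token is the case the vector kernel targets.
B, H, D, kv_len = 1, 8, 64, 128
q = mx.random.normal((B, H, 1, D))       # one query token: the vector case
k = mx.random.normal((B, H, kv_len, D))
v = mx.random.normal((B, H, kv_len, D))

out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5)
mx.eval(out)
print(out.shape)  # (1, 8, 1, 64)
```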
What's Changed
- [CUDA] Fix segfault on exit by @awni in #2424
- [CUDA] No occupancy query for launch params by @awni in #2426
- [CUDA] More sizes for gemv by @awni in #2429
- Add more CUDA architectures for PyPi package by @awni in #2427
- Use ccache in CI by @zcbenz in #2414
- [CUDA] Use aligned vector in Layer Norm and RMS norm by @awni in #2433
- Cuda faster softmax by @awni in #2435
- Remove the kernel arg from get_launch_args by @zcbenz in #2437
- Move arange to its own file by @zcbenz in #2438
- Use load_vector in arg_reduce by @zcbenz in #2439
- Make CI faster by @zcbenz in #2440
- [CUDA] Quantized refactoring by @angeloskath in #2442
- fix circular reference by @awni in #2443
- [CUDA] Fix gemv regression by @awni in #2445
- Fix wrong graph key when using concurrent context by @zcbenz in #2447
- Fix custom metal extension by @awni in #2446
- Add tests for export including control flow models and quantized models by @junpeiz in #2430
- [CUDA] Backward convolution by @zcbenz in #2431
- [CUDA] Save primitive inputs faster by @zcbenz in #2449
- [CUDA] Vectorize generated kernels by @angeloskath in #2444
- [CUDA] Matmul utils initial commit by @angeloskath in #2441
- Fix arctan2 grads by @angeloskath in #2453
- Use LRU cache for cuda graph by @zcbenz in #2448
- Add missing algorithm header to jit_compiler.cpp for Linux builds by @zamderax in #2460
- Default install cuda on linux by @awni in #2462
- fix wraps compile by @awni in #2461
- Feat: add USE_SYSTEM_FMT CMake option by @GaetanLepage in #2219
- Use SmallVector for shapes and strides by @zcbenz in #2454
- Fix install tags by @awni in #2464
- Faster gather qmm sorted test by @awni in #2463
- Fix cublas on h100 by @awni in #2466
- revert default cuda install by @awni in #2465
- feat: support a destinations based in tree flatten/unflatten by @LVivona in #2450
- Fix typo in metal command encoder by @angeloskath in #2471
- Update CUDA sdpa by @jagrit06 in #2468
- version by @awni in #2470
New Contributors
- @junpeiz made their first contribution in #2430
- @zamderax made their first contribution in #2460
- @GaetanLepage made their first contribution in #2219
- @LVivona made their first contribution in #2450
Full Changelog: v0.27.1...v0.28.0
v0.27.1
Highlights
- Initial PyPI release of the CUDA back-end.
- The CUDA back-end works well with mlx-lm:
- Reasonably fast for LLM inference
- Supports single-machine training and LoRA fine-tuning
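A hedged sketch of the mlx-lm workflow the CUDA wheels target; it assumes mlx-lm is installed alongside the CUDA build of MLX (for example via pip install "mlx[cuda]" mlx-lm) and uses a placeholder model name.

```python
# Hedged sketch: assumes mlx-lm is installed next to the CUDA build of MLX
# (e.g. `pip install "mlx[cuda]" mlx-lm`). The model repo is a placeholder.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/SomeModel-4bit")  # placeholder repo
text = generate(model, tokenizer, prompt="Hello from CUDA!", max_tokens=32)
print(text)
```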
What's Changed
- Avoid invoking allocator::malloc when creating CUDA event by @zcbenz in #2232
- Share more common code in Compiled by @zcbenz in #2240
- Avoid atomic updates across CPU/GPU in CUDA event by @zcbenz in #2231
- Perf regression fix by @angeloskath in #2243
- Add profiler annotations in common primitives for CUDA backend by @zcbenz in #2244
- Default strict mode for module update and update_modules by @awni in #2239
- Fix linux linking error by @awni in #2248
- Improve metal elementwise kernels by @awni in #2247
- CUDA backend: matmul by @zcbenz in #2241
- Change layernorms to two pass algorithm by @angeloskath in #2246
- Fix unintuitive metal kernel caching by @awni in #2242
- Refactor the lu test by @emmanuel-ferdman in #2250
- CUDA backend: unary ops by @zcbenz in #2158
- Fix export to work with gather/scatter axis by @awni in #2263
- CUDA backend: binary ops by @zcbenz in #2259
- Report number of missing parameters by @FL33TW00D in #2264
- CUDA backend: sort by @zcbenz in #2262
- CUDA backend: random by @zcbenz in #2261
- Fix conv export by @awni in #2265
- CUDA backend: copy ops by @zcbenz in #2260
- Fix building cpp benchmarks on Linux by @zcbenz in #2268
- Add load_safe to the general conv loaders by @angeloskath in #2258
- start cuda circle config by @awni in #2256
- CUDA backend: reduce by @zcbenz in #2269
- CUDA backend: argreduce by @zcbenz in #2270
- CUDA backend: softmax by @zcbenz in #2272
- CUDA backend: layernorm by @zcbenz in #2271
- Fix warnings from latest CUDA toolkit by @zcbenz in #2275
- Make sliceUpdate general by @awni in #2282
- CUDA backend: compile by @zcbenz in #2276
- [CUDA] RMSNorm and VJP by @awni in #2280
- [CUDA] Fix build by @awni in #2284
- [CUDA] ternary with select op by @awni in #2283
- CUDA backend: indexing ops by @zcbenz in #2277
- Collection of refactors by @jagrit06 in #2274
- Fix complex power and print by @awni in #2286
- fix cuda jit by @awni in #2287
- Fix cuda gemm for bf16 by @awni in #2288
- Fix cuda arg reduce by @awni in #2291
- RoPE for CUDA by @angeloskath in #2293
- Add python testing for cuda with ability to skip list of tests by @awni in #2295
- [CUDA] Fix back-end bugs and enable corresponding tests by @awni in #2296
- Cuda bug fixes 2 by @awni in #2298
- [CUDA] Divmod, Partition, and sort fixes by @awni in #2302
- [CUDA] synch properly waits for all tasks to finish and clear by @awni in #2303
- Make ptx cache settable by environment variable by @angeloskath in #2304
- Build CUDA release in Circle by @awni in #2306
- Cuda perf tuning by @awni in #2307
- Fix update_modules() when providing a subset by @angeloskath in #2308
- Compile float64 functions on CPU by @awni in #2311
- Fix get 2d grid dims by @angeloskath in #2316
- Split broadcast so it is always fused in compile by @angeloskath in #2318
- [CUDA] Fix reductions by @angeloskath in #2314
- Fix module update in strict mode by @awni in #2321
- MLX_SWITCH macros to templates by @angeloskath in #2320
- Use fp32 for testing, add more complex ops by @awni in #2322
- Patch bump by @awni in #2324
- Allow parameters to be deleted from a module by @awni in #2325
- Fix compilation error from integral_constant by @zcbenz in #2326
- [CUDA] Switch to CUDA graphs by @awni in #2317
- [CUDA] Fix graphs for older cuda by @awni in #2328
- [CUDA] Add MLX_CUDA_GRAPH_CACHE_SIZE env for setting graph cache size by @zcbenz in #2329
- Fix layernorm race condition by @angeloskath in #2340
- Build with all cpu cores by default by @zcbenz in #2336
- [CUDA] Do vectorized store/load in binary ops by @zcbenz in #2330
- Auto build linux release by @awni in #2341
- MoE backward improvements by @angeloskath in #2335
- Fix compilation with CUDA 11 by @zcbenz in #2331
- patch bump by @awni in #2343
- Align mlx::core::max op nan propagation with NumPy by @jhavukainen in #2339
- Add zero for argsort vjp by @awni in #2345
- [CUDA] Do vectorized store/load in contiguous elementwise ops by @zcbenz in #2342
- Align mlx::core::min op nan propagation with NumPy by @jhavukainen in #2346
- [CUDA] Set current device before cudaGraphLaunch by @zcbenz in #2351
- [CUDA] Put version in ptx cache dir path by @zcbenz in #2352
- Fix type promotion in Adam with bias correction by @angeloskath in #2350
- Fix edge check in QuantizedBlockLoader for qmm_n by @angeloskath in #2355
- [CUDA] Implement Scan kernel by @zcbenz in #2347
- [Metal] fix copy dispatch by @awni in #2360
- [CUDA] Bundle CCCL for JIT compilation by @zcbenz in #2357
- [CUDA] Do not put kernels in anonymous namespace by @zcbenz in #2362
- Fix imag() vjp by @angeloskath in #2367
- Add Primitive::name and remove Primitive::print by @zcbenz in #2365
- update linux build by @awni in #2370
- [CUDA] Affine quantize by @awni in #2354
- Fix flaky linux test by @awni in #2371
- Install linux with mlx[cuda] and mlx[cpu] by @awni in #2356
- [CUDA] Use cuda::std::complex in place of cuComplex by @zcbenz in #2372
- lower memory uniform sampling by @awni in #2361
- [CUDA] Fix complex reduce + nan propagation in min and max by @awni in #2377
- Rename the copy util in cpu/copy.h to copy_cpu by @zcbenz in #2378
- fix ring distributed test by @awni in #2380
- Test with CUDA 12.2 by @awni in #2375
- [CUDA] Add work per thread to compile by @angeloskath in #2368
- [CUDA] Fix resource leaks in matmul and graph by @awni in #2383
- [CUDA] Add more ways finding CCCL headers in JIT by @zcbenz in #2382
- Add contiguous_copy_gpu util for copying array by @zcbenz in #2379
- Adding support for the Muon Optimizer by @Goekdeniz-Guelmez in #1914
- Patch bump by @awni in #2386
- Fix release build + patch bump by @awni in #2387
- Fix cuda manylinux version to match others by @awni in #2388
- [CUDA] speedup handling scalars by @awni in #2389
- Remove thrust iterators by @zcbenz in https://g...