Releases · ml-explore/mlx

[CUDA] Optimize set_mm_device_pointers for small ndim by @zcbenz in #2473
Fix logsumexp/softmax not fused for some cases by @zcbenz in #2474
Use CMake <4.1 to avoid the nvpl error by @angeloskath in #2489
Fix incorrect interpretation of unsigned dtypes in reduce ops by @abeleinin in #2477
make code blocks copyable by @Dan-Yeh in #2480
Rename cu::Matmul to CublasGemm by @zcbenz in #2488
Faster general unary op by @awni in #2472
The naive_conv_2d is no longer used by @zcbenz in #2496
Remove the hack around SmallVector in cpu compile by @zcbenz in #2494
Clean up code handling both std::vector and SmallVector by @zcbenz in #2493
[CUDA] Fix conv grads with groups by @zcbenz in #2495
Update cuDNN Frontend to v1.14 by @zcbenz in #2505
Ensure small sort doesn't use indices if not argsort by @angeloskath in #2506
Ensure no oob read in gemv_masked by @angeloskath in #2508
fix custom kernel test by @awni in #2510
No segfault with uninitialized array.at by @awni in #2514
Fix lapack svd by @awni in #2515
Split cuDNN helpers into a separate header by @zcbenz in #2491
[CUDA] Add GEMM-based fallback convolution kernels by @zcbenz in #2511
Fix docs by @russellizadi in #2518
Fix overflow in large filter small channels by @angeloskath in #2520
[CUDA] Fix stride of singleton dims before passing to cuDNN by @zcbenz in #2521
Custom cuda kernel by @angeloskath in #2517
Fix docs omission by @angeloskath in #2524
Fix power by @awni in #2523
NCCL backend by @nastya236 in #2476
[CUDA] Nccl pypi dep + default for cuda by @awni in #2526
Fix warning 186-D from nvcc by @zcbenz in #2527
[CUDA] Update calls to cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 by @andportnoy in #2525
nccl default for backend=any by @awni in #2528
Fix allocation bug in NCCL by @awni in #2530
Enable COMPILE_WARNING_AS_ERROR for linux builds in CI by @zcbenz in #2534
[CUDA] Remove thrust in arange by @zcbenz in #2535
Use nccl header only when nccl is not present by @awni in #2539
Allow pathlib.Path to save/load functions by @awni in #2541
Remove nccl install in release by @awni in #2542
[CUDA] Implement DynamicSlice/DynamicSliceUpdate by @zcbenz in #2533
Remove stream from average grads so it uses default by @awni in #2532
Enable cuda graph toggle by @awni in #2545
Tests for save/load with Path by @awni in #2543
Run CPP tests for CUDA build in CI by @zcbenz in #2544
Separate cpu compilation cache by versions by @zcbenz in #2548
[CUDA] Link with nccl by @awni in #2546
[CUDA] Use ConcurrentContext in concatenate_gpu by @zcbenz in #2549
[CUDA] fix sort by @awni in #2550
Add mode parameter for quantization by @awni in #2499
Bump xcode in circle by @awni in #2551
Fix METAL quantization in JIT + fix release build by @awni in #2553
Faster contiguous gather for indices in the first axis by @awni in #2552
version bump by @awni in #2554
Fix quantized vjp for mxfp4 by @awni in #2555

New Contributors

@Dan-Yeh made their first contribution in #2480
@russellizadi made their first contribution in #2518
@andportnoy made their first contribution in #2525

Full Changelog: v0.28.0...v0.29.0

Contributors

zcbenz, angeloskath, and 6 other contributors

07 Aug 07:50

angeloskath

v0.28.0

56be773

Highlights

First version of fused sdpa vector for CUDA
Convolutions in CUDA
Speed improvements in CUDA normalization layers, softmax, compiled kernels, overheads and more
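
As a rough illustration of the first highlight, the sketch below shows what a scaled dot-product attention (sdpa) step computes for a single query vector. It assumes "sdpa vector" refers to the single-query case that arises during token-by-token decoding. The function name `sdpa_vector_reference` and its arguments are hypothetical and written in plain Python. This is an unfused reference for the math only, not the CUDA kernel added in this release.

```python
import math


def sdpa_vector_reference(q, keys, values, scale=None):
    """Unfused reference: attention output for one query vector.

    q:      list of floats, length d
    keys:   list of n key vectors, each of length d
    values: list of n value vectors, each of length d_v
    """
    if scale is None:
        # Conventional default scaling of 1 / sqrt(d).
        scale = 1.0 / math.sqrt(len(q))

    # Step 1: scaled dot product of the query with every key.
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

    # Step 2: numerically stable softmax (subtract the max before exp).
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)

    # Step 3: weighted average of the value vectors.
    dim = len(values[0])
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(dim)
    ]


# Two identical keys receive equal weight, so the output is the mean
# of the two value vectors.
out = sdpa_vector_reference(
    [1.0, 0.0],
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0]],
)
print(out)  # [2.0, 3.0]
```

A fused kernel performs these three steps in one pass without materializing the intermediate score and weight arrays in global memory, which is where the speedup over the unfused form typically comes from.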

What's Changed

[CUDA] Fix segfault on exit by @awni in #2424
[CUDA] No occupancy query for launch params by @awni in #2426
[CUDA] More sizes for gemv by @awni in #2429
Add more CUDA architectures for PyPi package by @awni in #2427
Use ccache in CI by @zcbenz in #2414
[CUDA] Use aligned vector in Layer Norm and RMS norm by @awni in #2433
Cuda faster softmax by @awni in #2435
Remove the kernel arg from get_launch_args by @zcbenz in #2437
Move arange to its own file by @zcbenz in #2438
Use load_vector in arg_reduce by @zcbenz in #2439
Make CI faster by @zcbenz in #2440
[CUDA] Quantized refactoring by @angeloskath in #2442
fix circular reference by @awni in #2443
[CUDA] Fix gemv regression by @awni in #2445
Fix wrong graph key when using concurrent context by @zcbenz in #2447
Fix custom metal extension by @awni in #2446
Add tests for export including control flow models and quantized models by @junpeiz in #2430
[CUDA] Backward convolution by @zcbenz in #2431
[CUDA] Save primitive inputs faster by @zcbenz in #2449
[CUDA] Vectorize generated kernels by @angeloskath in #2444
[CUDA] Matmul utils initial commit by @angeloskath in #2441
Fix arctan2 grads by @angeloskath in #2453
Use LRU cache for cuda graph by @zcbenz in #2448
Add missing algorithm header to jit_compiler.cpp for Linux builds by @zamderax in #2460
Default install cuda on linux by @awni in #2462
fix wraps compile by @awni in #2461
Feat: add USE_SYSTEM_FMT CMake option by @GaetanLepage in #2219
Use SmallVector for shapes and strides by @zcbenz in #2454
Fix install tags by @awni in #2464
Faster gather qmm sorted test by @awni in #2463
Fix cublas on h100 by @awni in #2466
revert default cuda install by @awni in #2465
feat: support a destinations based in tree flatten/unflatten by @LVivona in #2450
Fix typo in metal command encoder by @angeloskath in #2471
Update CUDA sdpa by @jagrit06 in #2468
version by @awni in #2470

New Contributors

@junpeiz made their first contribution in #2430
@zamderax made their first contribution in #2460
@GaetanLepage made their first contribution in #2219
@LVivona made their first contribution in #2450

Full Changelog: v0.27.1...v0.28.0

Contributors

zcbenz, angeloskath, and 6 other contributors

25 Jul 22:48

awni

v0.27.1

4ad5341

Highlights

Initial PyPI release of the CUDA back-end.
The CUDA back-end works well with mlx-lm:
- Reasonably fast for LLM inference
- Supports single-machine training and LoRA fine-tuning

What's Changed

Avoid invoking allocator::malloc when creating CUDA event by @zcbenz in #2232
Share more common code in Compiled by @zcbenz in #2240
Avoid atomic updates across CPU/GPU in CUDA event by @zcbenz in #2231
Perf regression fix by @angeloskath in #2243
Add profiler annotations in common primitives for CUDA backend by @zcbenz in #2244
Default strict mode for module update and update_modules by @awni in #2239
Fix linux linking error by @awni in #2248
Improve metal elementwise kernels by @awni in #2247
CUDA backend: matmul by @zcbenz in #2241
Change layernorms to two pass algorithm by @angeloskath in #2246
Fix unintuitive metal kernel caching by @awni in #2242
Refactor the lu test by @emmanuel-ferdman in #2250
CUDA backend: unary ops by @zcbenz in #2158
Fix export to work with gather/scatter axis by @awni in #2263
CUDA backend: binary ops by @zcbenz in #2259
Report number of missing parameters by @FL33TW00D in #2264
CUDA backend: sort by @zcbenz in #2262
CUDA backend: random by @zcbenz in #2261
Fix conv export by @awni in #2265
CUDA backend: copy ops by @zcbenz in #2260
Fix building cpp benchmarks on Linux by @zcbenz in #2268
Add load_safe to the general conv loaders by @angeloskath in #2258
start cuda circle config by @awni in #2256
CUDA backend: reduce by @zcbenz in #2269
CUDA backend: argreduce by @zcbenz in #2270
CUDA backend: softmax by @zcbenz in #2272
CUDA backend: layernorm by @zcbenz in #2271
Fix warnings from latest CUDA toolkit by @zcbenz in #2275
Make sliceUpdate general by @awni in #2282
CUDA backend: compile by @zcbenz in #2276
[CUDA] RMSNorm and VJP by @awni in #2280
[CUDA] Fix build by @awni in #2284
[CUDA] ternary with select op by @awni in #2283
CUDA backend: indexing ops by @zcbenz in #2277
Collection of refactors by @jagrit06 in #2274
Fix complex power and print by @awni in #2286
fix cuda jit by @awni in #2287
Fix cuda gemm for bf16 by @awni in #2288
Fix cuda arg reduce by @awni in #2291
RoPE for CUDA by @angeloskath in #2293
Add python testing for cuda with ability to skip list of tests by @awni in #2295
[CUDA] Fix back-end bugs and enable corresponding tests by @awni in #2296
Cuda bug fixes 2 by @awni in #2298
[CUDA] Divmod, Partition, and sort fixes by @awni in #2302
[CUDA] synch properly waits for all tasks to finish and clear by @awni in #2303
Make ptx cache settable by environment variable by @angeloskath in #2304
Build CUDA release in Circle by @awni in #2306
Cuda perf tuning by @awni in #2307
Fix update_modules() when providing a subset by @angeloskath in #2308
Compile float64 functions on CPU by @awni in #2311
Fix get 2d grid dims by @angeloskath in #2316
Split broadcast so it is always fused in compile by @angeloskath in #2318
[CUDA] Fix reductions by @angeloskath in #2314
Fix module update in strict mode by @awni in #2321
MLX_SWITCH macros to templates by @angeloskath in #2320
Use fp32 for testing, add more complex ops by @awni in #2322
Patch bump by @awni in #2324
Allow parameters to be deleted from a module by @awni in #2325
Fix compilation error from integral_constant by @zcbenz in #2326
[CUDA] Switch to CUDA graphs by @awni in #2317
[CUDA] Fix graphs for older cuda by @awni in #2328
[CUDA] Add MLX_CUDA_GRAPH_CACHE_SIZE env for setting graph cache size by @zcbenz in #2329
Fix layernorm race condition by @angeloskath in #2340
Build with all cpu cores by default by @zcbenz in #2336
[CUDA] Do vectorized store/load in binary ops by @zcbenz in #2330
Auto build linux release by @awni in #2341
MoE backward improvements by @angeloskath in #2335
Fix compilation with CUDA 11 by @zcbenz in #2331
patch bump by @awni in #2343
Align mlx::core::max op nan propagation with NumPy by @jhavukainen in #2339
Add zero for argsort vjp by @awni in #2345
[CUDA] Do vectorized store/load in contiguous elementwise ops by @zcbenz in #2342
Align mlx::core::min op nan propagation with NumPy by @jhavukainen in #2346
[CUDA] Set current device before cudaGraphLaunch by @zcbenz in #2351
[CUDA] Put version in ptx cache dir path by @zcbenz in #2352
Fix type promotion in Adam with bias correction by @angeloskath in #2350
Fix edge check in QuantizedBlockLoader for qmm_n by @angeloskath in #2355
[CUDA] Implement Scan kernel by @zcbenz in #2347
[Metal] fix copy dispatch by @awni in #2360
[CUDA] Bundle CCCL for JIT compilation by @zcbenz in #2357
[CUDA] Do not put kernels in anonymous namespace by @zcbenz in #2362
Fix imag() vjp by @angeloskath in #2367
Add Primitive::name and remove Primitive::print by @zcbenz in #2365
update linux build by @awni in #2370
[CUDA] Affine quantize by @awni in #2354
Fix flaky linux test by @awni in #2371
Install linux with mlx[cuda] and mlx[cpu] by @awni in #2356
[CUDA] Use cuda::std::complex in place of cuComplex by @zcbenz in #2372
lower memory uniform sampling by @awni in #2361
[CUDA] Fix complex reduce + nan propagation in min and max by @awni in #2377
Rename the copy util in cpu/copy.h to copy_cpu by @zcbenz in #2378
fix ring distributed test by @awni in #2380
Test with CUDA 12.2 by @awni in #2375
[CUDA] Add work per thread to compile by @angeloskath in #2368
[CUDA] Fix resource leaks in matmul and graph by @awni in #2383
[CUDA] Add more ways finding CCCL headers in JIT by @zcbenz in #2382
Add contiguous_copy_gpu util for copying array by @zcbenz in #2379
Adding support for the Muon Optimizer by @Goekdeniz-Guelmez in #1914
Patch bump by @awni in #2386
Fix release build + patch bump by @awni in #2387
Fix cuda manylinux version to match others by @awni in #2388
[CUDA] speedup handling scalars by @awni in #2389
Remove thrust iterators by @zcbenz in https://g...