v0.28.0
Highlights
- First version of a fused SDPA vector kernel for CUDA
- Convolutions in CUDA
- Speed improvements in CUDA normalization layers, softmax, and compiled kernels, plus reduced overheads and more
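
For readers unfamiliar with the term in the first highlight, SDPA stands for scaled dot-product attention, and the "vector" case is assumed here to mean attention for a single query vector (for example, one token during decoding). The sketch below is a pure-Python reference of that math only. It is not the CUDA kernel, and the function name `sdpa_single_query` is hypothetical, introduced just for illustration.

```python
import math


def sdpa_single_query(q, keys, values, scale=None):
    """Reference attention for one query vector against N key/value rows.

    Hypothetical helper for illustration; not part of any library API.
    """
    if scale is None:
        # Conventional default: 1 / sqrt(head dimension).
        scale = 1.0 / math.sqrt(len(q))

    # Scaled dot product between the query and every key.
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

    # Numerically stable softmax: subtract the max before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value rows.
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
```

A fused kernel computes the scores, the softmax, and the weighted sum in one pass instead of materializing each intermediate, which is where the speed benefit typically comes from.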
What's Changed
- [CUDA] Fix segfault on exit by @awni in #2424
- [CUDA] No occupancy query for launch params by @awni in #2426
- [CUDA] More sizes for gemv by @awni in #2429
- Add more CUDA architectures for PyPI package by @awni in #2427
- Use ccache in CI by @zcbenz in #2414
- [CUDA] Use aligned vector in Layer Norm and RMS norm by @awni in #2433
- Cuda faster softmax by @awni in #2435
- Remove the kernel arg from get_launch_args by @zcbenz in #2437
- Move arange to its own file by @zcbenz in #2438
- Use load_vector in arg_reduce by @zcbenz in #2439
- Make CI faster by @zcbenz in #2440
- [CUDA] Quantized refactoring by @angeloskath in #2442
- fix circular reference by @awni in #2443
- [CUDA] Fix gemv regression by @awni in #2445
- Fix wrong graph key when using concurrent context by @zcbenz in #2447
- Fix custom metal extension by @awni in #2446
- Add tests for export including control flow models and quantized models by @junpeiz in #2430
- [CUDA] Backward convolution by @zcbenz in #2431
- [CUDA] Save primitive inputs faster by @zcbenz in #2449
- [CUDA] Vectorize generated kernels by @angeloskath in #2444
- [CUDA] Matmul utils initial commit by @angeloskath in #2441
- Fix arctan2 grads by @angeloskath in #2453
- Use LRU cache for cuda graph by @zcbenz in #2448
- Add missing algorithm header to jit_compiler.cpp for Linux builds by @zamderax in #2460
- Default install cuda on linux by @awni in #2462
- fix wraps compile by @awni in #2461
- Feat: add USE_SYSTEM_FMT CMake option by @GaetanLepage in #2219
- Use SmallVector for shapes and strides by @zcbenz in #2454
- Fix install tags by @awni in #2464
- Faster gather qmm sorted test by @awni in #2463
- Fix cublas on h100 by @awni in #2466
- revert default cuda install by @awni in #2465
- feat: support a destinations based in tree flatten/unflatten by @LVivona in #2450
- Fix typo in metal command encoder by @angeloskath in #2471
- Update CUDA sdpa by @jagrit06 in #2468
- version by @awni in #2470
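
One entry in the list above switches the CUDA graph cache to an LRU (least recently used) policy. As a generic illustration of that policy, and not the code from #2448, here is a minimal sketch using only the Python standard library. The class name `LRUCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used entry.

    Hypothetical illustration of the LRU policy; not MLX's implementation.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # Mark the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Drop the entry that has gone unused the longest.
            self._items.popitem(last=False)

    def __len__(self):
        return len(self._items)
```

Bounding the cache this way keeps memory use fixed while retaining the entries most likely to be reused.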
New Contributors
- @junpeiz made their first contribution in #2430
- @zamderax made their first contribution in #2460
- @GaetanLepage made their first contribution in #2219
- @LVivona made their first contribution in #2450
Full Changelog: v0.27.1...v0.28.0