Highlights
x86-64 CPU Improvements
CPU performance for 4bit is significantly improved on x86-64, with optimized kernel paths for CPUs that have AVX512 or AVX512BF16 support.
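To see whether a Linux machine advertises the instruction sets mentioned above, the kernel's CPU flags can be inspected. The following is a minimal sketch, not how the library itself selects a kernel path: the helper names `cpu_flags` and `avx512_support` are hypothetical, and it assumes the flag names `avx512f` and `avx512_bf16` that Linux reports in `/proc/cpuinfo`.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # All logical CPUs normally list the same flags, so the first hit is enough.
            return set(value.split())
    return set()


def avx512_support(flags):
    """Report AVX-512 Foundation and AVX-512 BF16 availability (hypothetical helper)."""
    return {
        "avx512": "avx512f" in flags,
        "avx512bf16": "avx512_bf16" in flags,
    }


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(avx512_support(cpu_flags(f.read())))
    except OSError:
        print("/proc/cpuinfo is not available on this platform")
```

If neither flag is reported, the optimized paths described here presumably do not apply to that machine.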
AMD ROCm Experimental Wheels
- Experimental support for AMD devices is now included in our PyPI wheels on Linux x86-64.
- We've added more GPU target devices, as outlined in our docs.
- Support for using the default blocksize of 64 for 4bit was added for RDNA GPUs in #1748.
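For readers unfamiliar with what the blocksize controls: in blockwise quantization, each run of `blocksize` consecutive values shares one scaling factor (the block's absolute maximum). The toy sketch below illustrates that idea in pure Python. It is an assumption-laden simplification: it maps values linearly to signed integers in [-7, 7] rather than using the library's 4bit data types, and the function names are hypothetical.

```python
def quantize_blockwise(values, blocksize=64):
    """Toy blockwise absmax quantization to signed 4-bit integers in [-7, 7]."""
    codes, scales = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        # One scale per block; fall back to 1.0 for an all-zero block.
        absmax = max(abs(v) for v in block) or 1.0
        scales.append(absmax)
        codes.extend(round(v / absmax * 7) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, blocksize=64):
    """Invert the toy scheme: rescale each code by its block's absmax."""
    return [c / 7 * scales[i // blocksize] for i, c in enumerate(codes)]


if __name__ == "__main__":
    data = [i / 100 for i in range(-64, 64)]  # 128 values, so two blocks of 64
    codes, scales = quantize_blockwise(data)
    restored = dequantize_blockwise(codes, scales)
    print(len(scales), max(abs(a - b) for a, b in zip(data, restored)))
```

A smaller blocksize means more scaling factors to store but a tighter fit to local value ranges, which is why a default of 64 is a common trade-off.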
macOS 14+ Wheels
- We're now publishing wheels for macOS 14+!
- The 4bit and 8bit quantization features work on MPS, but currently rely on slow implementations. We plan to enable Metal kernels with improved performance in the future.
🚨 Breaking Changes
- Dropped support for Python 3.9.
- Dropped compilation support for Maxwell GPUs in the CUDA backend.
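Downstream projects may want to check for these two conditions before upgrading. The sketch below is one way to do so, under stated assumptions: it treats Python 3.10 as the new minimum (inferred from the removal of 3.9), it relies on Maxwell GPUs reporting CUDA compute capability 5.x, and the helper names are hypothetical.

```python
import sys

MIN_PYTHON = (3, 10)  # assumption: 3.10 is the minimum now that 3.9 is dropped


def python_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version meets the assumed minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def is_maxwell(compute_capability):
    """Return True for Maxwell-generation GPUs (compute capability 5.0, 5.2, 5.3)."""
    major, _minor = compute_capability
    return major == 5


if __name__ == "__main__":
    print("Python OK:", python_supported())
    print("GTX 980 (5, 2) is Maxwell:", is_maxwell((5, 2)))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed to `is_maxwell`.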
What's Changed
- [ROCm] Update build targets by @matthewdouglas in #1788
- Drop Python 3.9 support by @matthewdouglas in #1795
- Fix indexing overflow issue for blockwise quantization on AMD by @sstamenk in #1796
- Tests: Run CPU tests against PyTorch 2.9 by @matthewdouglas in #1797
- Remove deprecated code by @matthewdouglas in #1798
- Cpu C++ kernel by @jiqing-feng in #1789
- fix build error: "no case matching constant switch condition" by @yuguo68 in #1802
- CI: skip rebuilding CPU lib when building/installing wheels by @matthewdouglas in #1803
- add support for 64 block size on 32 warp size supported amd gpus by @electron271 in #1748
- Enable more tests on AMD for warp size 32 by @sstamenk in #1805
- CUDA: Drop compilation compatibility with Maxwell by @matthewdouglas in #1806
- ROCm: Add build for ROCm 7.1 by @matthewdouglas in #1807
- CI: Enable tests on Linux x86-64 with CUDA 13 by @matthewdouglas in #1808
- Replace NULL with nullptr in pythonInterface.cpp by @yuguo68 in #1809
- CI: Run tests on PRs, refactor nightly test workflow by @matthewdouglas in #1811
- Remove old nightly workflow by @matthewdouglas in #1812
- Cpu fused kernel by @jiqing-feng in #1804
- Update README by @matthewdouglas in #1816
- Cleanup: remove FastBinarySearch by @matthewdouglas in #1817
- Enable publishing of macOS wheel by @matthewdouglas in #1818
- ROCm: reduce size of builds by @matthewdouglas in #1819
- CUDA 13: aggressive compression of binary size by @matthewdouglas in #1820
- ROCm: Add gfx1150/gfx1151 to build targets by @matthewdouglas in #1822
- Update workflow dependencies by @matthewdouglas in #1824
- Hf kernel by @jiqing-feng in #1814
- CUDA/ROCm: Remove dead code by @matthewdouglas in #1827
- CPU: workaround avx512 4bit dequantize accuracy issue for large blocksize by @matthewdouglas in #1828
- Update installation doc by @matthewdouglas in #1830
- Add release for DGX Spark cuda121 by @mfuntowicz in #1829
- Fix: Python 3.14 compatibility with PyTorch 2.9 by @matthewdouglas in #1831
New Contributors
- @sstamenk made their first contribution in #1796
- @yuguo68 made their first contribution in #1802
- @electron271 made their first contribution in #1748
- @mfuntowicz made their first contribution in #1829
Full Changelog: 0.48.2...0.49.0