Highlights
- The default blocksize of 64 for 4-bit quantization is now supported on ROCm. Previously the ROCm default was 128, which did not match the default on other devices.
- A ROCm 7.2 build is now included.
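
For readers unfamiliar with the blocksize parameter, the following is a minimal pure-Python sketch of what it controls in blockwise quantization: every block of `blocksize` values shares one absmax scale, so a smaller blocksize means more scales to store. The helper names (`num_blocks`, `scale_overhead_bits`, `blockwise_absmax`) are hypothetical and are not part of bitsandbytes, and the 32-bit scale width is an assumption (no nested quantization of the scales). The real kernels run on the GPU and are not reproduced here.

```python
import math


def num_blocks(n_elements: int, blocksize: int) -> int:
    """Number of blocks (and therefore absmax scales) for a flat tensor."""
    return math.ceil(n_elements / blocksize)


def scale_overhead_bits(blocksize: int, scale_bits: int = 32) -> float:
    """Extra bits stored per quantized value for the per-block scale.

    Assumption: one scale of `scale_bits` bits per block.
    """
    return scale_bits / blocksize


def blockwise_absmax(values, blocksize):
    """Per-block absolute maximum, the scale used to normalize each block."""
    return [
        max(abs(v) for v in values[i:i + blocksize])
        for i in range(0, len(values), blocksize)
    ]


if __name__ == "__main__":
    n = 4096 * 4096  # illustrative weight-matrix size
    for bs in (32, 64, 128):
        print(bs, num_blocks(n, bs), scale_overhead_bits(bs))
```

Under these assumptions, a blocksize of 64 costs 0.5 extra bits per value for the scales, versus 0.25 at a blocksize of 128, in exchange for finer-grained scaling.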
What's Changed
- bug: fix 8bitoptim support with fsdp by @ved1beta in #1840
- Fix xpu 4bit kernel by @jiqing-feng in #1839
- ROCm 7.2 build and doc changes by @sstamenk in #1845
- Add CUDA kernel support for 4-bit quantization with blocksize=32 by @Abdennacer-Badaoui in #1854
- Add blocksize=64 4-bit quantization support for ROCm CDNA (warp64) GPUs by @Abdennacer-Badaoui in #1856
- [Docs Update] QLoRA 4-bit Support on ROCm by @Abdennacer-Badaoui in #1857
- [ROCm] Make blocksize=64 default for 4bit by @matthewdouglas in #1873
- Handle non-contiguous tensors in quantize/dequantize ops by @TimDettmers in #1859
- Fix AdEMAMix scheduler guard and add state_dict round-trip test by @TimDettmers in #1861
New Contributors
- @Abdennacer-Badaoui made their first contribution in #1854
Full Changelog: 0.49.1...0.49.2