### v0.45.1

#### Improvements:

* Compatibility for `triton>=3.2.0`
* Moved package configuration to `pyproject.toml`
* Build system: initial support for NVIDIA Blackwell B100 GPUs, RTX 50 Blackwell series GPUs, and Jetson Thor Blackwell.
  * Note: binaries built for these platforms are not included in this release. They will be included in future releases once the upcoming CUDA Toolkit 12.7 and 12.8 become available.

#### Bug Fixes:
* Packaging: wheels will no longer include unit tests. (#1478)

#### Dependencies:
* Sets the minimum PyTorch version to 2.0.0.

### 0.45.0

This is a significant release, bringing support for LLM.int8() to NVIDIA Hopper GPUs such as the H100.

As part of these compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future maintenance and compatibility. The new implementation no longer uses the col32 or other architecture-specific tensor layout formats, while remaining backwards compatible. We additionally bring performance improvements targeted at inference scenarios.
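
Below is a minimal sketch (not from the release notes) of LLM.int8() inference with `bnb.nn.Linear8bitLt`; the layer sizes are illustrative, and `has_fp16_weights=False` with `threshold=6.0` are the usual inference-time settings for the mixed-precision decomposition.

```python
import torch
import bitsandbytes as bnb

# Start from an ordinary fp16 linear layer.
fp16_linear = torch.nn.Linear(4096, 4096).half()

# Wrap it as an LLM.int8() layer: has_fp16_weights=False keeps int8 weights
# for inference, and threshold=6.0 routes activation outliers through fp16.
int8_linear = bnb.nn.Linear8bitLt(4096, 4096, has_fp16_weights=False, threshold=6.0)
int8_linear.load_state_dict(fp16_linear.state_dict())
int8_linear = int8_linear.cuda()  # quantization happens when moving to the GPU

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
out = int8_linear(x)
```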

#### Performance Improvements
This release includes broad performance improvements for a wide variety of inference scenarios. See [this X thread](https://x.com/Tim_Dettmers/status/1864706051171287069) for a detailed explanation.

#### Breaking Changes
🤗[PEFT](https://github.com/huggingface/peft) users wishing to merge adapters with 8-bit weights will need to upgrade to `peft>=0.14.0`.

#### Packaging Improvements
* The size of our wheel has been reduced by ~43.5%, from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396 MB to ~224 MB.
* Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
* The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.

#### Deprecations
* A number of public API functions have been marked for deprecation and will emit a `FutureWarning` when used. These functions will become unavailable in future releases. This should have minimal impact on most end users.
* The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using `block_wise=False` is not recommended and support will be removed in a future release; see the sketch after this list.
* As part of the refactoring process, we've implemented many new 8-bit operations. These operations no longer use specialized data layouts.
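
A minimal sketch (illustrative, not from the release notes) of the recommended blockwise setting for the 8-bit optimizers; `block_wise=True` is already the default, so it is spelled out below only to contrast with the deprecated alternative:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# Recommended: blockwise quantization of optimizer state (the default).
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4, block_wise=True)

# Deprecated: block_wise=False support will be removed in a future release.
# optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4, block_wise=False)
```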

#### Full Changelog

* refine docs for multi-backend alpha release by @Titus-von-Koeller in #1380
* README: Replace special Unicode text symbols with regular characters by @akx in #1385
* Update CI tools & fix typos by @akx in #1386
* Fix invalid escape sequence warning in Python 3.12 by @oshiteku in #1420
* [Build] Add CUDA 12.6.2 build; update 12.5.0 to 12.5.1 by @matthewdouglas in #1431
* LLM.int8() Refactoring: Part 1 by @matthewdouglas in #1401

### 0.44.1

#### Bug fixes:
* Fix optimizer support for Python <= 3.9 by @matthewdouglas in #1379

### 0.44.0

#### New: AdEMAMix Optimizer
The [AdEMAMix](https://hf.co/papers/2409.03137) optimizer is a modification of AdamW that tracks two EMAs of the gradient to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.

We've implemented 8-bit and paged variations: `AdEMAMix`, `AdEMAMix8bit`, `PagedAdEMAMix`, and `PagedAdEMAMix8bit`. These can be used with a similar API to the existing optimizers.
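
A minimal usage sketch (illustrative; the classes are assumed to live in `bnb.optim` alongside the existing optimizers, and the hyperparameters shown are assumptions):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# Drop-in replacement following the familiar torch.optim-style API; the 8-bit
# and paged variants (AdEMAMix8bit, PagedAdEMAMix, PagedAdEMAMix8bit) are used
# the same way.
optimizer = bnb.optim.AdEMAMix(model.parameters(), lr=1e-4, weight_decay=1e-2)

loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```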

#### Improvements:
* **8-bit Optimizers**: The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This departs from the original implementation proposed in [the paper](https://hf.co/papers/2110.02861) and improves accuracy.
* **CUDA Graphs support**: A fix to enable [CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) capture of kernel functions was made in #1330. This allows for performance improvements with inference frameworks like vLLM. Thanks @jeejeelee! See the sketch after this list.
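
A minimal sketch (illustrative, not from the release notes) of capturing a bitsandbytes 8-bit layer in a CUDA graph, following the standard PyTorch warm-up-then-capture pattern; the layer shape and warm-up count are assumptions:

```python
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False).half().cuda()
static_in = torch.randn(8, 1024, dtype=torch.float16, device="cuda")

# Warm up on a side stream, as required before CUDA graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        layer(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph, then replay it with new inputs.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = layer(static_in)

static_in.copy_(torch.randn_like(static_in))  # update inputs in place
graph.replay()  # static_out now holds the result for the new inputs
```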

#### Full Changelog:
* Embedding4bit and Embedding8bit implementation by @galqiwi in #1292
* Bugfix: Load correct nocublaslt library variant when BNB_CUDA_VERSION override is set by @matthewdouglas in #1318
* Enable certain CUDA kernels to accept specified cuda stream by @jeejeelee in #1330
* Initial support for ppc64le by @mgiessing in #1316
* Cuda source cleanup, refactor and fixes by @abhilash1910 in #1328
* Update for VS2022 17.11 compatibility with CUDA < 12.4 by @matthewdouglas in #1341
* Bump the minor-patch group with 3 updates by @dependabot in #1362
* Update matplotlib requirement from ~=3.9.1 to ~=3.9.2 in the major group by @dependabot in #1361
* docs: add internal reference to multi-backend guide by @Titus-von-Koeller in #1352
* Add move_to_device kwarg to the optimizer's load_state_dict by @koute in #1344
* Add AdEMAMix optimizer by @matthewdouglas in #1360
* Change 8bit optimizer blocksize 2048->256; additional bf16 support by @matthewdouglas in #1365

### 0.43.3
