Releases: bitsandbytes-foundation/bitsandbytes

Latest `main` wheel

12 Dec 18:07
e6ccde2

Pre-release

Latest main pre-release wheel

This pre-release contains the latest development wheels for all supported platforms, rebuilt automatically on every commit to the main branch.

How to install:
Pick the correct command for your platform and run it in your terminal:

macOS 14+ (arm64)

pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-macosx_14_0_arm64.whl

Linux (aarch64)

pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl

Linux (x86_64)

pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl

Windows (x86_64)

pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-win_amd64.whl
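If you script installs across machines, the right wheel can be selected automatically. A minimal shell sketch, using the base URL and filenames from the commands above (Windows is omitted, since `uname` is typically unavailable there):

```shell
# Sketch: pick the matching nightly wheel for the current machine.
# Base URL and filenames are taken from the install commands above.
base="https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main"
case "$(uname -s)-$(uname -m)" in
  Darwin-arm64)  wheel="bitsandbytes-1.33.7.preview-py3-none-macosx_14_0_arm64.whl" ;;
  Linux-aarch64) wheel="bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl" ;;
  Linux-x86_64)  wheel="bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl" ;;
  *) echo "unsupported platform: $(uname -s)-$(uname -m)" >&2; exit 1 ;;
esac
echo "pip install --force-reinstall ${base}/${wheel}"
```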

Note:
These wheels are updated automatically with every commit to main and become available as soon as the python-package.yml workflow finishes.

The version number in the wheel filename is pinned to 1.33.7-preview so that the download links stay stable. This does not affect the version that actually gets installed:

> pip install https://.../bitsandbytes-1.33.7-preview-py3-none-manylinux_2_24_x86_64.whl
Collecting bitsandbytes==1.33.7rc0
...
Successfully installed bitsandbytes-0.49.0.dev0
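This works because pip reads the installed version from the wheel's internal metadata, not from its filename. You can confirm the version of any installed distribution with the standard library (shown here with `pip` itself as a stand-in for bitsandbytes):

```python
from importlib.metadata import version

# The version reported here comes from the package's METADATA file,
# which is independent of the wheel filename it was installed from.
print(version("pip"))
```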

0.49.0

11 Dec 20:51

Highlights

x86-64 CPU Improvements

CPU performance for 4bit is significantly improved on x86-64, with optimized kernel paths for CPUs that have AVX512 or AVX512BF16 support.
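For intuition, this is the general scheme those kernels accelerate: blockwise absmax quantization to a 4-bit code range, with one scale stored per block. A minimal pure-Python sketch, not the library's actual implementation (the real kernels use NF4/FP4 codebooks and vectorized instructions):

```python
# Illustrative sketch of blockwise absmax 4-bit quantization.
# Codes are signed integers in [-7, 7]; one scale is kept per block.

def quantize_4bit(values, blocksize=64):
    """Quantize floats to 4-bit codes with one absmax scale per block."""
    quantized, scales = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        absmax = max(abs(v) for v in block) or 1.0
        scales.append(absmax)
        # Map each value into [-7, 7] and round to the nearest code.
        quantized.extend(round(v / absmax * 7) for v in block)
    return quantized, scales

def dequantize_4bit(quantized, scales, blocksize=64):
    """Recover approximate floats from codes and per-block scales."""
    return [q / 7 * scales[i // blocksize] for i, q in enumerate(quantized)]

data = [0.1, -0.5, 0.25, 1.0, -1.0, 0.0]
q, s = quantize_4bit(data, blocksize=4)
approx = dequantize_4bit(q, s, blocksize=4)
```

Smaller block sizes give each scale less data to cover and thus lower quantization error, at the cost of more scale storage.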

AMD ROCm Experimental Wheels

  • Experimental support for AMD devices is now included in our PyPI wheels on Linux x86-64.
  • We've added additional GPU target devices as outlined in our docs.
  • Support for using the default blocksize of 64 for 4bit was added for RDNA GPUs in #1748.

macOS 14+ Wheels

  • We're now publishing wheels for macOS 14+!
  • The 4bit and 8bit quantization features are supported on MPS through slow fallback implementations. We plan to enable faster Metal kernels in the future.

🚨 Breaking Changes

  • Dropped support for Python 3.9.
  • Dropped compilation support for Maxwell GPUs in the CUDA backend.

What's Changed

New Contributors

Full Changelog: 0.48.2...0.49.0

0.48.2

29 Oct 21:48

What's Changed

Full Changelog: 0.48.1...0.48.2

0.48.1

02 Oct 17:47

This release fixes a regression introduced in 0.48.0 related to LLM.int8(). This issue caused poor inference results with pre-quantized checkpoints in HF transformers.

What's Changed

Full Changelog: 0.48.0...0.48.1

0.48.0: Intel GPU & Gaudi support, CUDA 13, performance improvements, and more!

30 Sep 21:48

Highlights

🎉 Intel GPU Support

We now officially support Intel GPUs on Linux and Windows! Support is included for all major features (LLM.int8(), QLoRA, 8bit optimizers) with the exception of the paged optimizer feature.

This support includes the following hardware:

  • Intel® Arc™ B-Series Graphics
  • Intel® Arc™ A-Series Graphics
  • Intel® Data Center GPU Max Series

A compatible PyTorch version with Intel XPU support is required. The current minimum is PyTorch 2.6.0. It is recommended to use the latest stable release. See Getting Started on Intel GPU for guidance.
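As background on the first of those features: LLM.int8() builds on row-wise absmax int8 quantization (with outlier columns kept in higher precision, which is omitted here). A minimal pure-Python sketch of the quantization step only, purely illustrative and not the library's code:

```python
# Illustrative sketch of row-wise absmax int8 quantization, the basic
# building block of LLM.int8(). Outlier decomposition and the actual
# matmul kernels are omitted.

def quantize_rowwise_int8(matrix):
    """Quantize each row to int8 codes in [-127, 127], one scale per row."""
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row) or 1.0
        scales.append(absmax / 127.0)
        q_rows.append([round(v * 127.0 / absmax) for v in row])
    return q_rows, scales

def dequantize_rowwise_int8(q_rows, scales):
    """Recover approximate floats from int8 codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

w = [[0.5, -1.0, 0.25], [2.0, 0.0, -2.0]]
q, s = quantize_rowwise_int8(w)
```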

🎉 Intel Gaudi Support

We now officially support Intel Gaudi2 and Gaudi3 accelerators. This support includes LLM.int8() and QLoRA with the NF4 data type. At this time optimizers are not implemented.

A compatible PyTorch version with Intel Gaudi support is required. The current minimum is Gaudi v1.21 with PyTorch 2.6.0. It is recommended to use the latest stable release. See the Gaudi software installation guide for guidance.

NVIDIA CUDA

  • The 4bit dequantization kernel was improved by @Mhmd-Hisham in #1746. This change brings noticeable speed improvements for prefill, batch token generation, and training. The improvement is particularly prominent on A100, H100, and B200.
  • We've added CUDA 13.0 compatibility across Linux x86-64, Linux aarch64, and Windows x86-64 platforms.
    • Hardware support for CUDA 13.0 is limited to Turing generation and newer.
    • Support for Thor (SM110) is available in the Linux aarch64 build.

🚨 Breaking Changes

  • Dropped support for PyTorch 2.2. The new minimum requirement is 2.3.0.
  • Removed Maxwell GPU support for all CUDA builds.

What's Changed

New Contributors

Full Changelog: 0.47.0...0.48.0

0.47.0

11 Aug 18:59

Highlights:

  • FSDP2 compatibility for Params4bit (#1719)
  • Bugfix for 4bit quantization with large block sizes (#1721)
  • Further removal of previously deprecated code (#1669)
  • Improved CPU coverage (#1628)
  • Include NVIDIA Volta support in CUDA 12.8 and 12.9 builds (#1715)

What's Changed

New Contributors

Full Changelog: 0.46.0...0.47.0

0.46.1

02 Jul 19:45

What's Changed

New Contributors

Full Changelog: 0.46.0...0.46.1

0.46.0: torch.compile() support; custom ops refactor; Linux aarch64 wheels

27 May 21:27

Highlights

  • Support for torch.compile without graph breaks for LLM.int8().
    • Compatible with PyTorch 2.4+, but PyTorch 2.6+ is recommended.
    • Experimental CPU support is included.
  • Support torch.compile without graph breaks for 4bit.
    • Compatible with PyTorch 2.4+ for fullgraph=False.
    • Requires PyTorch 2.8 nightly for fullgraph=True.
  • We are now publishing wheels for CUDA Linux aarch64 (sbsa)!
    • Targets are Turing generation and newer: sm75, sm80, sm90, and sm100.
  • PyTorch Custom Operators refactoring and integration:
    • We have refactored most of the library code to integrate better with PyTorch via the torch.library and custom ops APIs. This helps enable our torch.compile and additional hardware compatibility efforts.
    • End-users do not need to change the way they are using bitsandbytes.
  • Unit tests have been cleaned up for increased determinism and most are now device-agnostic.
    • A new nightly CI runs unit tests for CPU (Windows x86-64, Linux x86-64/aarch64) and CUDA (Linux/Windows x86-64).

Compatibility Changes

  • Support for Python 3.8 is dropped.
  • Support for PyTorch < 2.2.0 is dropped.
  • CUDA 12.6 and 12.8 builds now target manylinux_2_24 (previously manylinux_2_34), broadening glibc compatibility.
  • Many APIs that were previously marked as deprecated have now been removed.
  • New deprecations:
    • bnb.autograd.get_inverse_transform_indices()
    • bnb.autograd.undo_layout()
    • bnb.functional.create_quantile_map()
    • bnb.functional.estimate_quantiles()
    • bnb.functional.get_colrow_absmax()
    • bnb.functional.get_row_absmax()
    • bnb.functional.histogram_scatter_add_2d()
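Code that still calls the newly deprecated functions should surface the warnings before the APIs are removed. A generic stdlib pattern for catching them in tests (the helper below is a hypothetical stand-in, not a bitsandbytes function):

```python
import warnings

def deprecated_helper():
    # Hypothetical stand-in for one of the deprecated functions above.
    warnings.warn("deprecated_helper() is deprecated",
                  DeprecationWarning, stacklevel=2)
    return 42

# Record warnings so a test suite can assert on them explicitly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    result = deprecated_helper()
```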

What's Changed

New Contributors

Full Changelog: 0.45.4...0.46.0

Multi-Backend Preview

19 May 13:24
5e267f5

Pre-release
continuous-release_multi-backend-refactor

update compute_type_is_set attr (#1623)

0.45.5

07 Apr 13:37

This is a minor release that affects CPU-only usage of bitsandbytes. The CPU build of the library was inadvertently omitted from the v0.45.4 wheels.

Full Changelog: 0.45.4...0.45.5