Highlights
We are excited to announce the 0.8.0 release of torchao! In this release we’ve shipped the first CUTLASS kernel in torchAO, which adds support for a W4A8 linear operator. In addition, we’ve added TTFT (time to first token) benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding.
W4A8 based on CUTLASS
A new W4A8 linear operator is implemented. It corresponds to int8_dynamic_activation_int4_weight quantization, where two 4-bit weights are packed into a single 8-bit integer value. CUTLASS is also now a sub-module of the torchao repo, so that more of its functionality can be used to implement new kernels.
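The packing idea mentioned above can be illustrated with a minimal pure-Python sketch. The helper names and the nibble order (first weight in the low 4 bits) are assumptions made for illustration only; the layout actually used by the CUTLASS kernel may differ.

```python
# Hypothetical helpers illustrating how two signed 4-bit weights can share
# one byte. The nibble order (first weight in the low nibble) is an
# assumption, not necessarily the layout the torchao kernel uses.

def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (range -8..7) into one byte (0..255)."""
    for value in (lo, hi):
        if not -8 <= value <= 7:
            raise ValueError(f"{value} does not fit in a signed 4-bit integer")
    # Masking with 0xF keeps the two's-complement low 4 bits of each value.
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(byte: int) -> tuple:
    """Recover the two signed 4-bit values from a packed byte."""
    def sign_extend(nibble: int) -> int:
        # Nibbles 8..15 represent the negative values -8..-1.
        return nibble - 16 if nibble >= 8 else nibble

    return sign_extend(byte & 0xF), sign_extend((byte >> 4) & 0xF)


weights = [-8, 7, 3, -1]
packed = [pack_int4_pair(weights[i], weights[i + 1])
          for i in range(0, len(weights), 2)]
unpacked = [w for byte in packed for w in unpack_int4_pair(byte)]

print([hex(b) for b in packed])  # four int4 weights stored in two bytes
print(unpacked)                  # round-trips back to the original weights
```

Packing halves the storage needed for the weights relative to int8, which is consistent with the smaller model size reported for the W4A8 configuration in the benchmarks.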
Benchmarks on A100
-q parameter | Average tokens/sec | Average Bandwidth in GB/s | Peak Memory Usage in GB | Model Size in GB |
---|---|---|---|---|
(none) | 95.24 | 1258.55 | 13.90 | 13.21 |
-q int8wo | 155.31 | 1028.37 | 8.97 | 6.62 |
-q int4wo-32 | 186.70 | 774.98 | 5.31 | 4.15 |
-q int4wo-hqq | 186.47 | 774.01 | 5.04 | 4.15 |
-q int8dq | 49.64 | 328.72 | 9.44 | 6.62 |
-q w4a8-cutlass (tuned) | 119.31 | 394.86 | 4.52 | 3.31 |
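To make the table easier to compare at a glance, the sketch below computes each configuration's throughput and model-size ratio relative to the row with no -q parameter. Treating that row as the unquantized baseline is an assumption; the numbers are copied from the table and the helper name is hypothetical.

```python
# Figures copied from the A100 benchmark table above.
# Assumption: the row without a -q parameter is the unquantized baseline.
BASELINE_TOKENS_PER_SEC = 95.24
BASELINE_MODEL_SIZE_GB = 13.21

# name -> (average tokens/sec, model size in GB)
rows = {
    "int8wo": (155.31, 6.62),
    "int4wo-32": (186.70, 4.15),
    "int4wo-hqq": (186.47, 4.15),
    "int8dq": (49.64, 6.62),
    "w4a8-cutlass (tuned)": (119.31, 3.31),
}


def relative_to_baseline(tokens_per_sec: float, size_gb: float) -> tuple:
    """Return (throughput speedup, model size compression) vs. the baseline."""
    return (tokens_per_sec / BASELINE_TOKENS_PER_SEC,
            BASELINE_MODEL_SIZE_GB / size_gb)


summary = {name: relative_to_baseline(*values) for name, values in rows.items()}

for name, (speedup, compression) in summary.items():
    print(f"{name:22s} {speedup:.2f}x tokens/sec, {compression:.2f}x smaller")
```

A ratio below 1.0 in the first column means the configuration decodes more slowly than the baseline in this benchmark.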
Prefill performance benchmarks
We’ve added TTFT (time to first token) benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding. Prefill is compute bound, and we find that dynamic quantization offers greater speedups there than weight-only quantization, which is faster for decoding. We’ve also added an option for int8 dynamic quantization that selectively uses dynamic quantization during prefill and weight-only quantization during LLM decoding.
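As a rough illustration of what a TTFT measurement separates, the sketch below times a token stream and reports time to first token (dominated by prefill) apart from decode throughput. It is not the torchao benchmark script; `measure_generation` and `fake_model` are hypothetical names, and the fake model simply sleeps to simulate latency.

```python
import time
from typing import Iterable, Iterator


def measure_generation(token_stream: Iterable[int]) -> dict:
    """Time a token stream, separating TTFT from decode throughput."""
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in token_stream:
        if first_token_time is None:
            # The first token arrives only after prefill has finished.
            first_token_time = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    if first_token_time is None:
        raise ValueError("token stream produced no tokens")

    decode_time = end - first_token_time
    decode_tokens = n_tokens - 1  # tokens generated after the first one
    return {
        "ttft_s": first_token_time - start,
        "tokens": n_tokens,
        "decode_tokens_per_sec": (
            decode_tokens / decode_time
            if decode_tokens > 0 and decode_time > 0 else float("nan")
        ),
    }


def fake_model(prefill_s: float = 0.05, per_token_s: float = 0.005,
               n_tokens: int = 8) -> Iterator[int]:
    """Stand-in for an LLM: one slow prefill step, then fast decode steps."""
    time.sleep(prefill_s)          # simulated compute-bound prefill
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)  # simulated per-token decode latency
        yield i


stats = measure_generation(fake_model())
print(stats)
```

Reporting the two numbers separately is what makes it possible to see that a scheme can help prefill while hurting decoding, or the reverse.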
BC Breaking
Delete the float8-all-gather-only functionality from float8 training (#1451)
The use_fp8_all_gather_only option was an experimental flag, off by default, that was not advertised and, as far as we know, not used by anyone. We are removing it to simplify the code.
Before
config = Float8LinearConfig(
    ...,
    # the option below is being removed
    use_fp8_all_gather_only=True,
)
convert_to_float8_training(model, config=config, ...)
After
The use_fp8_all_gather_only option is no longer supported.
New Features
- Add TTFT benchmarks + update sparsity benchmarks (#1140)
- Gemlite integration in torchao (#1034)
- W4A8 based on CUTLASS (#880)
Improvement
quantize_
- Expose zero_point_domain as arguments (#1401)
- Add convert path for quantize_ QAT API (#1540)
- Int8 dynamic prefill weight only decode (#1436)
autoquant
- Make int8 dynamic quant in autoquant serializable (#1484)
- Additional fixes for autoquant serialization (#1486)
- Add exhaustive config option to intmm kernel (#1392)
float8 training
- [float8] Allow specifying arbitrary dtype for each tensor, enabling recipes with e4m3 in both the forward and the backward (#1378)
experimental
- Remove temp build files from torchao (#1551)
other
- Torchao setup.py with cmake (#1490)
Bug Fixes
- Fix bfloat16/float16/float32 options (#1369)
- Fix a bug in LinearActivationQuantizedTensor (#1400)
- Fix error message in float8 FSDP utils (#1423)
- Fixes observer attachment to model based on config for wanda sparsifier (#1265)
- [resubmit] Gemlite fix (#1435)
- 🐛 Fix: Memory leak in image processing endpoint (#1513)
Performance
- [float8] Re-enable slow-accum in the bwd of axis-wise scaling schemes (#1377)
Documentation
- Update api_ref_quantization.rst (#1408)
- Update index.rst (#1409)
- Update QAT READMEs using new APIs (#1541)
Developers
New Contributors
- @sanchitintel made their first contribution in #1375
- @philipbutler made their first contribution in #1337
- @airMeng made their first contribution in #1401
- @DerekLiu35 made their first contribution in #1299
- @agrawal-aka made their first contribution in #1265
- @gmagogsfm made their first contribution in #1443
- @dongxiaolong made their first contribution in #1513
Full Changelog: v0.7.0...v0.8.0-rc2