
v0.8.0

@jainapurva jainapurva released this 15 Jan 18:25

Highlights

We are excited to announce the 0.8.0 release of torchao! In this release we’ve shipped the first CUTLASS kernel in torchao, adding support for a W4A8 linear operator. We’ve also added TTFT (time-to-first-token) benchmarks to torchao and compared different quantization + sparsity speedups for prefill and decoding.

W4A8 based on CUTLASS

A new W4A8 linear operator is implemented, corresponding to int8_dynamic_activation_int4_weight quantization, in which two 4-bit weights are packed into a single 8-bit integer value. In addition, CUTLASS is now a submodule of the torchao repo, so that more of its functionality can be used to implement new kernels.
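To illustrate the packing scheme, here is a minimal pure-Python sketch of fitting two signed 4-bit weights into one byte. This is only a conceptual model: the actual CUTLASS kernel uses its own (typically interleaved) storage layout tuned for fast unpacking on GPU.

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (each in [-8, 7]) into one byte.

    Simplified illustration of W4A8 weight storage; the real kernel
    uses a layout optimized for GPU unpacking.
    """
    assert -8 <= lo <= 7 and -8 <= hi <= 7
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(b: int) -> tuple[int, int]:
    """Recover the two signed 4-bit values from a packed byte."""
    def to_signed(n: int) -> int:
        # Values 8..15 represent negatives in 4-bit two's complement.
        return n - 16 if n >= 8 else n
    return to_signed(b & 0xF), to_signed((b >> 4) & 0xF)


packed = pack_int4_pair(-3, 5)
assert unpack_int4_pair(packed) == (-3, 5)
```

Halving the bytes per weight is what shrinks the model size (3.31 GB vs 6.62 GB for int8 weights in the table below) at the cost of an unpacking step inside the kernel.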

Benchmarks on A100

| `-q` parameter | Average tokens/sec | Average bandwidth (GB/s) | Peak memory usage (GB) | Model size (GB) |
|---|---|---|---|---|
| (none) | 95.24 | 258.55 | 13.90 | 13.21 |
| `int8wo` | 155.31 | 1028.37 | 8.97 | 6.62 |
| `int4wo-32` | 186.70 | 774.98 | 5.31 | 4.15 |
| `int4wo-hqq` | 186.47 | 774.01 | 5.04 | 4.15 |
| `int8dq` | 49.64 | 328.72 | 9.44 | 6.62 |
| `w4a8-cutlass` (tuned) | 119.31 | 394.86 | 4.52 | 3.31 |

Prefill performance benchmarks

We’ve added TTFT benchmarks to torchao and compared different quantization + sparsity speedups for prefill and decoding. During prefill we are compute bound, and we find that dynamic quantization offers greater speedups there, while weight-only quantization is faster for the memory-bandwidth-bound decoding phase. We’ve also added an option for int8 quantization that selectively uses dynamic quantization during prefill and weight-only quantization during decoding.
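The compute-bound vs memory-bound split can be sketched with a back-of-the-envelope arithmetic-intensity calculation. The layer shape and token counts below are illustrative, not taken from the benchmark:

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for a (tokens, d_in) x (d_in, d_out) matmul.

    Weight traffic dominates for LLM linear layers at small batch sizes,
    so activation bytes are ignored in this rough sketch.
    """
    flops = 2 * tokens * d_in * d_out              # multiply-accumulates
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes


# Illustrative 4096 x 4096 layer with bf16 weights (2 bytes each).
prefill = arithmetic_intensity(tokens=2048, d_in=4096, d_out=4096, bytes_per_weight=2)
decode = arithmetic_intensity(tokens=1, d_in=4096, d_out=4096, bytes_per_weight=2)

# Prefill performs thousands of FLOPs per weight byte loaded (compute bound);
# decode reuses each weight only once (memory-bandwidth bound).
assert prefill == 2048 * decode
```

This is why speeding up the math (dynamic quantization, which computes in int8) pays off during prefill, while shrinking weight traffic (weight-only quantization) pays off during decoding.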


BC Breaking

Delete the float8-all-gather-only functionality from float8 training (#1451)

The use_fp8_all_gather_only flag was experimental and off by default; it was not publicized and, as far as we know, not used by anyone. We are removing it to simplify the code.

Before

```python
config = Float8LinearConfig(
    ...,
    # the option below is being removed
    use_fp8_all_gather_only=True,
)
convert_to_float8_training(model, config=config, ...)
```

After

The use_fp8_all_gather_only option is no longer supported.
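Migration is a one-line deletion. A sketch of the updated call, with the elided arguments carried over unchanged from the Before example:

```python
config = Float8LinearConfig(
    ...,
    # use_fp8_all_gather_only has been removed; no replacement is needed
)
convert_to_float8_training(model, config=config, ...)
```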

New Features

  • Add TTFT benchmarks + update sparsity benchmarks (#1140)
  • Gemlite integration in torchao (#1034)
  • W4A8 based on CUTLASS (#880)

Improvement

quantize_

  • Expose zero_point_domain as arguments (#1401)
  • Add convert path for quantize_ QAT API (#1540)
  • Int8 dynamic prefill weight only decode (#1436)

autoquant

  • Make int8 dynamic quant in autoquant serializable (#1484)
  • Additional fixes for autoquant serialization (#1486)
  • Add exhaustive config option to intmm kernel (#1392)

float8 training

  • [float8] Allow specifying arbitrary dtype for each tensor, enabling recipes with e4m3 in both the forward and the backward (#1378)

experimental

  • Remove temp build files from torchao (#1551)

other

  • Torchao setup.py with cmake (#1490)

Bug Fixes

  • Fix bfloat16/float16/float32 options (#1369)
  • Fix a bug in LinearActivationQuantizedTensor (#1400)
  • Fix error message in float8 FSDP utils (#1423)
  • Fixes observer attachment to model based on config for wanda sparsifier (#1265)
  • [resubmit] Gemlite fix (#1435)
  • 🐛 Fix: Memory leak in image processing endpoint (#1513)

Performance

  • [float8] Re-enable slow-accum in the bwd of axis-wise scaling schemes (#1377)

Documentation

  • Update api_ref_quantization.rst (#1408)
  • Update index.rst (#1409)
  • Update QAT READMEs using new APIs (#1541)

Developers

  • Pytorch/ao/torchao/experimental/ops/mps/test (#1442)
  • Verify that submodules are checked out (#1536)

Full Changelog: v0.7.0...v0.8.0-rc2