
v0.8.0

@jainapurva jainapurva released this 15 Jan 18:25

Highlights

We are excited to announce the 0.8.0 release of torchao! In this release we’ve shipped the first CUTLASS kernel in torchao, adding support for a W4A8 linear operator. We’ve also added TTFT (time-to-first-token) benchmarks to torchao and compared different quantization + sparsity speedups for prefill and decoding.

W4A8 based on CUTLASS

A new W4A8 linear operator is implemented, corresponding to int8_dynamic_activation_int4_weight quantization, in which two 4-bit weights are packed into a single 8-bit integer value. In addition, CUTLASS is now a submodule of the torchao repo, so that more of its functionality can be used to implement new kernels.
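To illustrate the packing scheme, here is a minimal pure-Python sketch of fitting two signed 4-bit weights into one byte. This is only a conceptual model: the actual CUTLASS kernel uses its own (typically interleaved) storage layout tuned for fast unpacking on GPU.

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (each in [-8, 7]) into one byte.

    Simplified illustration of W4A8 weight storage; the real kernel
    uses a layout optimized for GPU unpacking.
    """
    assert -8 <= lo <= 7 and -8 <= hi <= 7
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(b: int) -> tuple[int, int]:
    """Recover the two signed 4-bit values from a packed byte."""
    def to_signed(n: int) -> int:
        # Values 8..15 represent negatives in 4-bit two's complement.
        return n - 16 if n >= 8 else n
    return to_signed(b & 0xF), to_signed((b >> 4) & 0xF)


packed = pack_int4_pair(-3, 5)
assert unpack_int4_pair(packed) == (-3, 5)
```

Halving the bytes per weight is what shrinks the model size (3.31 GB vs 6.62 GB for int8 weights in the table below) at the cost of an unpacking step inside the kernel.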

Benchmarks on A100

| `-q` parameter | Average tokens/sec | Average bandwidth (GB/s) | Peak memory usage (GB) | Model size (GB) |
|---|---|---|---|---|
| (none) | 95.24 | 258.55 | 13.90 | 13.21 |
| `int8wo` | 155.31 | 1028.37 | 8.97 | 6.62 |
| `int4wo-32` | 186.70 | 774.98 | 5.31 | 4.15 |
| `int4wo-hqq` | 186.47 | 774.01 | 5.04 | 4.15 |
| `int8dq` | 49.64 | 328.72 | 9.44 | 6.62 |
| `w4a8-cutlass` (tuned) | 119.31 | 394.86 | 4.52 | 3.31 |

Prefill performance benchmarks

We’ve added TTFT benchmarks to torchao and compared different quantization + sparsity speedups for prefill and decoding. During prefill we are compute bound, and we find that dynamic quantization offers greater speedups there, while weight-only quantization is faster for the memory-bandwidth-bound decoding phase. We’ve also added an option for int8 quantization that selectively uses dynamic quantization during prefill and weight-only quantization during decoding.
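The compute-bound vs memory-bound split can be sketched with a back-of-the-envelope arithmetic-intensity calculation. The layer shape and token counts below are illustrative, not taken from the benchmark:

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for a (tokens, d_in) x (d_in, d_out) matmul.

    Weight traffic dominates for LLM linear layers at small batch sizes,
    so activation bytes are ignored in this rough sketch.
    """
    flops = 2 * tokens * d_in * d_out              # multiply-accumulates
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes


# Illustrative 4096 x 4096 layer with bf16 weights (2 bytes each).
prefill = arithmetic_intensity(tokens=2048, d_in=4096, d_out=4096, bytes_per_weight=2)
decode = arithmetic_intensity(tokens=1, d_in=4096, d_out=4096, bytes_per_weight=2)

# Prefill performs thousands of FLOPs per weight byte loaded (compute bound);
# decode reuses each weight only once (memory-bandwidth bound).
assert prefill == 2048 * decode
```

This is why speeding up the math (dynamic quantization, which computes in int8) pays off during prefill, while shrinking weight traffic (weight-only quantization) pays off during decoding.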


BC Breaking

Delete the float8-all-gather-only functionality from float8 training (#1451)

The use_fp8_all_gather_only flag was experimental and off by default; it was not publicized and, as far as we know, not used by anyone. We are removing it to simplify the code.

Before

```python
config = Float8LinearConfig(
    ...,
    # the option below is being removed
    use_fp8_all_gather_only=True,
)
convert_to_float8_training(model, config=config, ...)
```

After

The use_fp8_all_gather_only option is no longer supported.
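Migration is a one-line deletion. A sketch of the updated call, with the elided arguments carried over unchanged from the Before example:

```python
config = Float8LinearConfig(
    ...,
    # use_fp8_all_gather_only has been removed; no replacement is needed
)
convert_to_float8_training(model, config=config, ...)
```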

New Features

  • Add TTFT benchmarks + update sparsity benchmarks (#1140)
  • Gemlite integration in torchao (#1034)
  • W4A8 based on CUTLASS (#880)

Improvement

quantize_

  • Expose zero_point_domain as arguments (#1401)
  • Add convert path for quantize_ QAT API (#1540)
  • Int8 dynamic prefill weight only decode (#1436)

autoquant

  • Make int8 dynamic quant in autoquant serializable (#1484)
  • Additional fixes for autoquant serialization (#1486)
  • Add exhaustive config option to intmm kernel (#1392)

float8 training

  • [float8] Allow specifying arbitrary dtype for each tensor, enabling recipes with e4m3 in both the forward and the backward (#1378)

experimental

  • Remove temp build files from torchao (#1551)

other

  • Torchao setup.py with cmake (#1490)

Bug Fixes

  • Fix bfloat16/float16/float32 options (#1369)
  • Fix a bug in LinearActivationQuantizedTensor (#1400)
  • Fix error message in float8 FSDP utils (#1423)
  • Fixes observer attachment to model based on config for wanda sparsifier (#1265)
  • [resubmit] Gemlite fix (#1435)
  • 🐛 Fix: Memory leak in image processing endpoint (#1513)

Performance

  • [float8] Re-enable slow-accum in the bwd of axis-wise scaling schemes (#1377)

Documentation

  • Update api_ref_quantization.rst (#1408)
  • Update index.rst (#1409)
  • Update QAT READMEs using new APIs (#1541)

Developers

  • Pytorch/ao/torchao/experimental/ops/mps/test (#1442)
  • Verify that submodules are checked out (#1536)

Full Changelog: v0.7.0...v0.8.0-rc2