diff --git a/README.md b/README.md
index 8d524c5e7b..e744ea9815 100644
--- a/README.md
+++ b/README.md
@@ -213,6 +213,7 @@ We're also fortunate to be integrated into some of the leading open-source libra
 4. [TorchTune](https://pytorch.org/torchtune/main/tutorials/qlora_finetune.html?highlight=qlora) for our QLoRA and QAT recipes
 5. VLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html)
 6. SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
+7. Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
 
 ## Videos
 * [Keynote talk at GPU MODE IRL](https://youtu.be/FH5wiwOyPX4?si=VZK22hHz25GRzBG1&t=1009)
diff --git a/torchao/quantization/qat/README.md b/torchao/quantization/qat/README.md
index 42ff4e2567..eee1047199 100644
--- a/torchao/quantization/qat/README.md
+++ b/torchao/quantization/qat/README.md
@@ -115,11 +115,20 @@ To fake quantize embedding in addition to linear, you can additionally call
 the following with a filter function during the prepare step:
 
 ```
-from torchao.quantization.quant_api import _is_linear
+# first apply the QAT transformation to linear layers as above
+activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
+weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
+quantize_(
+    model,
+    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),
+)
+
+# then apply the weight-only transformation to embedding layers
+# activation fake quantization is not supported for embedding layers
 quantize_(
-    m,
-    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
-    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding) or _is_linear(m),
+    model,
+    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
+    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding)
 )
 ```
 
@@ -193,6 +202,19 @@ tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config ll
 
 For more detail, please refer to [this QAT tutorial](https://pytorch.org/torchtune/main/tutorials/qat_finetune.html).
 
+## Axolotl integration
+
+[Axolotl](https://github.com/axolotl-ai-cloud) uses torchao to support quantization-aware fine-tuning. You can use the following commands to fine-tune and then quantize a Llama-3.2-3B model:
+
+```bash
+axolotl train examples/llama-3/3b-qat-fsdp2.yaml
+# once training is complete, perform the quantization step
+axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml
+# you should now have a quantized model saved in ./outputs/qat_out/quantized
+```
+
+Please see the [QAT documentation](https://docs.axolotl.ai/docs/qat.html) in Axolotl for more details.
+
 ## Evaluation Results
 
 Evaluation was performed on 6-8 A100 GPUs (80GB each) using the torchtune QAT
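
The QAT snippet added in the `torchao/quantization/qat/README.md` hunk above only covers the prepare step (fake quantization during fine-tuning). After training, the fake-quantized modules are typically converted into actually quantized ones before evaluation or serving. The sketch below is illustrative only: it assumes torchao's config-based QAT API (`FromIntXQuantizationAwareTrainingConfig`) and the `int8_dynamic_activation_int4_weight` post-training quantization config that mirrors the prepare-time settings; check the convert section of the same README for the exact flow in your torchao version.

```python
# Sketch only: convert a QAT-prepared model after fine-tuning.
# Assumes torchao's config-based QAT API; verify names against your installed torchao version.
from torchao.quantization import int8_dynamic_activation_int4_weight, quantize_
from torchao.quantization.qat import FromIntXQuantizationAwareTrainingConfig

# swap fake-quantized linear modules back to regular nn.Linear
quantize_(model, FromIntXQuantizationAwareTrainingConfig())

# then apply real post-training quantization that mirrors the QAT configs above
# (int8 dynamic per-token activations + int4 grouped weights, group_size=32)
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```

The `group_size` passed here should match the `group_size` used in the prepare-time `weight_config`, so that the fake-quantization numerics seen during training line up with the final quantized weights.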