
After performing quantization on llama3_2 using tune run quantize --config custom_quantization.yaml, how should I proceed with inference? #2935

@Begoogh

Description


I am a beginner. While using the quantize.py script to convert and quantize a model trained with Quantization-Aware Training (QAT), I ran into an issue: the saved quantized checkpoint contains only the weights, not the model structure. As a result, when I load it into the original llama3_2 model structure, the checkpoint parameters do not match the structure. After I manually added the missing quantization keys, such as 'layers.0.attn.q_proj.scales' and 'layers.0.attn.k_proj.scales', so that the parameters fully matched, the inference output became garbled. The model worked correctly before the quantization conversion.

What should I do in this case, or could this indicate that my QAT training failed? I followed this documentation to perform the quantization: https://meta-pytorch.org/torchtune/main/tutorials/qat_finetune.html
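For reference, my understanding from the tutorial is that the quantized checkpoint is meant to be loaded into a model that has first been converted by the same quantizer used at quantize time, so the module structure already contains the 'scales' buffers, rather than adding those keys by hand. Below is a minimal sketch of that loading flow; the model builder, groupsize, and checkpoint filename here are assumptions on my part (the groupsize should match whatever is in custom_quantization.yaml):

```python
import torch
from torchtune.models.llama3_2 import llama3_2_1b
from torchtune.training.quantization import Int8DynActInt4WeightQuantizer

# Build the original float model structure first.
# llama3_2_1b is assumed here; use the builder matching your checkpoint.
model = llama3_2_1b()

# Apply the same quantizer used by `tune run quantize` so the module
# structure gains the quantized layers (and their 'scales' buffers)
# before the state dict is loaded. groupsize=256 is an assumption
# based on the tutorial's example config.
quantizer = Int8DynActInt4WeightQuantizer(groupsize=256)
model = quantizer.quantize(model)

# Now checkpoint keys such as 'layers.0.attn.q_proj.scales' should
# line up with the quantized module structure. The filename below is
# a placeholder for whatever `tune run quantize` produced.
state_dict = torch.load("quantized_model.pt", map_location="cpu")
model.load_state_dict(state_dict, assign=True)
model.eval()
```

Is this the intended flow, or is there a recommended recipe (e.g. `tune run generate` with a quantizer section in the config) that I should use instead?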
