Problem Description
I am encountering severe model collapse when quantizing the Llama 3.1 8B Instruct model to 2-bit (W2A16) using AutoRound.
Despite using the "Best" configuration (iters=1000, nsamples=512, enable_alg_ext), the resulting model loses all language capability and outputs only repeating exclamation marks (e.g., !!!!!!!!!!!!!!!!!!!) when loaded in vLLM.
Expected Behavior:
In previous benchmarks, Llama 2 (7B/13B) retained some reasoning capability under 2-bit quantization, so I expected Llama 3.1 8B to at least generate coherent text.
Current Behavior:
- The model outputs only repeating characters (garbage); a minimal generation check is sketched below.
- The GSM8K score is exactly 0.
- However, the same setup works perfectly at 4-bit (W4A16), achieving 76.6% on GSM8K, which confirms that my environment and basic workflow are correct.
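The repeating-character output can be confirmed with a minimal offline vLLM generation check. The snippet below is only a sketch: the checkpoint path and prompt are placeholders, not taken from the original report.

python -c "
from vllm import LLM, SamplingParams

# Load the quantized checkpoint (path is a placeholder)
llm = LLM(model='./llama3.1-8B-Instruct-W2A16-Best')

# Greedy decoding on a trivial prompt
outputs = llm.generate(['What is 2 + 2?'], SamplingParams(temperature=0, max_tokens=32))

# With the collapsed 2-bit model this prints a run of exclamation marks
print(outputs[0].outputs[0].text)
"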
Reproduction Steps
Run the AutoRound quantization with the following "Best" settings for 2-bit:
auto_round \
--model_name "./llama3.1-8B-Instruct" \
--bits 2 \
--group_size 128 \
--format "auto_round" \
--iters 1000 \
--nsamples 512 \
--seqlen 2048 \
--batch_size 8 \
--minmax_lr 2e-3 \
--enable_alg_ext \
--output_dir "./llama3.1-8B-Instruct-W2A16-Best"
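For context, the GSM8K numbers above can be reproduced by scoring the resulting checkpoint with lm-evaluation-harness on top of vLLM. The command below is a sketch of such a run; the checkpoint path, memory setting, and few-shot count are assumptions, not taken from the original report.

# Score the quantized checkpoint on GSM8K using the vLLM backend of lm-evaluation-harness
lm_eval \
  --model vllm \
  --model_args pretrained=./llama3.1-8B-Instruct-W2A16-Best,gpu_memory_utilization=0.85 \
  --tasks gsm8k \
  --num_fewshot 5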
Environment Information
- Model: Llama 3.1 8B Instruct
- AutoRound Version: 0.9.4
- Inference Engine: vLLM
- GPU: RTX 6000 Ada
- CUDA Version: 12.8
Error Logs
Additional Context
No response