
[Bug]: Llama 3.1 8B Instruct W2A16 Quantization leads to Model Collapse (repeating "!") despite using "Best" settings #1342

@manfeilong

Problem Description

I am encountering severe model collapse when quantizing the Llama 3.1 8B Instruct model to 2-bit (W2A16) using AutoRound.
Despite using the "Best" configuration (iters=1000, nsamples=512, enable_alg_ext), the resulting model loses all language capabilities and outputs repeating exclamation marks (e.g., !!!!!!!!!!!!!!!!!!!) when loaded in vLLM.
Expected Behavior:
Since Llama 2 (7B/13B) retained some reasoning capability under 2-bit quantization in previous benchmarks, I expected Llama 3.1 8B to at least generate coherent text.
Current Behavior:

  • The model outputs repeating characters (garbage) when served with vLLM (see the sketch after this list).
  • GSM8K score is exactly 0.
  • However, the same setup works perfectly for 4-bit (W4A16), achieving 76.6% on GSM8K. This confirms my environment and basic workflow are correct.
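For completeness, here is a minimal sketch of how the symptom shows up through the vLLM Python API. The model path is the output_dir from the reproduction command below; the prompt is just an illustrative example, and any input produces the same degenerate output:

from vllm import LLM, SamplingParams

# Load the 2-bit checkpoint produced by the command in "Reproduction Steps".
llm = LLM(model="./llama3.1-8B-Instruct-W2A16-Best")
params = SamplingParams(temperature=0.0, max_tokens=64)

# Hypothetical prompt for illustration only.
outputs = llm.generate(["What is 2 + 2?"], params)
print(outputs[0].outputs[0].text)  # prints "!!!!!!!!!!..." instead of an answer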

Reproduction Steps

Run the AutoRound quantization with the following "Best" settings for 2-bit:

auto_round \
  --model_name "./llama3.1-8B-Instruct" \
  --bits 2 \
  --group_size 128 \
  --format "auto_round" \
  --iters 1000 \
  --nsamples 512 \
  --seqlen 2048 \
  --batch_size 8 \
  --minmax_lr 2e-3 \
  --enable_alg_ext \
  --output_dir "./llama3.1-8B-Instruct-W2A16-Best"
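For reference, a rough Python-API equivalent of the command above (a sketch only: the keyword names mirror the CLI flags and may differ slightly across AutoRound versions, and I have not confirmed an API counterpart for --enable_alg_ext):

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_path = "./llama3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Same "Best" 2-bit settings as the CLI invocation above.
autoround = AutoRound(
    model,
    tokenizer,
    bits=2,
    group_size=128,
    iters=1000,
    nsamples=512,
    seqlen=2048,
    batch_size=8,
    minmax_lr=2e-3,
    # --enable_alg_ext was passed on the CLI; whether it maps to an API
    # keyword is version-dependent, so it is omitted here.
)
autoround.quantize_and_save("./llama3.1-8B-Instruct-W2A16-Best", format="auto_round")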

Environment Information

  • Model: Llama 3.1 8B Instruct
  • AutoRound Version: 0.9.4
  • Inference Engine: vLLM
  • GPU: RTX 6000 Ada
  • CUDA Version: 12.8

Error Logs

No response

Additional Context

No response
