Description
Hi team,
I'm trying to quantize a heavily fine-tuned Qwen3 4B (for agent tool-use) with AutoRound (W4A16), but I'm getting severe performance degradation.
The Problem:
FP16 model works great.
W4A16 model fails (1:4 win-rate vs FP16) and its outputs are ~25% shorter (e.g., avg. 120 tokens -> 90 tokens), indicating premature stopping (a quick length check I used is sketched below).
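For reference, here is roughly how I measured the length gap, as a minimal sketch (the model paths and prompts are placeholders, not my real eval harness):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_generated_length(model_path, prompts, max_new_tokens=512):
    """Greedy-decode each prompt and average the number of newly generated tokens."""
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype="auto", device_map="auto"
    )
    lengths = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Count only the newly generated tokens, not the prompt.
        lengths.append(out[0].shape[0] - inputs["input_ids"].shape[1])
    return sum(lengths) / len(lengths)

# prompts = [...]  # the same agent tool-use prompts for both models
# print(avg_generated_length("path/to/Qwen3-4B-sft-fp16", prompts))
# print(avg_generated_length("path/to/Qwen3-4B-sft-w4a16", prompts))
```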
My Diagnosis:
The model has systemic activation outliers (e.g., Max Abs Value > 5000 in several mlp.down_proj layers), which are caused by the heavy SFT.
Calibration on the prefill (prompt) side seems critical: using "Pure Answer" (response-only) data to calibrate makes performance even worse.
Crucially, my main fix failed: I tried selective quantization (using quant_config to keep the top-4 worst outlier layers at bits=16), but it did not help; performance was still just as bad (1:4 win-rate). Rough sketches of the outlier scan, the calibration-data format, and the selective-quantization config are below.
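For context, the outlier numbers above came from a hook-based scan along these lines (a minimal sketch; the model path is a placeholder and `calibration_texts` stands in for a few of my real agent prompts):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/Qwen3-4B-sft-fp16"  # placeholder
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)

calibration_texts = ["<a few representative agent tool-use prompts>"]  # placeholder
max_abs = {}

def make_hook(name):
    def hook(module, inputs, output):
        # inputs[0] is the activation fed into down_proj
        val = inputs[0].abs().max().item()
        max_abs[name] = max(max_abs.get(name, 0.0), val)
    return hook

handles = [
    m.register_forward_hook(make_hook(n))
    for n, m in model.named_modules()
    if n.endswith("mlp.down_proj")
]

with torch.no_grad():
    for text in calibration_texts:
        ids = tok(text, return_tensors="pt").to(model.device)
        model(**ids)

for h in handles:
    h.remove()

# Layers with the largest input activations, worst first
for name, v in sorted(max_abs.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}: max |x| = {v:.1f}")
```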
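By "Pure Answer" vs. prefill-style calibration I mean the difference between calibrating on response text alone and calibrating on full chat-templated prompts that include the tool schemas. The latter looks roughly like this (a sketch: the `tools` argument to apply_chat_template needs a recent transformers, and `my_agent_traces` / `my_tool_schemas` are placeholders):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/Qwen3-4B-sft-fp16")  # placeholder

def build_prefill_calib_texts(messages_list, tools):
    """Render full prompts (system + tool schemas + user turn) exactly as seen at prefill."""
    texts = []
    for messages in messages_list:
        texts.append(
            tok.apply_chat_template(
                messages,
                tools=tools,
                add_generation_prompt=True,
                tokenize=False,
            )
        )
    return texts

# calib_texts = build_prefill_calib_texts(my_agent_traces, my_tool_schemas)
# These texts are then used as the calibration set (AutoRound's `dataset`
# argument accepts custom data in the versions I've looked at).
```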
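And the selective-quantization attempt looked roughly like this, as a sketch rather than my exact script (I'm writing the per-layer override as `layer_config`, which is what recent AutoRound versions call it, but the name may differ in yours; the layer indices are examples, not my real top-4):

```python
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/Qwen3-4B-sft-fp16"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
tok = AutoTokenizer.from_pretrained(model_path)

# Keep the worst outlier layers (from the scan above) in 16-bit
skip_layers = {
    f"model.layers.{i}.mlp.down_proj": {"bits": 16}
    for i in (20, 24, 30, 33)  # example indices
}

autoround = AutoRound(
    model,
    tok,
    bits=4,
    group_size=128,
    layer_config=skip_layers,  # per-layer override; everything else is W4
)
autoround.quantize()
autoround.save_quantized("Qwen3-4B-sft-w4a16-selective", format="auto_round")
```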
My Questions:
- How can I address the sharp decline in output length?
- Does this failure (even after skipping layers) prove that W4A16 (AutoRound/AWQ) is fundamentally unsuited for models with systemic SFT-induced activation outliers?
Thanks!