Description
Hi team,
I'm trying to quantize a heavily fine-tuned Qwen3 4B (for agent tool-use) with AutoRound (W4A16), but I'm getting severe performance degradation.
The Problem:
FP16 model works great.
W4A16 model fails (1:4 win-rate vs FP16) and its outputs are ~25% shorter (e.g., avg. 120 tokens -> 90 tokens), indicating premature stopping (a quick length check I used is sketched below).
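For reference, here is roughly how I measured the length gap, as a minimal sketch (the model paths and prompts are placeholders, not my real eval harness):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_generated_length(model_path, prompts, max_new_tokens=512):
    """Greedy-decode each prompt and average the number of newly generated tokens."""
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype="auto", device_map="auto"
    )
    lengths = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Count only the newly generated tokens, not the prompt.
        lengths.append(out[0].shape[0] - inputs["input_ids"].shape[1])
    return sum(lengths) / len(lengths)

# prompts = [...]  # the same agent tool-use prompts for both models
# print(avg_generated_length("path/to/Qwen3-4B-sft-fp16", prompts))
# print(avg_generated_length("path/to/Qwen3-4B-sft-w4a16", prompts))
```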
My Diagnosis:
The model has systemic activation outliers (e.g., Max Abs Value > 5000 in several mlp.down_proj layers), which are caused by the heavy SFT.
Calibration on the prefill (prompt) side seems critical: using "Pure Answer" (response-only) data to calibrate makes performance even worse.
Crucially, my main fix failed: I tried selective quantization (using quant_config to keep the top-4 worst outlier layers at bits=16), but it did not help; performance was still just as bad (1:4 win-rate). Rough sketches of the outlier scan, the calibration-data format, and the selective-quantization config are below.
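For context, the outlier numbers above came from a hook-based scan along these lines (a minimal sketch; the model path is a placeholder and `calibration_texts` stands in for a few of my real agent prompts):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/Qwen3-4B-sft-fp16"  # placeholder
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)

calibration_texts = ["<a few representative agent tool-use prompts>"]  # placeholder
max_abs = {}

def make_hook(name):
    def hook(module, inputs, output):
        # inputs[0] is the activation fed into down_proj
        val = inputs[0].abs().max().item()
        max_abs[name] = max(max_abs.get(name, 0.0), val)
    return hook

handles = [
    m.register_forward_hook(make_hook(n))
    for n, m in model.named_modules()
    if n.endswith("mlp.down_proj")
]

with torch.no_grad():
    for text in calibration_texts:
        ids = tok(text, return_tensors="pt").to(model.device)
        model(**ids)

for h in handles:
    h.remove()

# Layers with the largest input activations, worst first
for name, v in sorted(max_abs.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}: max |x| = {v:.1f}")
```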
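By "Pure Answer" vs. prefill-style calibration I mean the difference between calibrating on response text alone and calibrating on full chat-templated prompts that include the tool schemas. The latter looks roughly like this (a sketch: the `tools` argument to apply_chat_template needs a recent transformers, and `my_agent_traces` / `my_tool_schemas` are placeholders):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/Qwen3-4B-sft-fp16")  # placeholder

def build_prefill_calib_texts(messages_list, tools):
    """Render full prompts (system + tool schemas + user turn) exactly as seen at prefill."""
    texts = []
    for messages in messages_list:
        texts.append(
            tok.apply_chat_template(
                messages,
                tools=tools,
                add_generation_prompt=True,
                tokenize=False,
            )
        )
    return texts

# calib_texts = build_prefill_calib_texts(my_agent_traces, my_tool_schemas)
# These texts are then used as the calibration set (AutoRound's `dataset`
# argument accepts custom data in the versions I've looked at).
```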
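And the selective-quantization attempt looked roughly like this, as a sketch rather than my exact script (I'm writing the per-layer override as `layer_config`, which is what recent AutoRound versions call it, but the name may differ in yours; the layer indices are examples, not my real top-4):

```python
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/Qwen3-4B-sft-fp16"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
tok = AutoTokenizer.from_pretrained(model_path)

# Keep the worst outlier layers (from the scan above) in 16-bit
skip_layers = {
    f"model.layers.{i}.mlp.down_proj": {"bits": 16}
    for i in (20, 24, 30, 33)  # example indices
}

autoround = AutoRound(
    model,
    tok,
    bits=4,
    group_size=128,
    layer_config=skip_layers,  # per-layer override; everything else is W4
)
autoround.quantize()
autoround.save_quantized("Qwen3-4B-sft-w4a16-selective", format="auto_round")
```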
My Questions:
- How can I address the sharp decline in output length?
- Does this failure (even after skipping layers) prove that W4A16 (AutoRound/AWQ) is fundamentally unsuited for models with systemic SFT-induced activation outliers?
Thanks!