
[Question] How to address the sharp decline in output length #984

@blue-whale00

Hi team,

I'm trying to quantize a heavily fine-tuned Qwen3 4B (for agent tool use) with AutoRound (W4A16), but I'm getting severe performance degradation.

The Problem:

- The FP16 model works great.
- The W4A16 model fails (1:4 win rate vs FP16) and its output is ~25% shorter (e.g., avg. 120 tokens -> 90 tokens), indicating premature stopping; I measure this roughly as sketched below.
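
For reference, this is roughly how I measure the length drop (a minimal sketch; the model paths and the `prompts` list are placeholders for my actual eval set):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_new_tokens(model_id, prompts, max_new_tokens=512):
    """Average number of generated tokens over a fixed prompt set (greedy decoding)."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
    total = 0
    for p in prompts:
        ids = tok.apply_chat_template(
            [{"role": "user", "content": p}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
        total += out.shape[-1] - ids.shape[-1]  # count only newly generated tokens
    return total / len(prompts)

prompts = ["<agent tool-use eval prompt>"]  # placeholder for my eval prompts
print(avg_new_tokens("path/to/fp16-model", prompts))   # ~120 tokens on my set
print(avg_new_tokens("path/to/w4a16-model", prompts))  # ~90 tokens on my set
```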

My Diagnosis:

- The model has systemic activation outliers (e.g., max abs value > 5000 in several mlp.down_proj layers), which are caused by the heavy SFT.
- The prefill (prompt) stage is critical; calibrating on "pure answer" data makes performance even worse.
- Crucially, my main fix failed: I tried "selective quantization" (using quant_config to set bits=16 for the top 4 worst outlier layers); see the sketch after this list. This did not help, and performance was still just as bad (1:4 win rate).
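
Here is roughly what I did for the outlier scan and the selective quantization (a simplified sketch: the model path and calibration texts are placeholders, I hook the inputs of mlp.down_proj because that's where I see the spikes, and the `layer_config` / `dataset` arguments reflect my understanding of recent auto-round releases, so please correct me if that usage is wrong):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "path/to/my-finetuned-qwen3-4b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# --- 1) Scan for activation outliers at the mlp.down_proj inputs ---
max_abs = {}

def make_hook(name):
    def hook(module, inputs, output):
        # track the largest absolute input activation seen by this layer
        val = inputs[0].detach().abs().max().item()
        max_abs[name] = max(max_abs.get(name, 0.0), val)
    return hook

handles = [
    m.register_forward_hook(make_hook(n))
    for n, m in model.named_modules()
    if n.endswith("mlp.down_proj")
]

calib_texts = ["<full tool-use prompt + answer>"]  # placeholders for my SFT-style samples
with torch.no_grad():
    for text in calib_texts:
        model(**tokenizer(text, return_tensors="pt"))
for h in handles:
    h.remove()

# Several layers report max abs values > 5000 here
worst4 = sorted(max_abs, key=max_abs.get, reverse=True)[:4]

# --- 2) Selective quantization: keep the 4 worst layers at 16 bits ---
layer_config = {name: {"bits": 16} for name in worst4}

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    dataset=calib_texts,        # prompt+answer samples, not "pure answer" data
    layer_config=layer_config,  # leave the top-4 outlier layers unquantized
)
autoround.quantize()
autoround.save_quantized("qwen3-4b-w4a16-selective", format="auto_round")
```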

My Questions:

1. How can I address the sharp decline in output length?
2. Does this failure (even after skipping layers) prove that W4A16 (AutoRound/AWQ) is fundamentally unsuited for models with systemic SFT-induced activation outliers?

Thanks!
