⚙️ Your current environment
Environment Information
Operating System: Linux-6.8.0-100-generic-x86_64-with-glibc2.39
Python Version: 3.12.11 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:09:17) [GCC 11.2.0]
llm-compressor Version: 0.10.0.1
compressed-tensors Version: 0.14.0.1
transformers Version: 4.57.6
torch Version: 2.10.0
CUDA Devices: ['NVIDIA A100-SXM4-80GB', 'NVIDIA A100-SXM4-80GB']
AMD Devices: None
NPU Devices: None
🐛 Describe the bug
AWQ W4A16 quantization via llm-compressor produces catastrophically degraded output quality on Gemma 3-based models (tested on google/medgemma-27b-it). Accuracy drops from 85.6% (FP16 baseline) to 69.2% on the CareQA medical benchmark, a roughly 16-point degradation. In contrast, GPTQ W4A16 on the same model retains 83.6%, indicating the issue is specific to AWQ's scaling mechanism rather than a general quantization sensitivity of the model.
#2500 (comment)
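For context on why AWQ in particular could be sensitive here: AWQ folds per-input-channel scales (chosen from activation statistics) into the weights before rounding, and absorbs the inverse scales into the preceding layer, so a bad scale choice directly corrupts the quantized weights. The following is a toy NumPy sketch of that idea, not llm-compressor's implementation; the scale heuristic and the symmetric 4-bit rounding are simplified assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))      # linear weight, shape [out, in]
x = rng.normal(size=(8,))        # one activation vector
s = np.abs(x) ** 0.5 + 1e-6      # toy activation-aware per-channel scales

def quantize_4bit(w):
    """Symmetric round-to-nearest 4-bit quantization (per tensor)."""
    step = np.abs(w).max() / 7
    return np.clip(np.round(w / step), -8, 7) * step

y_ref = W @ x                                 # FP reference
y_plain = quantize_4bit(W) @ x                # plain W4 rounding
y_awq = quantize_4bit(W * s) @ (x / s)        # AWQ-style: scale weights up,
                                              # inputs down, then round

print("plain W4 error:", np.abs(y_plain - y_ref).mean())
print("AWQ-style error:", np.abs(y_awq - y_ref).mean())
```

Without rounding, the scaling is mathematically a no-op (`(W * s) @ (x / s) == W @ x`); all the difference comes from how the scales interact with 4-bit rounding, which is exactly the mechanism this report suspects is misbehaving for Gemma 3.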
🛠️ Steps to reproduce
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier, AWQMapping
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select calibration dataset
DATASET_ID = "FreedomIntelligence/medical-o1-reasoning-SFT"
DATASET_SPLIT = "train"
language = "en"

# Select model and load it.
MODEL_ID = "google/medgemma-27b-text-it"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Increasing the number of calib samples to 256 or higher can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, language, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)
print(f"Dataset size: {len(ds)} samples")

# Preprocess: apply chat template and tokenize properly
def preprocess(example):
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": example["Question"]}],
        tokenize=False,
        # add_generation_prompt=True,
    )
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Gemma 3 specific mappings (GQA-aware: skip v_proj -> o_proj due to dimension mismatch)
gemma3_mappings = [
    # 1. Smooth the input norm against Q, K, V
    AWQMapping(
        smooth_layer="re:.*input_layernorm$",
        balance_layers=["re:.*q_proj$", "re:.*k_proj$", "re:.*v_proj$"],
    ),
    AWQMapping("re:.*v_proj$", ["re:.*o_proj$"]),
    # 2. Smooth the MLP using the pre-feedforward norm
    AWQMapping(
        smooth_layer="re:.*pre_feedforward_layernorm$",
        balance_layers=["re:.*gate_proj$", "re:.*up_proj$"],
    ),
    # 3. Smooth up_proj against down_proj
    AWQMapping(
        smooth_layer="re:.*up_proj$",
        balance_layers=["re:.*down_proj$"],
    ),
]

# Initialize the modifier with these mappings
recipe = [
    AWQModifier(
        targets=["Linear"],
        scheme="W4A16_ASYM",
        ignore=["lm_head"],
        # mappings=gemma3_mappings,
        duo_scaling="both",
    )
]

"""
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
"""

# Run quantization algorithm
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
```
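As a side check that the `re:` patterns in `gemma3_mappings` resolve to the intended modules, they can be matched offline against a Gemma 3 decoder layer's submodule names. The name list below is a hand-written assumption based on the transformers Gemma 3 modeling code (verify against `model.named_modules()` on a real checkpoint); the `re:` prefix handling mirrors how llm-compressor marks regex targets.

```python
import re

# Assumed submodule names of one Gemma 3 decoder layer.
module_names = [
    "model.layers.0.input_layernorm",
    "model.layers.0.post_attention_layernorm",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.pre_feedforward_layernorm",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]

def resolve(target, names):
    # A leading "re:" marks a regex target; otherwise match the name literally.
    if target.startswith("re:"):
        pattern = target[len("re:"):]
    else:
        pattern = re.escape(target) + "$"
    return [n for n in names if re.match(pattern, n)]

for target in ["re:.*input_layernorm$", "re:.*v_proj$", "re:.*up_proj$"]:
    print(target, "->", resolve(target, module_names))
```

Note that `re:.*input_layernorm$` matches only `input_layernorm` and not `post_attention_layernorm`, so the first mapping does not accidentally smooth against the wrong norm.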