[Bug]: Evaluate AWQ and GPTQ for Gemma #2522

@Jeevi10

Description

⚙️ Your current environment

Environment Information

Operating System: Linux-6.8.0-100-generic-x86_64-with-glibc2.39
Python Version: 3.12.11 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:09:17) [GCC 11.2.0]
llm-compressor Version: 0.10.0.1
compressed-tensors Version: 0.14.0.1
transformers Version: 4.57.6
torch Version: 2.10.0
CUDA Devices: ['NVIDIA A100-SXM4-80GB', 'NVIDIA A100-SXM4-80GB']
AMD Devices: None
NPU Devices: None

🐛 Describe the bug

AWQ W4A16 quantization via llm-compressor produces catastrophically degraded output quality on Gemma 3-based models (tested on google/medgemma-27b-it). Accuracy drops from 85.6% (FP16 baseline) to 69.2% on the CareQA medical benchmark — a ~16 point degradation. In contrast, GPTQ W4A16 on the same model retains 83.6%, demonstrating that the issue is specific to AWQ's scaling mechanism, not a general quantization sensitivity.
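For reference, the deltas implied by the figures above (accuracy values taken directly from this report):

```python
# CareQA accuracy figures from the report (higher is better).
fp16_baseline = 85.6
awq_w4a16 = 69.2
gptq_w4a16 = 83.6

awq_drop = round(fp16_baseline - awq_w4a16, 1)    # degradation under AWQ
gptq_drop = round(fp16_baseline - gptq_w4a16, 1)  # degradation under GPTQ
print(awq_drop, gptq_drop)  # 16.4 2.0
```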

#2500 (comment)

🛠️ Steps to reproduce

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier, AWQMapping
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select calibration dataset
DATASET_ID = "FreedomIntelligence/medical-o1-reasoning-SFT"
DATASET_SPLIT = "train"
language = "en"



# Select model and load it.
MODEL_ID = "google/medgemma-27b-text-it"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Calibration configuration; using more samples (256 or higher) can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, language, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)
print(f"Dataset size: {len(ds)} samples")


# Preprocess: apply chat template and tokenize properly
def preprocess(example):
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": example["Question"]}],
        tokenize=False,
        #add_generation_prompt=True,
    )
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(preprocess, remove_columns=ds.column_names)

# Gemma 3 specific mappings. GQA note: the v_proj -> o_proj mapping below can hit
# a dimension mismatch (fewer KV heads than query heads) and may need to be skipped.
gemma3_mappings = [
    # 1. Smooth Input Norm against Q, K, V
    AWQMapping(
        smooth_layer="re:.*input_layernorm$",
        balance_layers=["re:.*q_proj$", "re:.*k_proj$", "re:.*v_proj$"]
    ),


    # Smooth v_proj against o_proj (see GQA note above; skip if dimensions mismatch)
    AWQMapping("re:.*v_proj$", ["re:.*o_proj$"]),
    # 2. Smooth the MLP using the pre-feedforward norm
    AWQMapping(
        smooth_layer="re:.*pre_feedforward_layernorm$",
        balance_layers=["re:.*gate_proj$", "re:.*up_proj$"]
    ),
    # 3. Smooth Up-proj against Down-proj
    AWQMapping(
        smooth_layer="re:.*up_proj$",
        balance_layers=["re:.*down_proj$"]
    ),
]
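The GQA caveat noted in the mappings can be made concrete with a quick shape check. The head counts below are illustrative assumptions, not values read from the actual medgemma-27b config:

```python
# Why smoothing v_proj against o_proj can fail under grouped-query
# attention (GQA): AWQ scales v_proj's output channels and inversely
# scales o_proj's input channels, so the two dimensions must agree.
num_heads = 32       # query heads (illustrative)
num_kv_heads = 16    # key/value heads -- fewer than query heads under GQA
head_dim = 128

v_proj_out = num_kv_heads * head_dim  # output channels of v_proj
o_proj_in = num_heads * head_dim      # input channels of o_proj

mismatch = v_proj_out != o_proj_in
print(v_proj_out, o_proj_in, mismatch)  # 2048 4096 True
```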

# Initialize the modifier with these mappings
recipe = [
    AWQModifier(
        targets=["Linear"],
        scheme="W4A16_ASYM",
        ignore=["lm_head"],
        # mappings=gemma3_mappings,  # custom Gemma 3 mappings above; commented out, so defaults are used
        duo_scaling="both"
    )
]
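The `re:` patterns in the mappings and recipe resolve against full module paths. As a quick sanity check, they can be exercised against representative Gemma 3 decoder-layer names (the paths below are assumptions based on transformers' Gemma 3 layer layout; confirm with `model.named_modules()` on the real checkpoint):

```python
import re

# Assumed Gemma 3 decoder-layer module paths for one layer.
names = [
    "model.layers.0.input_layernorm",
    "model.layers.0.post_attention_layernorm",
    "model.layers.0.pre_feedforward_layernorm",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]

def matched(pattern: str) -> list[str]:
    """Return the module paths a `re:`-style pattern selects."""
    return [n for n in names if re.match(pattern, n)]

# input_layernorm must not also catch the other *_layernorm modules
assert matched(r".*input_layernorm$") == ["model.layers.0.input_layernorm"]
# the QKV balance patterns should pick exactly three projections
qkv = matched(r".*q_proj$") + matched(r".*k_proj$") + matched(r".*v_proj$")
print(len(qkv))  # 3
```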

"""
# Alternative recipe: FP8 dynamic quantization via QuantizationModifier
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
"""


# Run the quantization algorithm.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
