⚙️ Your current environment
Environment Information
Operating System: Linux-6.8.0-100-generic-x86_64-with-glibc2.39
Python Version: 3.12.11 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:09:17) [GCC 11.2.0]
llm-compressor Version: 0.10.0.1
compressed-tensors Version: 0.14.0.1
transformers Version: 4.57.6
torch Version: 2.10.0
CUDA Devices: ['NVIDIA A100-SXM4-80GB', 'NVIDIA A100-SXM4-80GB']
AMD Devices: None
NPU Devices: None
🐛 Describe the bug
AWQ W4A16 quantization via llm-compressor produces catastrophically degraded output quality on Gemma 3-based models (tested on google/medgemma-27b-it). Accuracy drops from 85.6% (FP16 baseline) to 69.2% on the CareQA medical benchmark, a roughly 16-point degradation. In contrast, GPTQ W4A16 on the same model retains 83.6%, indicating the issue is specific to AWQ's scaling mechanism rather than a general quantization sensitivity of the model.
#2500 (comment)
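For context on why AWQ in particular could be sensitive here: AWQ folds per-input-channel scales (chosen from activation statistics) into the weights before rounding, and absorbs the inverse scales into the preceding layer, so a bad scale choice directly corrupts the quantized weights. The following is a toy NumPy sketch of that idea, not llm-compressor's implementation; the scale heuristic and the symmetric 4-bit rounding are simplified assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))      # linear weight, shape [out, in]
x = rng.normal(size=(8,))        # one activation vector
s = np.abs(x) ** 0.5 + 1e-6      # toy activation-aware per-channel scales

def quantize_4bit(w):
    """Symmetric round-to-nearest 4-bit quantization (per tensor)."""
    step = np.abs(w).max() / 7
    return np.clip(np.round(w / step), -8, 7) * step

y_ref = W @ x                                 # FP reference
y_plain = quantize_4bit(W) @ x                # plain W4 rounding
y_awq = quantize_4bit(W * s) @ (x / s)        # AWQ-style: scale weights up,
                                              # inputs down, then round

print("plain W4 error:", np.abs(y_plain - y_ref).mean())
print("AWQ-style error:", np.abs(y_awq - y_ref).mean())
```

Without rounding, the scaling is mathematically a no-op (`(W * s) @ (x / s) == W @ x`); all the difference comes from how the scales interact with 4-bit rounding, which is exactly the mechanism this report suspects is misbehaving for Gemma 3.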
🛠️ Steps to reproduce
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier, AWQMapping
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select calibration dataset
DATASET_ID = "FreedomIntelligence/medical-o1-reasoning-SFT"
DATASET_SPLIT = "train"
language = "en"

# Select model and load it.
MODEL_ID = "google/medgemma-27b-text-it"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Increasing the number of calib samples to 256 or higher can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, language, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)
print(f"Dataset size: {len(ds)} samples")

# Preprocess: apply chat template and tokenize properly
def preprocess(example):
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": example["Question"]}],
        tokenize=False,
        # add_generation_prompt=True,
    )
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Gemma 3 specific mappings (GQA-aware: skip v_proj -> o_proj due to dimension mismatch)
gemma3_mappings = [
    # 1. Smooth the input norm against Q, K, V
    AWQMapping(
        smooth_layer="re:.*input_layernorm$",
        balance_layers=["re:.*q_proj$", "re:.*k_proj$", "re:.*v_proj$"],
    ),
    AWQMapping("re:.*v_proj$", ["re:.*o_proj$"]),
    # 2. Smooth the MLP using the pre-feedforward norm
    AWQMapping(
        smooth_layer="re:.*pre_feedforward_layernorm$",
        balance_layers=["re:.*gate_proj$", "re:.*up_proj$"],
    ),
    # 3. Smooth up_proj against down_proj
    AWQMapping(
        smooth_layer="re:.*up_proj$",
        balance_layers=["re:.*down_proj$"],
    ),
]

# Initialize the modifier with these mappings
recipe = [
    AWQModifier(
        targets=["Linear"],
        scheme="W4A16_ASYM",
        ignore=["lm_head"],
        # mappings=gemma3_mappings,
        duo_scaling="both",
    )
]

"""
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
"""

# Run quantization algorithm
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
```
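As a side check that the `re:` patterns in `gemma3_mappings` resolve to the intended modules, they can be matched offline against a Gemma 3 decoder layer's submodule names. The name list below is a hand-written assumption based on the transformers Gemma 3 modeling code (verify against `model.named_modules()` on a real checkpoint); the `re:` prefix handling mirrors how llm-compressor marks regex targets.

```python
import re

# Assumed submodule names of one Gemma 3 decoder layer.
module_names = [
    "model.layers.0.input_layernorm",
    "model.layers.0.post_attention_layernorm",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.pre_feedforward_layernorm",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]

def resolve(target, names):
    # A leading "re:" marks a regex target; otherwise match the name literally.
    if target.startswith("re:"):
        pattern = target[len("re:"):]
    else:
        pattern = re.escape(target) + "$"
    return [n for n in names if re.match(pattern, n)]

for target in ["re:.*input_layernorm$", "re:.*v_proj$", "re:.*up_proj$"]:
    print(target, "->", resolve(target, module_names))
```

Note that `re:.*input_layernorm$` matches only `input_layernorm` and not `post_attention_layernorm`, so the first mapping does not accidentally smooth against the wrong norm.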