
[Bug]: AWQ for Gemma 3 fails to quantize, or fails to produce a viable model #2102

@FlorinAndrei

Description

⚙️ Your current environment

The output of python collect_env.py:
### Environment Information ###
Operating System: `Linux-6.14.0-1013-nvidia-aarch64-with-glibc2.39`
Python Version: `3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0]`
llm-compressor Version: `0.8.1`
compressed-tensors Version: `0.12.2`
transformers Version: `4.56.2`
torch Version: `2.8.0+cu129`
CUDA Devices: `['NVIDIA GB10']`
AMD Devices: `None`

🐛 Describe the bug

Trying to quantize google/gemma-3-12b-it with AWQ:

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import os

# from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.entrypoints import oneshot

MODEL_ID = "google/gemma-3-12b-it"
OUTPUT_DIR = "model_gemma3-12b-it-4bit"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

recipe = [
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]

"""
recipe = AWQModifier(
    ignore=["lm_head"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": False,
                "strategy": "group",
                # Changed from 128 to 16 to be divisible by 4304
                "group_size": 16,
            },
        }
    },
)
"""

# Calibration dataset - use a representative sample
CALIBRATION_DATASET = "ultrachat-200k"
# quick test
NUM_CALIBRATION_SAMPLES = 16
# full calibration
# NUM_CALIBRATION_SAMPLES = 128
MAX_SEQ_LENGTH = 2048

# Run quantization
oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset=CALIBRATION_DATASET,
    splits="train",
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    max_seq_length=MAX_SEQ_LENGTH,
    preprocessing_num_workers=os.cpu_count(),
    output_dir=OUTPUT_DIR,
)

# Save processor for completeness
processor.save_pretrained(OUTPUT_DIR)

print(f"Quantized model saved to {OUTPUT_DIR}")

With the first recipe (the one not commented out), the run fails at the very end of the quantization step with this error:

https://gist.github.com/FlorinAndrei/22a40707756318a5f7e23ec60daf4d2f
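
The group_size comment in the commented-out recipe points at a likely shape mismatch: W4A16_ASYM uses group quantization along each Linear layer's input dimension, so any layer whose in_features is not divisible by the group size (128 by default, as far as I can tell) would break. A quick diagnostic sketch, not part of the original run, that flags such layers:

import torch.nn as nn
from transformers import AutoModelForCausalLM

GROUP_SIZE = 128  # assumed default group size for the W4A16_ASYM scheme

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it", torch_dtype="auto"
)
for name, module in model.named_modules():
    # Flag Linear layers whose input dimension cannot be split into
    # whole groups of GROUP_SIZE elements.
    if isinstance(module, nn.Linear) and module.in_features % GROUP_SIZE:
        print(f"{name}: in_features={module.in_features} "
              f"not divisible by group_size={GROUP_SIZE}")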

With the second recipe (the commented-out one), quantization completes just fine, but loading the quantized model in MMLU-Pro (https://github.com/FlorinAndrei/MMLU-Pro) fails with this error:

https://gist.github.com/FlorinAndrei/8c28c26b8f0c5305dbcee5b75b13ac5d

MMLU-Pro has no problem loading the original Gemma 3 or fine-tuned versions of it.
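
To isolate whether the problem is the checkpoint itself or MMLU-Pro's loading path, a minimal load-and-generate check with plain transformers (an illustrative sketch, not something from the original report; OUTPUT_DIR matches the script above):

from transformers import AutoModelForCausalLM, AutoTokenizer

OUTPUT_DIR = "model_gemma3-12b-it-4bit"

# If this also fails, the quantized checkpoint itself is at fault
# rather than MMLU-Pro's loader.
model = AutoModelForCausalLM.from_pretrained(
    OUTPUT_DIR, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))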

🛠️ Steps to reproduce

Run the script above as-is.
