⚙️ Your current environment
The output of `python collect_env.py`:

```
### Environment Information ###
Operating System: `Linux-6.14.0-1013-nvidia-aarch64-with-glibc2.39`
Python Version: `3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]`
llm-compressor Version: `0.8.1`
compressed-tensors Version: `0.12.2`
transformers Version: `4.56.2`
torch Version: `2.8.0+cu129`
CUDA Devices: `['NVIDIA GB10']`
AMD Devices: `None`
```
🐛 Describe the bug
Trying to quantize `google/gemma-3-12b-it` with AWQ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import os
import torch

# from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.awq import AWQModifier, AWQMapping
from llmcompressor.entrypoints import oneshot

MODEL_ID = "google/gemma-3-12b-it"
OUTPUT_DIR = "model_gemma3-12b-it-4bit"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

recipe = [
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]

"""
recipe = AWQModifier(
    ignore=["lm_head"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": False,
                "strategy": "group",
                # Changed from 128 to 16 to be divisible by 4304
                "group_size": 16,
            },
        }
    },
)
"""

# Calibration dataset - use a representative sample
CALIBRATION_DATASET = "ultrachat-200k"

# quick test
NUM_CALIBRATION_SAMPLES = 16
# full calibration
# NUM_CALIBRATION_SAMPLES = 128

MAX_SEQ_LENGTH = 2048

# Run quantization
oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset=CALIBRATION_DATASET,
    splits="train",
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    max_seq_length=MAX_SEQ_LENGTH,
    preprocessing_num_workers=os.cpu_count(),
    output_dir=OUTPUT_DIR,
)

# Save the processor for completeness
processor.save_pretrained(OUTPUT_DIR)
print(f"Quantized model saved to {OUTPUT_DIR}")
```

If I try the first recipe (the one that is not commented out), I get this error at the very end of the quantization run:
https://gist.github.com/FlorinAndrei/22a40707756318a5f7e23ec60daf4d2f
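For context on the `group_size` comment in the commented-out recipe: with `"strategy": "group"`, the group size generally has to divide the weight dimension evenly. The 4304 mentioned in the comment factors as 2^4 × 269, so 16 is the largest power-of-two group size that divides it. A quick standalone check (just arithmetic, not part of the repro):

```python
# Which common group sizes divide 4304 evenly?
# 4304 = 2**4 * 269, so 16 is the largest power-of-two divisor.
dim = 4304
for group_size in (16, 32, 64, 128):
    ok = dim % group_size == 0
    print(f"group_size={group_size}: {'divides' if ok else 'does not divide'} {dim}")
# Only 16 divides 4304 evenly; 32, 64, and 128 all leave a remainder.
```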
If I try the other recipe (the one commented out), the quantization process completes just fine, but when I try to load the quantized model in MMLU-Pro (https://github.com/FlorinAndrei/MMLU-Pro) I get this error:
https://gist.github.com/FlorinAndrei/8c28c26b8f0c5305dbcee5b75b13ac5d
MMLU-Pro has no problem loading the original Gemma 3 or fine-tuned versions of it.
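To narrow down whether the failure is in MMLU-Pro or in the saved checkpoint itself, a minimal load check can be run with transformers alone (a sketch, assuming the `OUTPUT_DIR` produced by the script above):

```python
# Minimal sketch: load the quantized checkpoint with transformers only,
# independent of MMLU-Pro. OUTPUT_DIR matches the quantization script above.
from transformers import AutoModelForCausalLM, AutoTokenizer

OUTPUT_DIR = "model_gemma3-12b-it-4bit"

model = AutoModelForCausalLM.from_pretrained(
    OUTPUT_DIR,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)

# Inspect the quantization config that was written into config.json, if any
print(getattr(model.config, "quantization_config", None))
```

If this direct load fails the same way, the problem is in the saved checkpoint rather than in MMLU-Pro's loading code.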
🛠️ Steps to reproduce
Just run the code above.