
[Bug]: AWQ for Gemma 3 fails to quantize, or fails to produce a viable model #2102

@FlorinAndrei

Description

⚙️ Your current environment

The output of python collect_env.py:
### Environment Information ###
Operating System: `Linux-6.14.0-1013-nvidia-aarch64-with-glibc2.39`
Python Version: `3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0]`
llm-compressor Version: `0.8.1`
compressed-tensors Version: `0.12.2`
transformers Version: `4.56.2`
torch Version: `2.8.0+cu129`
CUDA Devices: `['NVIDIA GB10']`
AMD Devices: `None`

🐛 Describe the bug

Trying to quantize google/gemma-3-12b-it with AWQ:

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import os

# from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.entrypoints import oneshot

MODEL_ID = "google/gemma-3-12b-it"
OUTPUT_DIR = "model_gemma3-12b-it-4bit"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

recipe = [
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]

"""
recipe = AWQModifier(
    ignore=["lm_head"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": False,
                "strategy": "group",
                # Changed from 128 to 16 to be divisible by 4304
                "group_size": 16,
            },
        }
    },
)
"""

# Calibration dataset - use a representative sample
CALIBRATION_DATASET = "ultrachat-200k"
# quick test
NUM_CALIBRATION_SAMPLES = 16
# full calibration
# NUM_CALIBRATION_SAMPLES = 128
MAX_SEQ_LENGTH = 2048

# Run quantization
oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset=CALIBRATION_DATASET,
    splits="train",
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    max_seq_length=MAX_SEQ_LENGTH,
    preprocessing_num_workers=os.cpu_count(),
    output_dir=OUTPUT_DIR,
)

# Save processor for completeness
processor.save_pretrained(OUTPUT_DIR)

print(f"Quantized model saved to {OUTPUT_DIR}")

With the first recipe (the one not commented out), the run fails at the very end of the quantization step with this error:

https://gist.github.com/FlorinAndrei/22a40707756318a5f7e23ec60daf4d2f
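
The group_size comment in the commented-out recipe points at a likely shape mismatch: W4A16_ASYM uses group quantization along each Linear layer's input dimension, so any layer whose in_features is not divisible by the group size (128 by default, as far as I can tell) would break. A quick diagnostic sketch, not part of the original run, that flags such layers:

import torch.nn as nn
from transformers import AutoModelForCausalLM

GROUP_SIZE = 128  # assumed default group size for the W4A16_ASYM scheme

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it", torch_dtype="auto"
)
for name, module in model.named_modules():
    # Flag Linear layers whose input dimension cannot be split into
    # whole groups of GROUP_SIZE elements.
    if isinstance(module, nn.Linear) and module.in_features % GROUP_SIZE:
        print(f"{name}: in_features={module.in_features} "
              f"not divisible by group_size={GROUP_SIZE}")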

With the second recipe (the commented-out one), quantization completes just fine, but loading the quantized model in MMLU-Pro (https://github.com/FlorinAndrei/MMLU-Pro) fails with this error:

https://gist.github.com/FlorinAndrei/8c28c26b8f0c5305dbcee5b75b13ac5d

MMLU-Pro has no problem loading the original Gemma 3 or fine-tuned versions of it.
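
To isolate whether the problem is the checkpoint itself or MMLU-Pro's loading path, a minimal load-and-generate check with plain transformers (an illustrative sketch, not something from the original report; OUTPUT_DIR matches the script above):

from transformers import AutoModelForCausalLM, AutoTokenizer

OUTPUT_DIR = "model_gemma3-12b-it-4bit"

# If this also fails, the quantized checkpoint itself is at fault
# rather than MMLU-Pro's loader.
model = AutoModelForCausalLM.from_pretrained(
    OUTPUT_DIR, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))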

🛠️ Steps to reproduce

Run the script above as-is.
