[Bug]: marlin_gemm shape mismatch (size_k doubled) for google/gemma-4-12B-it-qat-w4a16-ct on vLLM v0.22.0

## Your current environment

- **vLLM version**: v0.22.0+cu129
- **GPU**: NVIDIA L40S (46 GB VRAM)
- **CUDA**: 12.8 (driver 570.124.06)
- **Python**: 3.12
- **Model**: google/gemma-4-12B-it-qat-w4a16-ct (compressed-tensors W4A16)

## 🐛 Describe the bug

When serving `google/gemma-4-12B-it-qat-w4a16-ct` with `vllm serve` on v0.22.0, the engine crashes on the first real forward pass with a Marlin GEMM shape mismatch:

```
RuntimeError: Shape mismatch: a.size(1) = 4096, size_k = 8192
  torch.ops._C.marlin_gemm.default(
      reinterpret_tensor(arg0_1, (570, 4096), ...),
      size_m=570, size_n=3840, size_k=8192, ...
  )
```

Crashes identically with and without `--enforce-eager` (no torch.compile frames in traceback).

## Root cause (updated after deeper investigation)

**vLLM v0.22.0 has no native `Gemma4UnifiedForConditionalGeneration` support** — the architecture was added in PR #44429 which is not in v0.22.0 or v0.22.1. vLLM falls back to the generic `TransformersMultiModalForCausalLM` wrapper (confirmed in logs: `WARNING: TransformersMultiModalForCausalLM has no VLLM implementation, falling back to Transformers implementation`).

The generic Transformers wrapper in `vllm/model_executor/models/transformers/base.py` unconditionally applies `quant_config` to **every** `nn.Linear` it finds, including `vision_embedder.patch_dense`. The checkpoint's compressed-tensors ignore list (`model.vision_embedder.patch_dense`) never matches because the qual_name the wrapper assigns differs from the ignore pattern.

Result: `patch_dense` gets quantized when the checkpoint contains plain unquantized weights for it → wrong packed weight shape → `size_k=8192` (doubled from the real 4096) passed to `marlin_gemm`.

This is the same underlying issue as PR #44571 (`[Bugfix] Exclude vision embedder from quantization in Gemma4 Unified`), but PR #44571 patches the **native** `gemma4_unified.py` introduced in PR #44429 — code that doesn't exist in v0.22.0. The Transformers fallback path in v0.22.0 hits the same bug through a different code location.

## Fix

The fix is in `main` via:
- **PR #44429**: adds native `Gemma4UnifiedForConditionalGeneration` support, bypassing the Transformers fallback entirely
- **PR #44571**: fixes the `patch_dense` prefix/quant propagation in the native implementation

Neither is included in v0.22.0 or v0.22.1 (v0.22.1 release notes confirm it only contains unrelated fixes).

## Workaround for v0.22.0 users

Hotpatch the native implementation from `main` into the installed venv:

```bash
VENV_MODELS=<your-venv>/lib/python3.12/site-packages/vllm/model_executor/models

# Copy native Gemma4Unified implementation (includes #44429 + #44571)
# from a local clone of vllm main branch
cp vllm/model_executor/models/gemma4_unified.py $VENV_MODELS/

# Register in registry.py (add after Gemma4ForConditionalGeneration entry)
# "Gemma4UnifiedForConditionalGeneration": ("gemma4_unified", "Gemma4UnifiedForConditionalGeneration"),
```

## To reproduce

```bash
vllm serve google/gemma-4-12B-it-qat-w4a16-ct \
  --host 0.0.0.0 --port 8002 \
  --gpu-memory-utilization 0.38 \
  --max-model-len 8192 \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}'
```

## Expected behavior

Model loads and serves correctly once `Gemma4UnifiedForConditionalGeneration` is registered natively (bypassing the Transformers fallback that incorrectly quantizes the vision embedder).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug]: marlin_gemm shape mismatch (size_k doubled) for google/gemma-4-12B-it-qat-w4a16-ct on vLLM v0.22.0 #44796

Your current environment

🐛 Describe the bug

Root cause (updated after deeper investigation)

Fix

Workaround for v0.22.0 users

To reproduce

Expected behavior

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: marlin_gemm shape mismatch (size_k doubled) for google/gemma-4-12B-it-qat-w4a16-ct on vLLM v0.22.0 #44796

Description

Your current environment

🐛 Describe the bug

Root cause (updated after deeper investigation)

Fix

Workaround for v0.22.0 users

To reproduce

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions