Your current environment
- vLLM version: v0.22.0+cu129
- GPU: NVIDIA L40S (46 GB VRAM)
- CUDA: 12.8 (driver 570.124.06)
- Python: 3.12
- Model: google/gemma-4-12B-it-qat-w4a16-ct (compressed-tensors W4A16)
🐛 Describe the bug
When serving google/gemma-4-12B-it-qat-w4a16-ct with vllm serve on v0.22.0, the engine crashes on the first real forward pass with a Marlin GEMM shape mismatch:
RuntimeError: Shape mismatch: a.size(1) = 4096, size_k = 8192
torch.ops._C.marlin_gemm.default(
reinterpret_tensor(arg0_1, (570, 4096), ...),
size_m=570, size_n=3840, size_k=8192, ...
)
The crash is identical with and without --enforce-eager, and the traceback contains no torch.compile frames.
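The invariant that fails can be stated in a few lines. This is a sketch, not the Marlin kernel's code: `check_gemm_shapes` is a hypothetical helper that only mirrors the check implied by the error message, namely that the activation's inner dimension must equal the size_k the layer was built with.

```python
def check_gemm_shapes(a_shape: tuple[int, int], size_k: int) -> None:
    """Raise if the activation's inner dimension disagrees with size_k.

    Hypothetical helper: it mirrors the error message, not the kernel source.
    """
    _, k = a_shape
    if k != size_k:
        raise RuntimeError(f"Shape mismatch: a.size(1) = {k}, size_k = {size_k}")


# Values taken from the traceback in this report.
try:
    check_gemm_shapes((570, 4096), size_k=8192)
except RuntimeError as exc:
    print(exc)
```

The activation is (570, 4096), so any layer that reports size_k=8192 was constructed with the wrong input width, which points at layer construction rather than at the input data.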
Root cause (updated after deeper investigation)
vLLM v0.22.0 has no native Gemma4UnifiedForConditionalGeneration support: the architecture was added in PR #44429, which is not in v0.22.0 or v0.22.1. vLLM therefore falls back to the generic TransformersMultiModalForCausalLM wrapper (confirmed in logs: WARNING: TransformersMultiModalForCausalLM has no VLLM implementation, falling back to Transformers implementation).
The generic Transformers wrapper in vllm/model_executor/models/transformers/base.py unconditionally applies quant_config to every nn.Linear it finds, including vision_embedder.patch_dense. The checkpoint's compressed-tensors ignore list (model.vision_embedder.patch_dense) never matches because the qual_name the wrapper assigns differs from the ignore pattern.
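To illustrate how a prefix difference defeats the ignore list, here is a minimal sketch. It assumes a simplified exact-match rule; the real compressed-tensors matching may be more permissive. The wrapper-assigned name below is hypothetical, since this report does not show the actual name.

```python
def is_ignored(layer_name: str, ignore: list[str]) -> bool:
    """Simplified model: a layer is skipped only on an exact name match."""
    return layer_name in ignore


# Ignore entry as quoted from the checkpoint config in this report.
ignore = ["model.vision_embedder.patch_dense"]

# HYPOTHETICAL name assigned by the wrapper; the real prefix is not shown here.
wrapper_name = "model.model.vision_embedder.patch_dense"

print(is_ignored("model.vision_embedder.patch_dense", ignore))  # True
print(is_ignored(wrapper_name, ignore))  # False
```

With any rule of this kind, a single extra or missing prefix component is enough for the layer to fall through and receive the quantized linear method.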
Result: patch_dense gets quantized even though the checkpoint contains plain unquantized weights for it → wrong packed weight shape → size_k=8192 (double the real 4096) passed to marlin_gemm.
This is the same underlying issue as PR #44571 ([Bugfix] Exclude vision embedder from quantization in Gemma4 Unified), but PR #44571 patches the native gemma4_unified.py introduced in PR #44429 — code that doesn't exist in v0.22.0. The Transformers fallback path in v0.22.0 hits the same bug through a different code location.
Fix
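The fix is in main via:
- PR #44429: native Gemma4UnifiedForConditionalGeneration support, bypassing the Transformers fallback entirely
- PR #44571: patch_dense prefix/quant propagation in the native implementation
Neither is included in v0.22.0 or v0.22.1 (the v0.22.1 release notes confirm it contains only unrelated fixes).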
Workaround for v0.22.0 users
Hotpatch the native implementation from main into the installed venv:
VENV_MODELS=<your-venv>/lib/python3.12/site-packages/vllm/model_executor/models
# Copy native Gemma4Unified implementation (includes #44429 + #44571)
# from a local clone of vllm main branch
cp vllm/model_executor/models/gemma4_unified.py "$VENV_MODELS/"
# Register in registry.py (add after Gemma4ForConditionalGeneration entry)
# "Gemma4UnifiedForConditionalGeneration": ("gemma4_unified", "Gemma4UnifiedForConditionalGeneration"),
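The registry edit can also be scripted. The sketch below assumes the existing Gemma4ForConditionalGeneration entry sits on a single line ending in `),`; if it spans several lines in your install, edit the file by hand instead. `add_registry_entry` is a hypothetical helper, not part of vLLM.

```python
NEW_ENTRY = (
    '"Gemma4UnifiedForConditionalGeneration": '
    '("gemma4_unified", "Gemma4UnifiedForConditionalGeneration"),'
)


def add_registry_entry(
    source: str, anchor: str = '"Gemma4ForConditionalGeneration"'
) -> str:
    """Insert NEW_ENTRY after the first line containing `anchor`.

    Assumes the anchor entry is a single line ending in "),"; raises otherwise.
    """
    if '"Gemma4UnifiedForConditionalGeneration"' in source:
        return source  # already registered; nothing to do
    lines = source.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if anchor in line:
            if not line.rstrip().endswith("),"):
                raise ValueError("anchor entry spans several lines; edit manually")
            if not line.endswith("\n"):
                lines[i] = line + "\n"
            # Reuse the anchor line's indentation for the new entry.
            indent = line[: len(line) - len(line.lstrip())]
            lines.insert(i + 1, f"{indent}{NEW_ENTRY}\n")
            return "".join(lines)
    raise ValueError(f"anchor {anchor!r} not found")
```

To apply it, read registry.py from the directory in VENV_MODELS, pass its text through `add_registry_entry`, and write the result back. Keep a backup copy of the original file, since this modifies an installed package in place.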
To reproduce
vllm serve google/gemma-4-12B-it-qat-w4a16-ct \
--host 0.0.0.0 --port 8002 \
--gpu-memory-utilization 0.38 \
--max-model-len 8192 \
--limit-mm-per-prompt '{"image": 0, "audio": 0}'
Expected behavior
Model loads and serves correctly once Gemma4UnifiedForConditionalGeneration is registered natively (bypassing the Transformers fallback that incorrectly quantizes the vision embedder).