[Bug]: marlin_gemm shape mismatch (size_k doubled) for google/gemma-4-12B-it-qat-w4a16-ct on vLLM v0.22.0 #44796

@anouar-bm

Your current environment

  • vLLM version: v0.22.0+cu129
  • GPU: NVIDIA L40S (46 GB VRAM)
  • CUDA: 12.8 (driver 570.124.06)
  • Python: 3.12
  • Model: google/gemma-4-12B-it-qat-w4a16-ct (compressed-tensors W4A16)

🐛 Describe the bug

When serving google/gemma-4-12B-it-qat-w4a16-ct with vllm serve on v0.22.0, the engine crashes on the first real forward pass with a Marlin GEMM shape mismatch:

RuntimeError: Shape mismatch: a.size(1) = 4096, size_k = 8192
  torch.ops._C.marlin_gemm.default(
      reinterpret_tensor(arg0_1, (570, 4096), ...),
      size_m=570, size_n=3840, size_k=8192, ...
  )

Crashes identically with and without --enforce-eager (no torch.compile frames in traceback).

Root cause (updated after deeper investigation)

vLLM v0.22.0 has no native Gemma4UnifiedForConditionalGeneration support — the architecture was added in PR #44429 which is not in v0.22.0 or v0.22.1. vLLM falls back to the generic TransformersMultiModalForCausalLM wrapper (confirmed in logs: WARNING: TransformersMultiModalForCausalLM has no VLLM implementation, falling back to Transformers implementation).

The generic Transformers wrapper in vllm/model_executor/models/transformers/base.py unconditionally applies quant_config to every nn.Linear it finds, including vision_embedder.patch_dense. The checkpoint's compressed-tensors ignore list contains model.vision_embedder.patch_dense, but that entry never matches because the qualified module name the wrapper assigns differs from the ignore pattern.
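
The name mismatch can be illustrated with a small sketch. This is not vLLM's actual matching code: the helpers is_ignored_exact and is_ignored_suffix, and the wrapper-side name model.model.vision_embedder.patch_dense, are hypothetical and chosen only to show how an extra prefix can defeat an exact-name ignore check.

```python
# Illustrative only: this is NOT vLLM's real ignore-list logic.

# Ignore pattern from the checkpoint, as quoted in this issue.
IGNORE = ["model.vision_embedder.patch_dense"]

# Hypothetical name the fallback wrapper might assign (assumption, not from logs).
WRAPPER_NAME = "model.model.vision_embedder.patch_dense"


def is_ignored_exact(module_name: str, ignore: list[str]) -> bool:
    """Return True only when the module name equals an ignore entry exactly."""
    return module_name in ignore


def _tail_match(longer: str, shorter: str) -> bool:
    """True if `longer` equals `shorter` or ends with it on a dot boundary."""
    return longer == shorter or longer.endswith("." + shorter)


def is_ignored_suffix(module_name: str, ignore: list[str]) -> bool:
    """Prefix-tolerant variant: match if either name is a dotted tail of the other."""
    return any(
        _tail_match(module_name, pattern) or _tail_match(pattern, module_name)
        for pattern in ignore
    )


print(is_ignored_exact(WRAPPER_NAME, IGNORE))   # False: the layer would be quantized
print(is_ignored_suffix(WRAPPER_NAME, IGNORE))  # True: the layer would be skipped
```

Under these assumptions, an exact comparison lets patch_dense slip through to quantization, while a prefix-tolerant comparison would skip it. Whether the real fix uses this kind of matching is not stated here.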

Result: patch_dense is set up as a quantized layer even though the checkpoint contains plain unquantized weights for it. The packed weight shape is therefore wrong, and size_k=8192 (double the real 4096) is passed to marlin_gemm.

This is the same underlying issue as PR #44571 ([Bugfix] Exclude vision embedder from quantization in Gemma4 Unified), but PR #44571 patches the native gemma4_unified.py introduced in PR #44429 — code that doesn't exist in v0.22.0. The Transformers fallback path in v0.22.0 hits the same bug through a different code location.

Fix

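The fix is in main via PR #44429 (native Gemma4UnifiedForConditionalGeneration support) and PR #44571 ([Bugfix] Exclude vision embedder from quantization in Gemma4 Unified).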

Neither is included in v0.22.0 or v0.22.1 (v0.22.1 release notes confirm it only contains unrelated fixes).

Workaround for v0.22.0 users

Hotpatch the native implementation from main into the installed venv:

VENV_MODELS=<your-venv>/lib/python3.12/site-packages/vllm/model_executor/models

# Copy native Gemma4Unified implementation (includes #44429 + #44571)
# from a local clone of vllm main branch
cp vllm/model_executor/models/gemma4_unified.py "$VENV_MODELS/"

# Register in registry.py (add after Gemma4ForConditionalGeneration entry)
# "Gemma4UnifiedForConditionalGeneration": ("gemma4_unified", "Gemma4UnifiedForConditionalGeneration"),
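
After applying the steps above, a quick sanity check can confirm that both pieces of the hotpatch are in place. The following is a minimal sketch that only inspects files on disk. The helper names registry_mentions and hotpatch_looks_applied are hypothetical, and the check assumes the registry entry is written with a double-quoted key, as in the comment above.

```python
from pathlib import Path

ARCH = "Gemma4UnifiedForConditionalGeneration"


def registry_mentions(registry_path, arch: str = ARCH) -> bool:
    """Return True if a non-comment line of registry.py contains the quoted arch name."""
    text = Path(registry_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # a commented-out entry does not register anything
        if f'"{arch}"' in stripped:
            return True
    return False


def hotpatch_looks_applied(models_dir, arch: str = ARCH) -> bool:
    """Check that gemma4_unified.py was copied and registry.py has an entry for it."""
    models = Path(models_dir)
    return (models / "gemma4_unified.py").is_file() and registry_mentions(
        models / "registry.py", arch
    )
```

Call hotpatch_looks_applied with the same directory used for VENV_MODELS. A True result only means the files look right; it does not prove the model loads.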

To reproduce

vllm serve google/gemma-4-12B-it-qat-w4a16-ct \
  --host 0.0.0.0 --port 8002 \
  --gpu-memory-utilization 0.38 \
  --max-model-len 8192 \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}'

Expected behavior

Model loads and serves correctly once Gemma4UnifiedForConditionalGeneration is registered natively (bypassing the Transformers fallback that incorrectly quantizes the vision embedder).
