
SM120 (Bug, With Fix) (RTX Blackwell) NVFP4 MoE: CUTLASS Grouped GEMM Produces Garbage Output; Fixed via FlashInfer SM120 Patches + compute_120f (CUDA 13.0) — 39 tok/s Native FP4 #3096

@brandonmmusic-max

Description


NVFP4 MoE on SM120 (RTX PRO 6000 Blackwell): Full Debug Report

Title

CUTLASS & FlashInfer NVFP4 MoE Grouped GEMM Fails on SM120 Desktop Blackwell GPUs — Debug Journey, Patches, and Benchmark Results

All native FP4 MoE backends produce garbage output or crash on SM120 (compute_120) due to broken CUTLASS grouped GEMM templates. By systematically patching FlashInfer 0.6.5's SM120 capability checks and CuTe DSL architecture restrictions, we achieved the first known correct native FP4 MoE output on desktop Blackwell — initially at 14.6 tok/s (vs Marlin's 46-49 tok/s) because the FlashInfer autotuner fell back to slow kernel tactics after TMA WS grouped GEMM initialization failures under compute_120a. Rebuilding with compute_120f (CUDA 13.0) enabled the fast tactics and raised throughput to 39 tok/s.


Environment

| Component | Detail |
| --- | --- |
| GPUs | 4x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB GDDR7 each, 384GB total) |
| Compute Capability | SM 12.0 (sm_120, NOT sm_120a) |
| Interconnect | PCIe (no NVLink) |
| Driver | 582.16 |
| OS | Windows 11 Pro + WSL2 Ubuntu 22.04 |
| CUDA | 12.8 (primary), 13.0 (available for JIT) |
| PyTorch | 2.10.0+cu128 |
| vLLM | 0.17.0 |
| FlashInfer | 0.6.5 (upgraded from 0.6.4) |
| CUTLASS | 4.2.1 (vendored in vLLM), 4.4.1 (tested separately) |

Model

| Parameter | Value |
| --- | --- |
| Model | nvidia/Qwen3.5-397B-A17B-NVFP4 |
| Total Params | 397B (17B active per token) |
| Experts | 512 routed + 1 shared, 10 routed per token |
| Quantization | NVFP4 (FP4 weights with FP8 block scales) |
| Parallelism | TP=2 + PP=2 (optimal for PCIe) |
| KV Cache | FP8 e4m3 |
| Max Seq Len | 32,768 |
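As a rough sanity check that a 397B-parameter model fits across four 96GB cards, here is a back-of-envelope weight footprint. The 16-element scale blocks are an assumption based on the NVFP4 convention; embeddings, global scales, and other overheads are ignored:

```python
# Back-of-envelope NVFP4 weight footprint (illustrative only; assumes one
# FP8 scale per 16-element block, which is the NVFP4 convention).
params = 397e9                 # total parameters
fp4_bytes = params * 0.5       # 4 bits per weight
scale_bytes = params / 16      # one FP8 byte per 16-weight block
total_gb = (fp4_bytes + scale_bytes) / 1e9
per_gpu_gb = total_gb / 4      # TP=2 x PP=2 shards weights across 4 GPUs
print(f"~{total_gb:.0f} GB of weights, ~{per_gpu_gb:.0f} GB per 96 GB GPU")
```

At roughly 56 GB of weights per GPU, the remaining headroom goes to activations and the FP8 KV cache.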

The Problem

NVFP4 MoE models produce garbage output (random whitespace, commas, fragments) on SM120 desktop Blackwell GPUs when using any backend that relies on CUTLASS grouped block-scaled FP4 GEMM kernels. Dense (non-MoE) FP4 GEMM works correctly — the issue is specifically in the grouped GEMM path used by MoE expert computations.

Symptom

Prompt: "What is the capital of Kentucky?"
Output: "  ,   ,  (!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

The model loads, serves requests, and generates tokens — but the MoE expert GEMM produces numerically wrong results, leading to incoherent output.


What We Tried (Chronological)

Phase 1: CUDA Kernel-Level Fixes (vLLM Source Rebuilds)

1. GDC (Grid Dependency Control) Barriers

  • Hypothesis: Missing PDL synchronization barriers in CUTLASS grouped GEMM
  • Action: Added -DCUTLASS_ENABLE_GDC_FOR_SM100=1 to CMakeLists.txt
  • Finding: The flag was silently ignored! compute_120 (without a) doesn't define __CUDA_ARCH_FEAT_SM120_ALL, so the #ifndef CUTLASS_GDC_ENABLED guard evaluated to false
  • Fix: Added -DCUTLASS_GDC_ENABLED directly as a compiler flag
  • Result: GDC barriers now compiled as real PTX instructions (griddepcontrol.wait/launch), but still garbage output
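The silent-ignore behavior can be modeled in a few lines (a simplified Python rendering of the guard logic, not CUTLASS's actual headers):

```python
def gdc_enabled(defines: set[str]) -> bool:
    """Simplified model of the CUTLASS GDC guard chain.

    CUTLASS_ENABLE_GDC_FOR_SM100 only takes effect when the arch-feature
    macro is also present -- which plain compute_120 never defines -- while
    passing CUTLASS_GDC_ENABLED directly bypasses the feature check.
    """
    if "CUTLASS_GDC_ENABLED" in defines:          # direct override (the fix)
        return True
    return ("CUTLASS_ENABLE_GDC_FOR_SM100" in defines
            and "__CUDA_ARCH_FEAT_SM120_ALL" in defines)

# compute_120 build: the CMake flag alone is silently ignored
assert not gdc_enabled({"CUTLASS_ENABLE_GDC_FOR_SM100"})
# adding -DCUTLASS_GDC_ENABLED as a compiler flag forces GDC on
assert gdc_enabled({"CUTLASS_ENABLE_GDC_FOR_SM100", "CUTLASS_GDC_ENABLED"})
```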

2. FP32 Amax Computation

  • Hypothesis: Half-precision amax in cvt_warp_fp16_to_fp4 causing quantization errors on SM120
  • Action: Patched nvfp4_utils.cuh to compute per-block amax entirely in FP32 (fabsf/fmaxf instead of __habs2/__hmax2)
  • Result: Still garbage. Scale computation was already FP32; the half-precision amax wasn't the root cause.

3. Pingpong Kernel Schedule

  • Hypothesis: Cooperative schedule buggy on SM120, Pingpong might work
  • Action: Changed SM120 GEMM from KernelScheduleAuto to KernelPtrArrayTmaWarpSpecializedPingpong
  • Result: SEGFAULT. Pingpong schedule crashes on SM120.

4. compute_120a Architecture Flag

5. CUTLASS 4.4.1 Upgrade

  • Hypothesis: CUTLASS 4.4.1 changelog mentions SM120 fixes
  • Action: Cloned CUTLASS 4.4.1, set VLLM_CUTLASS_SRC_DIR, rebuilt _C.abi3.so
  • Critical Bug: First clone attempt silently got 4.2.1 due to CMake's FetchContent_Declare overwriting our clone with hardcoded GIT_TAG v4.2.1. Fixed by using VLLM_CUTLASS_SRC_DIR env var.
  • Result: Still garbage. CUTLASS 4.4.1 has the same broken SM120 grouped block-scaled GEMM templates.

Phase 2: Alternative MoE Backends (FlashInfer)

vLLM supports 5 MoE backends for NVFP4:

  1. VLLM_CUTLASS (default) — broken on SM120
  2. FLASHINFER_TRTLLM — blocked by SM100-only capability checks
  3. FLASHINFER_CUTLASS — blocked by SM120 capability checks + missing sm_120a in CuTe DSL
  4. FLASHINFER_CUTEDSL — blocked by SM100-only capability checks
  5. MARLIN — working W4A16 workaround (46-49 tok/s)

6. FlashInfer CUTLASS Backend (The Breakthrough)

Required patches (10+ files):

vLLM Capability Checks (3 files)
# trtllm_nvfp4_moe.py, flashinfer_trtllm_moe.py, flashinfer_cutedsl_moe.py
# Changed:
return p.is_cuda() and p.is_device_capability_family(100)
# To:
return p.is_cuda() and (p.is_device_capability_family(100) or p.is_device_capability_family(120))
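The widened check can be exercised standalone. `FakePlatform` below is a stub standing in for vLLM's platform object, not the real API:

```python
class FakePlatform:
    """Stub mimicking the two vLLM platform methods the check uses."""
    def __init__(self, major: int):
        self.major = major
    def is_cuda(self) -> bool:
        return True
    def is_device_capability_family(self, family: int) -> bool:
        # family 100 -> SM 10.x (datacenter), family 120 -> SM 12.x (desktop)
        return self.major * 10 == family

def backend_supported(p) -> bool:
    # patched check: accept both datacenter and desktop Blackwell families
    return p.is_cuda() and (p.is_device_capability_family(100)
                            or p.is_device_capability_family(120))

assert backend_supported(FakePlatform(10))     # SM100: allowed before and after
assert backend_supported(FakePlatform(12))     # SM120: only allowed after patch
assert not backend_supported(FakePlatform(9))  # Hopper: still rejected
```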
FlashInfer JIT Architecture Filters (flashinfer/jit/fused_moe.py)
# Lines 62, 79, 238: Added major version 12
supported_major_versions=[10]      # -> [10, 12]
supported_major_versions=[10, 11]  # -> [10, 11, 12]
FlashInfer Compilation Context (flashinfer/compilation_context.py)
# Changed: major >= 9 adds "a" suffix (generates compute_120a which is needed for CUTLASS MMA)
# SM120 needs "a" suffix for MMA instructions, but not "f" (CUDA 13.0+ only)
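The suffix logic amounts to something like this (an illustrative reimplementation, not FlashInfer's actual code; the CUDA 13.0 cutoff for the `f` suffix follows from the findings later in this report):

```python
def arch_flag(major: int, minor: int, cuda_version: tuple[int, int]) -> str:
    """Pick the nvcc arch suffix for a compute capability (illustrative).

    The 'a' suffix unlocks arch-specific MMA instructions under CUDA 12.x;
    the 'f' (family) suffix exists only from CUDA 13.0 and is what makes
    SM120's TMA WS grouped GEMM tactics initialize.
    """
    base = f"compute_{major * 10 + minor}"
    if major >= 10 and cuda_version >= (13, 0):
        return base + "f"   # full family feature set (CUDA 13.0+)
    if major >= 9:
        return base + "a"   # arch-specific features (CUDA 12.x)
    return base

assert arch_flag(12, 0, (12, 8)) == "compute_120a"
assert arch_flag(12, 0, (13, 0)) == "compute_120f"
```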
CuTe DSL admissible_archs (5 files, 18+ locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/copy.py  (4 locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py   (2 locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py  (3 locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/arch/mbar.py           (8 locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/arch/elect.py          (1 location)

Added "sm_120a" after every "sm_100a" in admissible_archs lists.

cuda.py Device Mapping
# Added:
(12, 0): ("Blackwell", "sm_120a", ["sm_120a"]),  # RTX PRO 6000
TRT-LLM C++ Launcher (flashinfer/data/csrc/trtllm_fused_moe_kernel_launcher.cu)
// Lines 417, 1345: Changed == to >=
TVM_FFI_ICHECK_EQ(major, 10)           // -> TVM_FFI_ICHECK_GE(major, 10)
TVM_FFI_ICHECK_EQ(std::get<0>(...), 10) // -> TVM_FFI_ICHECK_GE(...)
Additional Requirements
  • nvcc must be in PATH (FlashInfer JIT needs it)
  • FlashInfer JIT cache must be cleared after patching
  • VLLM_NVFP4_GEMM_BACKEND=cutlass env var for dense layers (use vLLM native CUTLASS)
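Those preconditions can be wrapped in a small preflight check. This is a hypothetical helper, not part of either project; `preflight` and `REQUIRED_ENV` are names invented here, but the env-var names match the launch commands below:

```python
import shutil

REQUIRED_ENV = {
    "VLLM_NVFP4_GEMM_BACKEND": "cutlass",   # dense layers stay on vLLM CUTLASS
    "NCCL_CUMEM_ENABLE": "0",
    "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
}

def preflight(env: dict) -> list[str]:
    """Return problems that would break the FlashInfer JIT path."""
    problems = []
    if shutil.which("nvcc", path=env.get("PATH", "")) is None:
        problems.append("nvcc not found in PATH (FlashInfer JIT needs it)")
    for key, want in REQUIRED_ENV.items():
        if env.get(key) != want:
            problems.append(f"set {key}={want}")
    return problems

# An empty environment reports all four issues
assert len(preflight({})) == 4
```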

Result: CORRECT OUTPUT! First known native FP4 MoE on SM120 desktop Blackwell.


Benchmark Results

Launch Command (FlashInfer CUTLASS — Working Native FP4)

export PATH="/usr/local/cuda-12.8/bin:$PATH"  # or cuda-13.0 for compute_120f
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export NCCL_CUMEM_ENABLE=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --moe-backend flashinfer_cutlass

Speed Comparison

| Backend | MoE Kernel | CUDA | Single User (tok/s) | 4-User (tok/s per user) | Output |
| --- | --- | --- | --- | --- | --- |
| Marlin (--moe-backend marlin) | W4A16 dequant | 12.8 | 46-49 | ~37 | Correct |
| FlashInfer CUTLASS 120f | SM120 CUTLASS JIT | 13.0 | 39.0 | 18.2 | Correct |
| FlashInfer CUTLASS 120a | SM120 CUTLASS JIT | 12.8 | 14.6-14.9 | 6.9-8.5 | Correct |
| FlashInfer CUTLASS Hybrid | SM120 JIT + vLLM dense | 12.8 | 14.8-14.9 | 6.9 | Correct |
| vLLM Native CUTLASS | Grouped block-scaled | 12.8 | N/A | N/A | Garbage |
| CUTLASS 4.4.1 rebuild | Grouped block-scaled | 12.8 | N/A | N/A | Garbage |
| FlashInfer TRT-LLM | TRT-LLM cubins | 12.8 | N/A | N/A | Crash |

Why FlashInfer CUTLASS is 3x Slower Than Marlin

FlashInfer's autotuner logs reveal the root cause:

flashinfer.jit: [Autotuner]: Skipping tactic <MoERunner> 14, due to failure:
[TensorRT-LLM][ERROR] Failed to initialize cutlass TMA WS grouped gemm.
Error: Error Internal (cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:60)

All TMA warp-specialized grouped GEMM tactics fail to initialize on SM120 with compute_120a. The autotuner falls back to slower, non-TMA tactics. This is a CUTLASS template-level issue where SM120's TMA grouped GEMM doesn't work with the a suffix — it likely requires the f suffix (compute_120f) which is only available with CUDA 13.0+.
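The skip-and-fall-back behavior in those logs can be pictured schematically (a toy model, not FlashInfer's actual autotuner):

```python
def pick_tactic(tactics):
    """Schematic autotuner: skip tactics whose kernels fail to initialize
    and fall back to the first one that works."""
    for name, init in tactics:
        try:
            init()           # e.g. CUTLASS TMA WS grouped GEMM setup
            return name
        except RuntimeError as err:
            print(f"Skipping tactic {name}, due to failure: {err}")
    raise RuntimeError("no usable tactic")

def tma_ws_init():
    # On SM120 + compute_120a, every TMA WS tactic dies here
    raise RuntimeError("Failed to initialize cutlass TMA WS grouped gemm")

chosen = pick_tactic([
    ("tma_ws_grouped", tma_ws_init),       # fast path: fails on compute_120a
    ("non_tma_fallback", lambda: None),    # slow path: initializes fine
])
assert chosen == "non_tma_fallback"        # correct output, but ~3x slower
```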


Key Technical Findings

1. compute_120 vs compute_120a vs compute_120f

| Flag | CUDA Version | MMA Instructions | CUTLASS Grouped GEMM | Result |
| --- | --- | --- | --- | --- |
| compute_120 | 12.8+ | Not enabled | "Arch conditional MMA" error | Fails |
| compute_120a | 12.8+ | Enabled | TMA WS tactics fail, slow fallback | 14.6 tok/s |
| compute_120f | 13.0+ only | Full feature set | Fast TMA WS tactics initialize | 39.0 tok/s |

2. SM120 Desktop is NOT SM100 Compatible

Despite sharing the "Blackwell" brand, SM120 (desktop) and SM100 (datacenter) have different:

  • Compute capability families (12 vs 10)
  • Supported architecture features (a vs f suffix)
  • Pre-compiled cubin compatibility (SM100 cubins crash on SM120)

3. The Broken Chain

vLLM CUTLASS grouped GEMM → garbage output (kernel correctness bug)
    ↓ upgrade CUTLASS 4.4.1
Still garbage (same templates, 0 SM120 changes)
    ↓ try FlashInfer CUTLASS
Blocked: SM120 not in capability checks
    ↓ patch 10+ files
Works with correct output, but slow (autotuner fallback)
    ↓ try FlashInfer TRT-LLM
Crash: hardcoded SM==10 in C++ + SM100-only cubins
    ↓ next: compute_120f with CUDA 13.0
Works — 39 tok/s with fast TMA WS tactics (results below)

BREAKTHROUGH: compute_120f with CUDA 13.0

A DGX Spark (SM121) user achieved 35 tok/s with FlashInfer CUTLASS using compute_121f (CUDA 13.0). The f suffix enables the "full" family feature set with working TMA WS grouped GEMM tactics.

Results: compute_120f Nearly Triples Speed

| Metric | compute_120a (CUDA 12.8) | compute_120f (CUDA 13.0) | Marlin W4A16 |
| --- | --- | --- | --- |
| Single user | 14.6 tok/s | 39.0 tok/s | 46-49 tok/s |
| 4-user concurrent | 6.9 tok/s/user | 18.2 tok/s/user | ~37 tok/s/user |

compute_120f enabled the fast TMA WS grouped GEMM tactics that failed with compute_120a. This confirms the f suffix is the correct architecture designation for SM120 desktop Blackwell GPUs.
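The arithmetic behind the headline claims, using the figures from the table above:

```python
a_tok_s = 14.6   # compute_120a, single user
f_tok_s = 39.0   # compute_120f, single user
marlin = 47.5    # midpoint of Marlin's 46-49 tok/s range

speedup = f_tok_s / a_tok_s    # ~2.67x: hence "nearly triples"
gap = 1 - f_tok_s / marlin     # ~18%: the remaining deficit vs Marlin
print(f"{speedup:.2f}x over compute_120a, {gap:.0%} behind Marlin")
```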

Launch Command (CUDA 13.0 + compute_120f)

export PATH="/usr/local/cuda-13.0/bin:$PATH"
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export NCCL_CUMEM_ENABLE=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --moe-backend flashinfer_cutlass

Why 39 vs 49 tok/s?

The remaining ~20% gap vs Marlin is likely due to:

  • FlashInfer CUTLASS autotuner may not select the absolute optimal tactic
  • Native FP4 GEMM has activation quantization overhead (BF16 -> FP4 per-token)
  • Further kernel tuning by FlashInfer team could close the gap
  • Pipeline parallel bubble overhead affects native FP4 slightly differently than Marlin

Production Recommendation (Current)

Use Marlin for production for now — even with compute_120f confirmed at 39 tok/s, Marlin remains faster, especially under concurrent load:

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --moe-backend marlin \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

Required env vars:

export NCCL_CUMEM_ENABLE=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn

Related Issues


Files Patched (Complete List)

FlashInfer 0.6.5

| File | Change |
| --- | --- |
| flashinfer/compilation_context.py | Arch suffix logic for SM120 |
| flashinfer/jit/fused_moe.py (3 locations) | Added major version 12 to supported_major_versions |
| flashinfer/data/csrc/trtllm_fused_moe_kernel_launcher.cu (2 locations) | ICHECK_EQ -> ICHECK_GE |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/copy.py (4 locations) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py (2 locations) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py (3 locations) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/arch/mbar.py (8 locations) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/arch/elect.py (1 location) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/base_dsl/runtime/cuda.py | Added (12, 0) device mapping |

vLLM 0.17.0

| File | Change |
| --- | --- |
| vllm/model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py | Added is_device_capability_family(120) |
| vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py | Added is_device_capability_family(120) |
| vllm/model_executor/layers/fused_moe/flashinfer_cutedsl_moe.py | Added is_device_capability_family(120) |

vLLM Source (CUDA kernel rebuilds — tested but not needed for FlashInfer path)

| File | Change |
| --- | --- |
| vllm-src/CMakeLists.txt | Added -DCUTLASS_GDC_ENABLED, -DCUTLASS_ENABLE_GDC_FOR_SM100=1 |
| vllm-src/csrc/quantization/fp4/nvfp4_utils.cuh | FP32 amax computation |

Report date: March 8, 2026
Hardware: 4x RTX PRO 6000 Blackwell (SM120, 96GB each)
Tested by: Brandon Music, local inference team lead, Kentucky Local Counsel
