
SM120 (Bug, With Fix) (RTX Blackwell) NVFP4 MoE: CUTLASS Grouped GEMM Produces Garbage Output; Fixed via FlashInfer SM120 Patches + compute_120f (CUDA 13.0) — 39 tok/s Native FP4 #3096

@brandonmmusic-max

Description


NVFP4 MoE on SM120 (RTX PRO 6000 Blackwell): Full Debug Report

Title

CUTLASS & FlashInfer NVFP4 MoE Grouped GEMM Fails on SM120 Desktop Blackwell GPUs — Debug Journey, Patches, and Benchmark Results

All native FP4 MoE backends produce garbage output or crash on SM120 (compute_120) due to broken CUTLASS grouped GEMM templates. By systematically patching FlashInfer 0.6.5's SM120 capability checks and CuTe DSL architecture restrictions, we achieved the first known correct native FP4 MoE output on desktop Blackwell — initially at 14.6 tok/s (vs Marlin's 46-49 tok/s) because the FlashInfer autotuner fell back to slow kernel tactics after TMA WS grouped GEMM initialization failures under compute_120a. Rebuilding with compute_120f (CUDA 13.0) enabled the fast tactics and raised throughput to 39 tok/s.


Environment

| Component | Detail |
| --- | --- |
| GPUs | 4x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB GDDR7 each, 384GB total) |
| Compute Capability | SM 12.0 (sm_120, NOT sm_120a) |
| Interconnect | PCIe (no NVLink) |
| Driver | 582.16 |
| OS | Windows 11 Pro + WSL2 Ubuntu 22.04 |
| CUDA | 12.8 (primary), 13.0 (available for JIT) |
| PyTorch | 2.10.0+cu128 |
| vLLM | 0.17.0 |
| FlashInfer | 0.6.5 (upgraded from 0.6.4) |
| CUTLASS | 4.2.1 (vendored in vLLM), 4.4.1 (tested separately) |

Model

| Parameter | Value |
| --- | --- |
| Model | nvidia/Qwen3.5-397B-A17B-NVFP4 |
| Total Params | 397B (17B active per token) |
| Experts | 512 routed + 1 shared, 10 routed per token |
| Quantization | NVFP4 (FP4 weights with FP8 block scales) |
| Parallelism | TP=2 + PP=2 (optimal for PCIe) |
| KV Cache | FP8 e4m3 |
| Max Seq Len | 32,768 |
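As a rough sanity check that a 397B-parameter model fits across four 96GB cards, here is a back-of-envelope weight footprint. The 16-element scale blocks are an assumption based on the NVFP4 convention; embeddings, global scales, and other overheads are ignored:

```python
# Back-of-envelope NVFP4 weight footprint (illustrative only; assumes one
# FP8 scale per 16-element block, which is the NVFP4 convention).
params = 397e9                 # total parameters
fp4_bytes = params * 0.5       # 4 bits per weight
scale_bytes = params / 16      # one FP8 byte per 16-weight block
total_gb = (fp4_bytes + scale_bytes) / 1e9
per_gpu_gb = total_gb / 4      # TP=2 x PP=2 shards weights across 4 GPUs
print(f"~{total_gb:.0f} GB of weights, ~{per_gpu_gb:.0f} GB per 96 GB GPU")
```

At roughly 56 GB of weights per GPU, the remaining headroom goes to activations and the FP8 KV cache.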

The Problem

NVFP4 MoE models produce garbage output (random whitespace, commas, fragments) on SM120 desktop Blackwell GPUs when using any backend that relies on CUTLASS grouped block-scaled FP4 GEMM kernels. Dense (non-MoE) FP4 GEMM works correctly — the issue is specifically in the grouped GEMM path used by MoE expert computations.

Symptom

Prompt: "What is the capital of Kentucky?"
Output: "  ,   ,  (!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

The model loads, serves requests, and generates tokens — but the MoE expert GEMM produces numerically wrong results, leading to incoherent output.


What We Tried (Chronological)

Phase 1: CUDA Kernel-Level Fixes (vLLM Source Rebuilds)

1. GDC (Grid Dependency Control) Barriers

  • Hypothesis: Missing PDL synchronization barriers in CUTLASS grouped GEMM
  • Action: Added -DCUTLASS_ENABLE_GDC_FOR_SM100=1 to CMakeLists.txt
  • Finding: The flag was silently ignored! compute_120 (without a) doesn't define __CUDA_ARCH_FEAT_SM120_ALL, so the #ifndef CUTLASS_GDC_ENABLED guard evaluated to false
  • Fix: Added -DCUTLASS_GDC_ENABLED directly as a compiler flag
  • Result: GDC barriers now compiled as real PTX instructions (griddepcontrol.wait/launch), but still garbage output
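The silent-ignore behavior can be modeled in a few lines (a simplified Python rendering of the guard logic, not CUTLASS's actual headers):

```python
def gdc_enabled(defines: set[str]) -> bool:
    """Simplified model of the CUTLASS GDC guard chain.

    CUTLASS_ENABLE_GDC_FOR_SM100 only takes effect when the arch-feature
    macro is also present -- which plain compute_120 never defines -- while
    passing CUTLASS_GDC_ENABLED directly bypasses the feature check.
    """
    if "CUTLASS_GDC_ENABLED" in defines:          # direct override (the fix)
        return True
    return ("CUTLASS_ENABLE_GDC_FOR_SM100" in defines
            and "__CUDA_ARCH_FEAT_SM120_ALL" in defines)

# compute_120 build: the CMake flag alone is silently ignored
assert not gdc_enabled({"CUTLASS_ENABLE_GDC_FOR_SM100"})
# adding -DCUTLASS_GDC_ENABLED as a compiler flag forces GDC on
assert gdc_enabled({"CUTLASS_ENABLE_GDC_FOR_SM100", "CUTLASS_GDC_ENABLED"})
```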

2. FP32 Amax Computation

  • Hypothesis: Half-precision amax in cvt_warp_fp16_to_fp4 causing quantization errors on SM120
  • Action: Patched nvfp4_utils.cuh to compute per-block amax entirely in FP32 (fabsf/fmaxf instead of __habs2/__hmax2)
  • Result: Still garbage. Scale computation was already FP32; the half-precision amax wasn't the root cause.

3. Pingpong Kernel Schedule

  • Hypothesis: Cooperative schedule buggy on SM120, Pingpong might work
  • Action: Changed SM120 GEMM from KernelScheduleAuto to KernelPtrArrayTmaWarpSpecializedPingpong
  • Result: SEGFAULT. Pingpong schedule crashes on SM120.

4. compute_120a Architecture Flag

5. CUTLASS 4.4.1 Upgrade

  • Hypothesis: CUTLASS 4.4.1 changelog mentions SM120 fixes
  • Action: Cloned CUTLASS 4.4.1, set VLLM_CUTLASS_SRC_DIR, rebuilt _C.abi3.so
  • Critical Bug: First clone attempt silently got 4.2.1 due to CMake's FetchContent_Declare overwriting our clone with hardcoded GIT_TAG v4.2.1. Fixed by using VLLM_CUTLASS_SRC_DIR env var.
  • Result: Still garbage. CUTLASS 4.4.1 has the same broken SM120 grouped block-scaled GEMM templates.

Phase 2: Alternative MoE Backends (FlashInfer)

vLLM supports 5 MoE backends for NVFP4:

  1. VLLM_CUTLASS (default) — broken on SM120
  2. FLASHINFER_TRTLLM — blocked by SM100-only capability checks
  3. FLASHINFER_CUTLASS — blocked by SM120 capability checks + missing sm_120a in CuTe DSL
  4. FLASHINFER_CUTEDSL — blocked by SM100-only capability checks
  5. MARLIN — working W4A16 workaround (46-49 tok/s)

6. FlashInfer CUTLASS Backend (The Breakthrough)

Required patches (10+ files):

vLLM Capability Checks (3 files)
# trtllm_nvfp4_moe.py, flashinfer_trtllm_moe.py, flashinfer_cutedsl_moe.py
# Changed:
return p.is_cuda() and p.is_device_capability_family(100)
# To:
return p.is_cuda() and (p.is_device_capability_family(100) or p.is_device_capability_family(120))
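The widened check can be exercised standalone. `FakePlatform` below is a stub standing in for vLLM's platform object, not the real API:

```python
class FakePlatform:
    """Stub mimicking the two vLLM platform methods the check uses."""
    def __init__(self, major: int):
        self.major = major
    def is_cuda(self) -> bool:
        return True
    def is_device_capability_family(self, family: int) -> bool:
        # family 100 -> SM 10.x (datacenter), family 120 -> SM 12.x (desktop)
        return self.major * 10 == family

def backend_supported(p) -> bool:
    # patched check: accept both datacenter and desktop Blackwell families
    return p.is_cuda() and (p.is_device_capability_family(100)
                            or p.is_device_capability_family(120))

assert backend_supported(FakePlatform(10))     # SM100: allowed before and after
assert backend_supported(FakePlatform(12))     # SM120: only allowed after patch
assert not backend_supported(FakePlatform(9))  # Hopper: still rejected
```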
FlashInfer JIT Architecture Filters (flashinfer/jit/fused_moe.py)
# Lines 62, 79, 238: Added major version 12
supported_major_versions=[10]      # -> [10, 12]
supported_major_versions=[10, 11]  # -> [10, 11, 12]
FlashInfer Compilation Context (flashinfer/compilation_context.py)
# Changed: major >= 9 adds "a" suffix (generates compute_120a which is needed for CUTLASS MMA)
# SM120 needs "a" suffix for MMA instructions, but not "f" (CUDA 13.0+ only)
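The suffix logic amounts to something like this (an illustrative reimplementation, not FlashInfer's actual code; the CUDA 13.0 cutoff for the `f` suffix follows from the findings later in this report):

```python
def arch_flag(major: int, minor: int, cuda_version: tuple[int, int]) -> str:
    """Pick the nvcc arch suffix for a compute capability (illustrative).

    The 'a' suffix unlocks arch-specific MMA instructions under CUDA 12.x;
    the 'f' (family) suffix exists only from CUDA 13.0 and is what makes
    SM120's TMA WS grouped GEMM tactics initialize.
    """
    base = f"compute_{major * 10 + minor}"
    if major >= 10 and cuda_version >= (13, 0):
        return base + "f"   # full family feature set (CUDA 13.0+)
    if major >= 9:
        return base + "a"   # arch-specific features (CUDA 12.x)
    return base

assert arch_flag(12, 0, (12, 8)) == "compute_120a"
assert arch_flag(12, 0, (13, 0)) == "compute_120f"
```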
CuTe DSL admissible_archs (5 files, 18+ locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/copy.py  (4 locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py   (2 locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py  (3 locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/arch/mbar.py           (8 locations)
flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/arch/elect.py          (1 location)

Added "sm_120a" after every "sm_100a" in admissible_archs lists.

cuda.py Device Mapping
# Added:
(12, 0): ("Blackwell", "sm_120a", ["sm_120a"]),  # RTX PRO 6000
TRT-LLM C++ Launcher (flashinfer/data/csrc/trtllm_fused_moe_kernel_launcher.cu)
// Lines 417, 1345: Changed == to >=
TVM_FFI_ICHECK_EQ(major, 10)           // -> TVM_FFI_ICHECK_GE(major, 10)
TVM_FFI_ICHECK_EQ(std::get<0>(...), 10) // -> TVM_FFI_ICHECK_GE(...)
Additional Requirements
  • nvcc must be in PATH (FlashInfer JIT needs it)
  • FlashInfer JIT cache must be cleared after patching
  • VLLM_NVFP4_GEMM_BACKEND=cutlass env var for dense layers (use vLLM native CUTLASS)
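Those preconditions can be wrapped in a small preflight check. This is a hypothetical helper, not part of either project; `preflight` and `REQUIRED_ENV` are names invented here, but the env-var names match the launch commands below:

```python
import shutil

REQUIRED_ENV = {
    "VLLM_NVFP4_GEMM_BACKEND": "cutlass",   # dense layers stay on vLLM CUTLASS
    "NCCL_CUMEM_ENABLE": "0",
    "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
}

def preflight(env: dict) -> list[str]:
    """Return problems that would break the FlashInfer JIT path."""
    problems = []
    if shutil.which("nvcc", path=env.get("PATH", "")) is None:
        problems.append("nvcc not found in PATH (FlashInfer JIT needs it)")
    for key, want in REQUIRED_ENV.items():
        if env.get(key) != want:
            problems.append(f"set {key}={want}")
    return problems

# An empty environment reports all four issues
assert len(preflight({})) == 4
```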

Result: CORRECT OUTPUT! First known native FP4 MoE on SM120 desktop Blackwell.


Benchmark Results

Launch Command (FlashInfer CUTLASS — Working Native FP4)

export PATH="/usr/local/cuda-12.8/bin:$PATH"  # or cuda-13.0 for compute_120f
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export NCCL_CUMEM_ENABLE=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --moe-backend flashinfer_cutlass

Speed Comparison

| Backend | MoE Kernel | CUDA | Single User (tok/s) | 4-User (tok/s per user) | Output |
| --- | --- | --- | --- | --- | --- |
| Marlin (--moe-backend marlin) | W4A16 dequant | 12.8 | 46-49 | ~37 | Correct |
| FlashInfer CUTLASS 120f | SM120 CUTLASS JIT | 13.0 | 39.0 | 18.2 | Correct |
| FlashInfer CUTLASS 120a | SM120 CUTLASS JIT | 12.8 | 14.6-14.9 | 6.9-8.5 | Correct |
| FlashInfer CUTLASS Hybrid | SM120 JIT + vLLM dense | 12.8 | 14.8-14.9 | 6.9 | Correct |
| vLLM Native CUTLASS | Grouped block-scaled | 12.8 | N/A | N/A | Garbage |
| CUTLASS 4.4.1 rebuild | Grouped block-scaled | 12.8 | N/A | N/A | Garbage |
| FlashInfer TRT-LLM | TRT-LLM cubins | 12.8 | N/A | N/A | Crash |

Why FlashInfer CUTLASS is 3x Slower Than Marlin

FlashInfer's autotuner logs reveal the root cause:

flashinfer.jit: [Autotuner]: Skipping tactic <MoERunner> 14, due to failure:
[TensorRT-LLM][ERROR] Failed to initialize cutlass TMA WS grouped gemm.
Error: Error Internal (cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:60)

All TMA warp-specialized grouped GEMM tactics fail to initialize on SM120 with compute_120a. The autotuner falls back to slower, non-TMA tactics. This is a CUTLASS template-level issue where SM120's TMA grouped GEMM doesn't work with the a suffix — it likely requires the f suffix (compute_120f) which is only available with CUDA 13.0+.
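The skip-and-fall-back behavior in those logs can be pictured schematically (a toy model, not FlashInfer's actual autotuner):

```python
def pick_tactic(tactics):
    """Schematic autotuner: skip tactics whose kernels fail to initialize
    and fall back to the first one that works."""
    for name, init in tactics:
        try:
            init()           # e.g. CUTLASS TMA WS grouped GEMM setup
            return name
        except RuntimeError as err:
            print(f"Skipping tactic {name}, due to failure: {err}")
    raise RuntimeError("no usable tactic")

def tma_ws_init():
    # On SM120 + compute_120a, every TMA WS tactic dies here
    raise RuntimeError("Failed to initialize cutlass TMA WS grouped gemm")

chosen = pick_tactic([
    ("tma_ws_grouped", tma_ws_init),       # fast path: fails on compute_120a
    ("non_tma_fallback", lambda: None),    # slow path: initializes fine
])
assert chosen == "non_tma_fallback"        # correct output, but ~3x slower
```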


Key Technical Findings

1. compute_120 vs compute_120a vs compute_120f

| Flag | CUDA Version | MMA Instructions | CUTLASS Grouped GEMM | Result |
| --- | --- | --- | --- | --- |
| compute_120 | 12.8+ | Not enabled | "Arch conditional MMA" error | Fails |
| compute_120a | 12.8+ | Enabled | TMA WS tactics fail, slow fallback | 14.6 tok/s |
| compute_120f | 13.0+ only | Full feature set | Fast TMA WS tactics initialize | 39.0 tok/s |

2. SM120 Desktop is NOT SM100 Compatible

Despite sharing the "Blackwell" brand, SM120 (desktop) and SM100 (datacenter) have different:

  • Compute capability families (12 vs 10)
  • Supported architecture features (a vs f suffix)
  • Pre-compiled cubin compatibility (SM100 cubins crash on SM120)

3. The Broken Chain

vLLM CUTLASS grouped GEMM → garbage output (kernel correctness bug)
    ↓ upgrade CUTLASS 4.4.1
Still garbage (same templates, 0 SM120 changes)
    ↓ try FlashInfer CUTLASS
Blocked: SM120 not in capability checks
    ↓ patch 10+ files
Works with correct output, but slow (autotuner fallback)
    ↓ try FlashInfer TRT-LLM
Crash: hardcoded SM==10 in C++ + SM100-only cubins
    ↓ next: compute_120f with CUDA 13.0
Works — 39 tok/s with fast TMA WS tactics (results below)

BREAKTHROUGH: compute_120f with CUDA 13.0

A DGX Spark (SM121) user achieved 35 tok/s with FlashInfer CUTLASS using compute_121f (CUDA 13.0). The f suffix enables the "full" family feature set with working TMA WS grouped GEMM tactics.

Results: compute_120f Nearly Triples Speed

| Metric | compute_120a (CUDA 12.8) | compute_120f (CUDA 13.0) | Marlin W4A16 |
| --- | --- | --- | --- |
| Single user | 14.6 tok/s | 39.0 tok/s | 46-49 tok/s |
| 4-user concurrent | 6.9 tok/s/user | 18.2 tok/s/user | ~37 tok/s/user |

compute_120f enabled the fast TMA WS grouped GEMM tactics that failed with compute_120a. This confirms the f suffix is the correct architecture designation for SM120 desktop Blackwell GPUs.
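The arithmetic behind the headline claims, using the figures from the table above:

```python
a_tok_s = 14.6   # compute_120a, single user
f_tok_s = 39.0   # compute_120f, single user
marlin = 47.5    # midpoint of Marlin's 46-49 tok/s range

speedup = f_tok_s / a_tok_s    # ~2.67x: hence "nearly triples"
gap = 1 - f_tok_s / marlin     # ~18%: the remaining deficit vs Marlin
print(f"{speedup:.2f}x over compute_120a, {gap:.0%} behind Marlin")
```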

Launch Command (CUDA 13.0 + compute_120f)

export PATH="/usr/local/cuda-13.0/bin:$PATH"
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export NCCL_CUMEM_ENABLE=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --moe-backend flashinfer_cutlass

Why 39 vs 49 tok/s?

The remaining ~20% gap vs Marlin is likely due to:

  • FlashInfer CUTLASS autotuner may not select the absolute optimal tactic
  • Native FP4 GEMM has activation quantization overhead (BF16 -> FP4 per-token)
  • Further kernel tuning by FlashInfer team could close the gap
  • Pipeline parallel bubble overhead affects native FP4 slightly differently than Marlin

Production Recommendation (Current)

Use Marlin for production for now — even with compute_120f confirmed at 39 tok/s, Marlin remains faster, especially under concurrent load:

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --moe-backend marlin \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

Required env vars:

export NCCL_CUMEM_ENABLE=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn

Related Issues


Files Patched (Complete List)

FlashInfer 0.6.5

| File | Change |
| --- | --- |
| flashinfer/compilation_context.py | Arch suffix logic for SM120 |
| flashinfer/jit/fused_moe.py (3 locations) | Added major version 12 to supported_major_versions |
| flashinfer/data/csrc/trtllm_fused_moe_kernel_launcher.cu (2 locations) | ICHECK_EQ -> ICHECK_GE |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/copy.py (4 locations) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py (2 locations) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py (3 locations) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/arch/mbar.py (8 locations) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/cutlass/cute/arch/elect.py (1 location) | Added sm_120a to admissible_archs |
| flashinfer/data/cutlass/python/CuTeDSL/base_dsl/runtime/cuda.py | Added (12, 0) device mapping |

vLLM 0.17.0

| File | Change |
| --- | --- |
| vllm/model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py | Added is_device_capability_family(120) |
| vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py | Added is_device_capability_family(120) |
| vllm/model_executor/layers/fused_moe/flashinfer_cutedsl_moe.py | Added is_device_capability_family(120) |

vLLM Source (CUDA kernel rebuilds — tested but not needed for FlashInfer path)

| File | Change |
| --- | --- |
| vllm-src/CMakeLists.txt | Added -DCUTLASS_GDC_ENABLED, -DCUTLASS_ENABLE_GDC_FOR_SM100=1 |
| vllm-src/csrc/quantization/fp4/nvfp4_utils.cuh | FP32 amax computation |

Report date: March 8, 2026
Hardware: 4x RTX PRO 6000 Blackwell (SM120, 96GB each)
Tested by: Brandon Music, local inference team lead, Kentucky Local Counsel
