Skip to content

perf: optimize MXFP4xBF16 & INT4xFP8 CUTLASS MoE backend for SM90#3084

Merged
samuellees merged 33 commits intoflashinfer-ai:mainfrom
samuellees:feat/w4a16-moe-kernel
Apr 23, 2026
Merged

perf: optimize MXFP4xBF16 & INT4xFP8 CUTLASS MoE backend for SM90#3084
samuellees merged 33 commits intoflashinfer-ai:mainfrom
samuellees:feat/w4a16-moe-kernel

Conversation

@samuellees
Copy link
Copy Markdown
Collaborator

@samuellees samuellees commented Apr 15, 2026

Summary

Port TensorRT-LLM PR #12451 to FlashInfer's cutlass_fused_moe SM90 path. Adds an LDSM + interleaved-LUT weight-load pipeline for 4-bit weights × 16/8-bit activations, plus the two preprocessing helpers the new kernel layout requires.

Changes

Kernel

  • mixed_input_utils.hpp / sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp — sync with TRT-LLM PR #12451 (LDSM path + FP4/INT4 → BF16 LUT converter).
  • moe_gemm_mixed_utils.{cu,h} (new) — per-row CUDA kernels for FP4/INT4 byte interleave.
  • cutlass_heuristic.cpp — for has_w4afp8, skip CtaShape128x128x128B + COOPERATIVE (register overflow on SM90) and pick COOP / PINGPONG per tile.
  • moe_gemm_tma_ws_mixed_input_launcher.inlscheduler.max_swizzle_size = 2, raster_order = Heuristic.

Python

flashinfer/fused_moe/core.py exposes two helpers (re-exported by the package):

  • interleave_moe_weights_for_hopper_mixed_gemm(weight, quant_type) — byte-level interleave for "fp4" / "int4" packed uint8 weights; delegates to the C++ kernel above.
  • interleave_moe_scales_for_hopper_mixed_gemm(scales, group_size=32) — pure PyTorch reshape + permute matching TRT-LLM's WFP4A16FusedMoEMethod.load_quant_scales, factor = 128 // group_size.

Tests — inside tests/moe/test_trtllm_cutlass_fused_moe.py (18 new)

  • test_moe_bf16_mxfp4_hopper_correctness (5 shapes, strict assert_close vs a GPU-side dequantized reference that only materialises active experts to stay under H200 memory at e=256).
  • test_moe_bf16_mxfp4_hopper_coverage (5 shapes, percent-based ≥ 99.9%).
  • test_moe_bf16_mxfp4_hopper_activations (3 SwiGLU variants).
  • test_moe_w4a8_hopper_correctness (2 shapes × bf16/fp16) — envelope matches the upstream CI shape (h = inter = 512, e = 2); larger exceeds strict tolerance because of FP8 + INT4 accumulation noise, same as the existing test_moe_w4a8.
  • test_moe_w4a8_hopper_autotune — smoke that autotune(True) doesn't break the W4A8 path.

All 18 green on H200 in 5.2 s cache-hot.

Performance

H200 (SM90 / HBM3e), hidden = 4096, intermediate = 2048, experts = 256, topk = 6, bf16 output, MXFP4 weights. cutlass_fused_moe median over bench_gpu_time. Weight + scale interleave is a one-shot model-load step and is excluded from timing. autotune column runs one pass under autotune(True) to populate the tactic cache before timing.

batch main no-autotune main autotune PR no-autotune PR autotune speedup (autotune)
4 0.791 ms 0.513 ms 0.221 ms 0.193 ms 2.66×
16 1.598 ms 1.607 ms 0.530 ms 0.532 ms 3.02×
64 3.761 ms 3.757 ms 1.200 ms 1.207 ms 3.11×

Integrate key optimizations from TensorRT-LLM PR #12451 for the
mixed-dtype MoE GEMM path (MXFP4 weights + BF16 activations):

- cutlass_heuristic.cpp: Skip 128x128x128B + COOPERATIVE scheduler
  combo for W4A16 grouped GEMM to avoid register overflow on SM90.
  Fall back to PINGPONG for this tile, keeping COOPERATIVE for others.

- moe_gemm_tma_ws_mixed_input_launcher.inl: Add max_swizzle_size=2
  scheduler hint for better L2 cache locality.

- test_w4a16_moe.py: Add 30 parametrized test cases covering:
  batch_size=[1..512], hidden_size=[2048..7168], num_experts=[8..256],
  top_k=[1..8], intermediate_size=[1024..4096], activation variants.
  Core target config: experts=256, topk=6, hidden=4096, inter=2048.

AI-assisted

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Restricts a CUTLASS mainloop schedule for one SM90 tile when W4A16+AFP8 is enabled, sets a grouped-GEMM scheduler swizzle-size and raster-order override, adds Hopper-specific weight-interleaving CUDA helpers, new SM90-only parametrized tests for W4A16 MoE, and extends mixed-input conversion utilities (int4→fp8, fp4→bf16 LUTs) and SM100 helpers.

Changes

Cohort / File(s) Summary
CUTLASS heuristics
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
When building SM90 candidate configs for W4A16+AFP8, exclude MainloopScheduleType::COOPERATIVE for tile CtaShape128x128x128B; keep PINGPONG.
Grouped GEMM launcher
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_mixed_input_launcher.inl
Add include for scheduler params and set arguments.scheduler.max_swizzle_size = 2 and arguments.scheduler.raster_order = PersistentTileSchedulerSm90Params::RasterOrderOptions::Heuristic before group-size assertion and workspace logic.
MoE mixed GEMM utils (new)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu, csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.h
Add Hopper-specific CUDA kernels and host wrappers to interleave FP4 and INT4 packed weights for mixed-precision GEMM; new exported helper declarations.
Mixed-input conversion & SM100 helpers
csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/detail/collective/mixed_input_utils.hpp
Add int4→fp8 LUT conversion, rework fp4→bf16 LUT path to lane-dependent constants, add MixedInputUtilsSM100 with transform-based dequantization and adjust conversion dispatch and elements_per_smem_zero logic.
Tests (new)
tests/moe/test_w4a16_moe.py
Add SM90-only pytest module exercising fused MoE mixed-dtype (BF16 activations + W4A16 packed weights) with parametrized coverage, activation-parameter variants, and core-config test; skip when SM90 not supported.

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant Host as CPU Memory
    participant Launcher as GEMM Launcher
    participant Device as GPU Kernel
    Note over Test,Launcher: SM90-only test prepares inputs, packed weights, and quant scales
    Test->>Host: allocate activations, packed weights, scales, router logits
    Test->>Launcher: invoke cutlass_fused_moe(..., use_w4_group_scaling=True)
    Launcher->>Host: configure grouped-GEMM arguments (scheduler.max_swizzle_size=2, raster_order=Heuristic)
    Launcher->>Device: launch interleave_fp4/int4 kernels on stream
    Device-->>Host: write interleaved weight buffers
    Launcher->>Device: launch CUTLASS fused MoE GEMM kernel(s)
    Device-->>Host: write output tensor
    Test->>Host: validate output (finiteness, shape)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

run-ci, op: gemm

Suggested reviewers

  • aleozlx
  • jiahanc
  • IwakuraRein
  • yongwww
  • cyx-6
  • djns99

Poem

🐰
I hop through tiles and swap two little nibbles,
Tweak schedulers, pack weights, avoid sleepy dribbles.
Tests bound in fields of BF16 light,
Kernels hum softly into the night.
Small hops, bright kernels — all set just right.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'perf: optimize MXFP4xBF16 & INT4xFP8 CUTLASS MoE backend for SM90' accurately summarizes the main optimization work across multiple kernel components and configurations for SM90 Hopper hardware.
Description check ✅ Passed The PR description is comprehensive, detailed, and directly addresses the changes in the code.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a test suite for W4A16 MoE kernels on SM90 and optimizes kernel heuristics by disabling the cooperative scheduler for specific tile configurations and adjusting swizzle sizes for better L2 locality. Feedback suggests simplifying conditional logic in the heuristics and addressing unused functions and parameters in the test code.

Comment thread csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp Outdated
Comment thread tests/moe/test_w4a16_moe.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/moe/test_w4a16_moe.py`:
- Around line 124-125: The test currently sets check_correctness=True but never
compares numeric outputs to a reference; update the test so that when
check_correctness is True it runs only the smallest, quick case (e.g., the
minimal batch/num_experts config used for fast CI), computes a deterministic
baseline (seed RNG) or loads a stored reference output, and compares outputs
(e.g., logits/tensors) elementwise or via assert_allclose to the reference;
modify the validation block that currently only checks finiteness/shape (the
code around the finiteness/shape asserts) to perform this numeric comparison
guarded by check_correctness and only for the quick case to avoid long runtimes.
Ensure you reference the check_correctness flag and the small-case configuration
when adding the assert_allclose comparison.
- Around line 16-20: Remove the custom _is_sm90() helper and instead import and
call is_sm90a_supported() from flashinfer.utils; update any conditional test
skips in tests/moe/test_w4a16_moe.py to use is_sm90a_supported() (and add the
import), and delete the _is_sm90() definition so the test follows the repo-wide
standardized architecture check helper pattern.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0216a0c2-9850-420e-abe7-1b08b3746c09

📥 Commits

Reviewing files that changed from the base of the PR and between 25b324d and 4c2372b.

📒 Files selected for processing (3)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_mixed_input_launcher.inl
  • tests/moe/test_w4a16_moe.py

Comment thread tests/moe/test_w4a16_moe.py Outdated
Comment thread tests/moe/test_w4a16_moe.py Outdated
samuellees and others added 3 commits April 15, 2026 18:30
- Invert condition to eliminate empty if-block in heuristic (readability)
- Use is_sm90a_supported() instead of custom _is_sm90() (codebase convention)
- Remove unused _dequant_mxfp4_host, _compute_reference, check_correctness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… #12451

- Add moe_gemm_mixed_utils.cu/.h: CUDA kernels for FP4 and INT4 weight
  interleaving on Hopper, reorganizing weight layout for optimal TMA
  scheduling in mixed-precision grouped GEMM.

- Add RasterOrderOptions::Heuristic to mixed-input launcher scheduler
  config, complementing the existing max_swizzle_size=2 setting.

These are direct ports from TRTLLM PR #12451 commits cd541ba and 79315f6.

Note: The CUTLASS extension changes (mixed_input_utils.hpp LDSM pipeline,
Int4->FP8 LUT, scale_convertor in sm90_mma_array...hpp) are NOT included
in this commit. Those changes involve deep CUTLASS template restructuring
and the upstream PR has open review issues (scale tensormap not updated,
lane mapping errors). They should be migrated after the upstream PR is
finalized and merged.

AI-assisted

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu (1)

35-39: Document the swizzle and launch geometry choices.

The lane remaps here and the fixed 1024-block launches are hard to reason about without context. A short note on why Hopper mixed GEMM needs this layout, and why a simpler linear repack or shape-derived grid was not used, would make future tuning much safer.

As per coding guidelines, "For performance-critical hot paths, leave comments with justification for special algorithmic choices and mention alternative approaches considered."

Also applies to: 73-77, 95-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu`
around lines 35 - 39, Add concise in-code documentation explaining the swizzle
and launch geometry choices: annotate the computation of interleaved_lane_id
(using lane_id and partition_id) and derived indices col_id and dst_col_id, plus
the fixed 1024-block launch size, with why this particular remapping is required
for Hopper mixed GEMM (e.g., tensor core/hardware lane packing, bank conflicts,
warp-per-thread mapping) and why simpler alternatives (linear repack or
shape-derived grid) were rejected; place these comments next to the interleaving
code (interleaved_lane_id, col_id, dst_col_id) and repeat/expand the
justification near the other identical sections mentioned (the code around lines
referenced as 73-77 and 95-108) so future tuners can understand the trade-offs
and assumptions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu`:
- Around line 23-109: This file was reformatted by clang-format and the
pre-commit check fails; run clang-format (or your project's formatting tool) on
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu
to apply the expected style. Ensure you reformat the entire file and stage the
changes before committing; focus on the functions
interleave_fp4_weights_for_Hopper_mixed_gemm_kernel,
interleave_int4_weights_for_Hopper_mixed_gemm_kernel and their callers
interleave_fp4_weights_for_Hopper_mixed_gemm /
interleave_int4_weights_for_Hopper_mixed_gemm so the diff matches the project's
clang-format rules.
- Around line 27-28: The kernels in moe_gemm_mixed_utils.cu assume rows % 16 ==
0 and cols % 64 == 0, but the current launch loops (using block_id from
blockIdx.x and partition_id from threadIdx.y) can read past bounds (e.g.,
accesses like row_id + 8) and silently drop column remainders; add a host-side
validation before launching these kernels that checks the input dimensions (rows
and cols) and either (a) returns/throws an error for unsupported shapes or (b)
pads/rounds up the buffers to multiples of 16 (rows) and 64 (cols) and documents
that fallback behavior; ensure this check is performed wherever these kernels
are invoked so the loops governed by block_id/partition_id never encounter tails
that would corrupt the interleaved buffer.

In
`@csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.h`:
- Around line 17-32: Run clang-format on this header to match the repo style:
reorder and format includes and function declarations so they follow the
project's include order/spacing rules and line-wrapping conventions.
Specifically, fix the include order and spacing around the header guard/pragma
and reflow the declarations for interleave_fp4_weights_for_Hopper_mixed_gemm and
interleave_int4_weights_for_Hopper_mixed_gemm so they match the repo's
wrapped-declaration style (use the project's preferred line breaks, parameter
alignment, and trailing semicolons). Save the file after clang-format so
pre-commit no longer fails.

---

Nitpick comments:
In
`@csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu`:
- Around line 35-39: Add concise in-code documentation explaining the swizzle
and launch geometry choices: annotate the computation of interleaved_lane_id
(using lane_id and partition_id) and derived indices col_id and dst_col_id, plus
the fixed 1024-block launch size, with why this particular remapping is required
for Hopper mixed GEMM (e.g., tensor core/hardware lane packing, bank conflicts,
warp-per-thread mapping) and why simpler alternatives (linear repack or
shape-derived grid) were rejected; place these comments next to the interleaving
code (interleaved_lane_id, col_id, dst_col_id) and repeat/expand the
justification near the other identical sections mentioned (the code around lines
referenced as 73-77 and 95-108) so future tuners can understand the trade-offs
and assumptions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d1944a26-f0ae-49a7-b1ae-2bc06aa19d0a

📥 Commits

Reviewing files that changed from the base of the PR and between 931b87a and 9c4d204.

📒 Files selected for processing (3)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_mixed_input_launcher.inl
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_mixed_input_launcher.inl

Comment on lines +27 to +28
for (int block_id = blockIdx.x; block_id < rows / 2; block_id += gridDim.x) {
for (int partition_id = threadIdx.y; partition_id < cols / 64; partition_id += blockDim.y) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fail fast on unsupported matrix shapes.

These kernels only work when rows is a multiple of 16 and cols is a multiple of 64. On Line 42 and Line 80, row_id + 8 can read past the last tile when rows has a tail, and the partition_id < cols / 64 loops silently drop remainder columns. Please add a host-side check or fallback before launch so unsupported shapes do not corrupt the interleaved buffer.

Also applies to: 41-56, 66-67, 79-88, 95-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_mixed_utils.cu`
around lines 27 - 28, The kernels in moe_gemm_mixed_utils.cu assume rows % 16 ==
0 and cols % 64 == 0, but the current launch loops (using block_id from
blockIdx.x and partition_id from threadIdx.y) can read past bounds (e.g.,
accesses like row_id + 8) and silently drop column remainders; add a host-side
validation before launching these kernels that checks the input dimensions (rows
and cols) and either (a) returns/throws an error for unsupported shapes or (b)
pads/rounds up the buffers to multiples of 16 (rows) and 64 (cols) and documents
that fallback behavior; ensure this check is performed wherever these kernels
are invoked so the loops governed by block_id/partition_id never encounter tails
that would corrupt the interleaved buffer.

… #12451

Port core CUTLASS extension changes from TensorRT-LLM PR #12451 (commit
cd541ba) for Hopper mixed-precision MoE GEMM performance:

mixed_input_utils.hpp:
- Add Int4->FP8 E4M3 lookup table conversion (psx_cvt_lut_prmt_int4x8_to_fp8x8)
  with lane-dependent LUT constants for parallel thread processing
- Rename cvt_lut_bf16 -> cvt_lut_fp4_to_bf16 with thread-aware LUT indexing
- Rename psx_cvt_lut_prmt_fp4x8_to_bf16x8 -> _interleaved with optimized
  bit manipulation for interleaved weight layout
- Add copy_tensors_A/copy_tensors_SFA for separated tensor copy paths
- Add int4tofp8_lookup_table_convert template method
- Add UseInt4ToFP8LookupTable constraint flag

sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp:
- Restructure A operand loading with LDSM-based copy/retiling using
  SM75_U32x4_LDSM_N for improved memory access patterns
- Add scale_convertor template for FP2M1 scale type conversion
- Add TensormapUpdateShapesStridesForAandScale flag for conditional
  tensormap shape/stride updates
- Separate scale copying into dedicated copy_tensors_SFA calls
- Enhance tensormaps_replace_global_address/properties/cp_fence_release
  with conditional update logic

Data format note: The interleaved weight layout requires weights to be
preprocessed with interleave_fp4_weights_for_Hopper_mixed_gemm (added in
prior commit) before being passed to the GEMM kernel.

AI-assisted

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/detail/collective/mixed_input_utils.hpp (1)

652-669: Misleading comment should be updated.

The comment on line 652 states KernelConversionMode == ConversionMode::DirectConvert, but this function now dispatches to UseFP4ToBF16LookupTable and UseInt4ToFP8LookupTable paths, which are only enabled when KernelConversionMode == ConversionMode::ConvertAndScale (per the definitions in the collective header). Consider updating or removing the comment to reflect the actual conversion modes handled.

Suggested comment update
-        // KernelConversionMode == ConversionMode::DirectConvert
+        // Type conversion dispatch: LUT paths for FP4→BF16 and INT4→FP8, generic converter otherwise
         CUTLASS_PRAGMA_UNROLL
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/detail/collective/mixed_input_utils.hpp`
around lines 652 - 669, The comment "KernelConversionMode ==
ConversionMode::DirectConvert" is now misleading because this block also
dispatches to UseFP4ToBF16LookupTable and UseInt4ToFP8LookupTable conversion
paths; update or remove the comment to accurately reflect supported modes (e.g.,
mention both DirectConvert and ConvertAndScale or remove mode-specific wording).
Modify the comment near the loop that references
KernelConversionMode/ConversionMode::DirectConvert and ensure it calls out the
conditional branches UseFP4ToBF16LookupTable, UseInt4ToFP8LookupTable, and the
fallback LayoutAwareConvert (and the called helpers
fp4tobf16_lookup_table_convert and int4tofp8_lookup_table_convert) so the
comment matches actual behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/detail/collective/mixed_input_utils.hpp`:
- Around line 652-669: The comment "KernelConversionMode ==
ConversionMode::DirectConvert" is now misleading because this block also
dispatches to UseFP4ToBF16LookupTable and UseInt4ToFP8LookupTable conversion
paths; update or remove the comment to accurately reflect supported modes (e.g.,
mention both DirectConvert and ConvertAndScale or remove mode-specific wording).
Modify the comment near the loop that references
KernelConversionMode/ConversionMode::DirectConvert and ensure it calls out the
conditional branches UseFP4ToBF16LookupTable, UseInt4ToFP8LookupTable, and the
fallback LayoutAwareConvert (and the called helpers
fp4tobf16_lookup_table_convert and int4tofp8_lookup_table_convert) so the
comment matches actual behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4bacfaeb-b7f8-4a6f-b27f-76978cce257c

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4d204 and 31f80e2.

📒 Files selected for processing (2)
  • csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/detail/collective/mixed_input_utils.hpp
  • csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp

samuellees and others added 3 commits April 17, 2026 03:30
Benchmark wMXFP4 x BF16 MoE GEMM on H20 (SM90) with configurable
batch size, TP/EP splits, and expert/hidden/intermediate dimensions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ctness checks

Revert mixed_input_utils.hpp and sm90_mma_array...hpp to main branch versions.
The TRTLLM PR #12451 versions require interleaved weight layout preprocessing
which is incompatible with FlashInfer's current API — all 18 tests failed with
95.8% element mismatch and max abs error 37.25 (expected < 0.1).

The heuristic, launcher scheduler, and weight interleave utility changes from
TRTLLM PR #12451 are retained as they don't change the GEMM kernel data path.

Also update test_w4a16_moe.py to include dequant-based reference correctness
verification (matching test_trtllm_cutlass_fused_moe.py, rtol=1e-1, atol=1e-1).

AI-assisted

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@samuellees samuellees marked this pull request as draft April 18, 2026 13:18
…ty only

The dequant-based reference comparison (rtol=1e-1, atol=1e-1) only works
reliably at small problem sizes (hidden=128, experts=2-8), matching the
original test_trtllm_cutlass_fused_moe.py methodology. At large sizes
(hidden=4096, experts=256), cumulative FP4 quantization error exceeds
the tolerance threshold.

- CORRECTNESS_CONFIGS: small sizes with strict assert_close
- COVERAGE_CONFIGS: large sizes with finite+shape sanity checks
- ACTIVATION_CONFIGS: small sizes with strict assert_close

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@samuellees samuellees changed the title feat: optimize W4A16 (wMXFP4 x BF16) MoE GEMM for SM90 perf: optimize MXFP4xBF16 & INT4xFP8 CUTLASS MoE backend for SM90 Apr 20, 2026
samuellees and others added 5 commits April 20, 2026 19:33
…ntouched

Moves all PR-specific test coverage out of
tests/moe/test_trtllm_cutlass_fused_moe.py and into two new dedicated files,
so the existing upstream test matrix stays at its original (bs=1, h=128)
scale and the PR does not bundle matrix bumps that regressed unrelated
SM100/SM120 MXFP8 paths.

Files
-----
- tests/moe/test_trtllm_cutlass_fused_moe.py:
  reverted to origin/main (drops the BATCH_SIZES/HIDDEN_SIZES/INTERMEDIATE_SIZES
  bump from (1, 128, 128) to (16, 512, 512), drops the interleave calls added
  inside test_moe_bf16_mxfp4 and test_moe_w4a8, drops the act-scale dtype /
  reference input_scale tweaks inside test_moe_w4a8). All of those logical
  fixes are re-applied in the new files below, but only for the PR's own
  SM90 mixed-input paths.

- tests/moe/test_w4a16_moe.py (restored from earlier history):
  MXFP4 x BF16 (W4A16) SM90 tests. Covers 5 CORRECTNESS_CONFIGS
  (h=128..4096) under strict assert_close, 8 COVERAGE_CONFIGS (primary target
  e=256/topk=6/h=4096/n=2048 stressed along batch / shape / expert-count /
  top_k axes) under a 99.9%-elements-within-tolerance check, 3 SwiGLU
  ACTIVATION_CONFIGS, and a single primary-target smoke. All inputs go
  through interleave_moe_weights_for_hopper_mixed_gemm("fp4") +
  interleave_moe_scales_for_hopper_mixed_gemm before cutlass_fused_moe.
  Reference dequantizes MXFP4 on-device via a small FP4 LUT and only covers
  the top-k active experts (avoids OOM on e=256).

- tests/moe/test_w4a8_moe.py (new):
  INT4 x FP8_e4m3 (W4A8) SM90 tests, dedicated to the W4A8 AWQ-style path
  that uses per-group weight scales + per-channel pre-quant act scales.
  Mirrors the structure of test_w4a16_moe.py (correctness / coverage /
  autotune / core_config) and parametrizes both bf16 and fp16 output dtypes
  on the correctness sweep. Weights go through
  interleave_moe_weights_for_hopper_mixed_gemm("int4"); weight scales use
  TRTLLM's factor-4/2/1 reshape+permute and the SM90 bf16-bitpattern trick;
  activation scales stay in the native dtype (consumed by
  expandInputRows / applyPrequantScale as OutputType).
  fc1_input_scale for the reference is a broadcast max over experts — the
  kernel folds per-expert input scales into a single divisor, so per-expert
  scales in the reference would double-correct and diverge.

AI-assisted
The SM90 W4A8 kernel carries enough FP8 + INT4 accumulation noise that the
assert_close(rtol=1e-2, atol=1e-1) tolerance vs a float32 PyTorch reference
is only achievable at h == intermediate_size == 512 with num_experts == 2.
This matches the envelope the upstream CI test already uses; going beyond
it fails in the upstream test too (independently verified on H200:
e=2/h=2048 and e=8/h=512 both fail in upstream).

Shrinks the file to two strict configs plus one autotune smoke, drops the
coverage / core_config sweeps that exceeded the envelope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the dedicated test_w4a16_moe.py / test_w4a8_moe.py coverage into
test_trtllm_cutlass_fused_moe.py as six new test functions:

  test_moe_bf16_mxfp4_hopper_correctness   (5 configs)
  test_moe_bf16_mxfp4_hopper_coverage      (8 configs, percent >= 99.9%)
  test_moe_bf16_mxfp4_hopper_activations   (3 swiglu variants)
  test_moe_w4a8_hopper_correctness         (2 configs x 2 dtypes)
  test_moe_w4a8_hopper_autotune            (smoke)

Reuses existing module-level helpers (compute_routing, dequantize_int4_to_dtype,
torch_moe_w4a8). Adds a GPU-side MXFP4 dequant + active-expert reference
compute to keep e=256 / h=4096 / n=2048 coverage from OOMing vs
dequant_mxfp4_batches_host.

21/21 green on H200 (bf16 + fp16, strict assert_close for correctness, 99.9%
percent-based for coverage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously batch was only covered at m=4 / m=16; add a single-token entry
so the W4A8 strict envelope matches the W4A16 side (which already covers
m=1 in both correctness and coverage). Verified on H200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-requested trim: remove the m=16 batch across W4A16 coverage, W4A8
correctness, and W4A8 autotune. Replace the six W4A16 coverage entries
that relied on m=16 (to stress non-batch axes) with m=4; drop three
configs that OOM on H200 under parametrize fragmentation
(m=512 / h=7168-e256 / n=4096-e256).

Final envelope: 18 tests, H200 5.21s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu Outdated
Comment thread csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu Outdated
Comment thread csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu Outdated
Comment thread tests/moe/test_trtllm_cutlass_fused_moe.py Outdated
Comment thread tests/moe/test_trtllm_cutlass_fused_moe.py
samuellees added a commit to samuellees/flashinfer that referenced this pull request Apr 21, 2026
Addresses review feedback (samuellees on PR flashinfer-ai#3084): function names should
use the SM arch identifier (sm90) rather than the marketing name
(Hopper/hopper), for consistency with the rest of the fused_moe API.

Renamed across:
  C++ kernel impl (moe_gemm_mixed_utils.{cu,h})
  C++ binding + TVM_FFI export (flashinfer_cutlass_fused_moe_binding.cu)
  Python helpers + docstrings (flashinfer/fused_moe/core.py, __init__.py)
  Tests (tests/moe/test_trtllm_cutlass_fused_moe.py)

External references to TRT-LLM's upstream
interleave_4bit_weights_for_Hopper_mixed_gemm retain the original name
(it's still called that in TRT-LLM). Also tightened the block comment
introducing the new tests to no longer depend on cross-references into
the upstream test_moe_w4a8.

Tests: 18/18 green on H200 (cache-cold build + test run 13 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review feedback (samuellees on PR flashinfer-ai#3084): function names should
use the SM arch identifier (sm90) rather than the marketing name
(Hopper/hopper), for consistency with the rest of the fused_moe API.

Renamed across:
  C++ kernel impl (moe_gemm_mixed_utils.{cu,h})
  C++ binding + TVM_FFI export (flashinfer_cutlass_fused_moe_binding.cu)
  Python helpers + docstrings (flashinfer/fused_moe/core.py, __init__.py)
  Tests (tests/moe/test_trtllm_cutlass_fused_moe.py)

External references to TRT-LLM's upstream
interleave_4bit_weights_for_Hopper_mixed_gemm retain the original name
(it's still called that in TRT-LLM). Also tightened the block comment
introducing the new tests to no longer depend on cross-references into
the upstream test_moe_w4a8.

Tests: 18/18 green on H200 (cache-cold build + test run 13 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samuellees samuellees force-pushed the feat/w4a16-moe-kernel branch from 9388f37 to cb90611 Compare April 21, 2026 11:57
@samuellees
Copy link
Copy Markdown
Collaborator Author

/bot run

@flashinfer-bot
Copy link
Copy Markdown
Collaborator

GitLab MR !569 has been updated with latest changes, and the CI pipeline #49088402 is currently running. I'll report back once the pipeline job completes.

samuellees and others added 2 commits April 21, 2026 05:42
- clang-format / ruff-format reflow triggered by the earlier
  Hopper → sm90 rename (shorter names let the formatters pack more on
  fewer lines).
- Add the two new public helpers to docs/api/fused_moe.rst so Sphinx
  autodoc picks them up:
    * interleave_moe_weights_for_sm90_mixed_gemm
    * interleave_moe_scales_for_sm90_mixed_gemm

mypy errors reported by pre-commit (`ActivationType` etc. "not defined")
are pre-existing — same 74 errors on plain origin/main — and come from
the wildcard `from ..tllm_enums import *` in flashinfer/fused_moe/core.py.
Not touched by this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samuellees
Copy link
Copy Markdown
Collaborator Author

/bot run

@flashinfer-bot
Copy link
Copy Markdown
Collaborator

GitLab MR !569 has been updated with latest changes, and the CI pipeline #49090861 is currently running. I'll report back once the pipeline job completes.

@samuellees
Copy link
Copy Markdown
Collaborator Author

@flashinfer-bot rerun failed

The new SM90 mixed-input MoE kernel (ported from TRT-LLM PR #12451)
expects weights and MXFP4 block scales in an interleaved byte layout.
The upstream tests passed raw weights / raw scales, which produced stale
output:

  test_moe_bf16_mxfp4[*-128-2-2-128-1]: 18-54% mismatched on H200/H100
  test_moe_w4a8: passes at h=128 by tolerance, but is wrong layout-wise

Both tests now apply the public preprocessing helpers added by this PR:

  test_moe_bf16_mxfp4:
    + interleave_moe_weights_for_sm90_mixed_gemm(w, "fp4")
    + interleave_moe_scales_for_sm90_mixed_gemm(w_scale)

  test_moe_w4a8:
    + interleave_moe_weights_for_sm90_mixed_gemm(w, "int4")
    (scale interleave was already done via the local interleave_weights
     helper.)

7 H200 tests under upstream test_moe_{bf16_mxfp4,w4a8} green; together
with the 18 new sm90-hopper tests this PR added: 25/25 in 5.3s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samuellees
Copy link
Copy Markdown
Collaborator Author

/bot run

@flashinfer-bot
Copy link
Copy Markdown
Collaborator

GitLab MR !569 has been updated with latest changes, and the CI pipeline #49156313 is currently running. I'll report back once the pipeline job completes.

@samuellees
Copy link
Copy Markdown
Collaborator Author

/bot run

@flashinfer-bot
Copy link
Copy Markdown
Collaborator

GitLab MR !569 has been updated with latest changes, and the CI pipeline #49156422 is currently running. I'll report back once the pipeline job completes.

@samuellees samuellees enabled auto-merge (squash) April 22, 2026 14:41
Copy link
Copy Markdown
Collaborator

@aleozlx aleozlx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@samuellees samuellees merged commit d454492 into flashinfer-ai:main Apr 23, 2026
76 of 78 checks passed
@samuellees samuellees deleted the feat/w4a16-moe-kernel branch April 24, 2026 01:57
@aleozlx aleozlx mentioned this pull request Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants