
feat: expose swizzled_input_sf parameter for CUTLASS fused MOE#2330

Merged
jiahanc merged 7 commits into main from claude/issue-2200-20260111-0748 on Mar 30, 2026

Conversation

@yzh119
Collaborator

@yzh119 yzh119 commented Jan 11, 2026

Summary

  • Add swizzled_input_sf parameter to allow users to control whether the input scaling factor is swizzled
  • Enables fusion of the swizzle operation into the MOE kernel after FP4 allgather/alltoall operations
  • Default value is True to maintain backward compatibility

Closes #2200
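The two call conventions this parameter enables can be illustrated with a toy model (pure Python; `swizzle` and `toy_moe` are hypothetical stand-ins, not the flashinfer API): the consumer either receives a pre-swizzled scale-factor buffer, or receives the linear buffer plus a flag and fuses the reordering itself, and both paths must produce identical output.

```python
# Toy illustration of the swizzled_input_sf contract; all names here are
# hypothetical and unrelated to the real flashinfer/CUTLASS code.

def swizzle(sf, tile=4):
    """Reorder a flat scale-factor list into an interleaved layout --
    a stand-in for the hardware-friendly swizzled layout."""
    n = len(sf)
    return [sf[(i % tile) * (n // tile) + i // tile] for i in range(n)]

def toy_moe(sf, swizzled_input_sf=True):
    """Consume scale factors; if they arrive linear, fuse the swizzle."""
    if not swizzled_input_sf:
        sf = swizzle(sf)  # fused reordering inside the "kernel"
    # Position-dependent compute, so the layout actually matters.
    return [v * (i + 1) for i, v in enumerate(sf)]

linear_sf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Path A: caller pre-swizzles (the old behavior, default True).
out_a = toy_moe(swizzle(linear_sf), swizzled_input_sf=True)
# Path B: caller passes the linear layout (e.g. fresh off an allgather)
# and lets the kernel swizzle.
out_b = toy_moe(linear_sf, swizzled_input_sf=False)
assert out_a == out_b
```

Path B is the new opt-in behavior; path A remains the default, which is why the flag defaults to True.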

Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added explicit control over input swizzling for Mixture-of-Experts ops via a new boolean parameter (defaults to enabled).
  • API Changes

    • Public MoE operation signatures now accept the swizzling parameter and have adjusted argument ordering; wrappers and docs updated accordingly.
  • Tests

    • Added a CUDA-gated test validating FP4/NVFP4 behavior with swizzled vs linear input layouts.

@coderabbitai
Contributor

coderabbitai Bot commented Jan 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new swizzled_input_sf: bool parameter propagated from Python through TVM/C++ bindings into the CUDA fused MoE kernel; removes internal hard-coded swizzle defaults and adds a GPU test validating swizzled vs unswizzled input_sf handling.

Changes

Cohort / File(s) Summary
CUDA C++ Binding
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
Added bool swizzled_input_sf to FusedMoeRunner::runMoe and runMoeMinLatency signatures; removed internal default swizzle flags and forward caller-provided swizzled_input_sf into kernel runner calls and TVM GetFunction lambdas for "run_moe" / "run_moe_min_latency".
Python API
flashinfer/fused_moe/core.py
Added swizzled_input_sf: bool = True to cutlass_fused_moe, _fake_cutlass_fused_moe, and the exported flashinfer_api wrapper; threaded the parameter into the TVM/C++ invocation and updated docstring to document input_sf layout semantics.
Tests
tests/moe/test_trtllm_cutlass_fused_moe.py
Added test_moe_nvfp4_unswizzled_input_sf() (gated to SM100+) that quantizes with swizzled vs. linear input_sf, calls cutlass_fused_moe with the matching swizzled_input_sf flag, and asserts the outputs match.

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant PyAPI as Python API\n(cutlass_fused_moe)
    participant TVM as TVM Binding\n(GetFunction)
    participant Cpp as C++ Runner\n(FusedMoeRunner)
    participant Kernel as CUDA Kernel

    User->>PyAPI: call cutlass_fused_moe(..., swizzled_input_sf)
    PyAPI->>TVM: forward args including swizzled_input_sf
    TVM->>Cpp: invoke lambda with swizzled_input_sf
    Cpp->>Kernel: launch kernel using swizzled_input_sf
    Kernel-->>Cpp: kernel completes
    Cpp-->>TVM: return results
    TVM-->>PyAPI: return outputs
    PyAPI-->>User: return outputs

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run-ci

Suggested reviewers

  • djmmoss
  • cyx-6
  • aleozlx
  • wenscarl
  • jiahanc
  • jimmyzho
  • nv-yunzheq
  • IwakuraRein

Poem

🐇 I tunneled a flag from Py to C,

swizzled or straight — now both roam free.
From quant to kernel my whiskers trace,
matching layouts, outputs in place.
A nibble of code, a carrot of grace.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 30.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check ✅ — The title accurately summarizes the main feature addition: exposing a swizzled_input_sf parameter for CUTLASS fused MOE.
  • Description check ✅ — The description covers the summary, related issues, and the template checklist items, though some checklist items lack explicit confirmation.
  • Linked Issues check ✅ — The implementation fully addresses issue #2200 by exposing the swizzled_input_sf parameter across the C++ binding, Python API, and tests, with a backward-compatible default of True.
  • Out of Scope Changes check ✅ — All changes are scoped to the swizzled_input_sf feature; no unrelated modifications detected.


@gemini-code-assist
Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the CUTLASS fused Mixture-of-Experts (MOE) implementation by exposing a new configuration parameter, swizzled_input_sf. This parameter provides users with explicit control over the swizzling behavior of the input scaling factor, which is crucial for optimizing performance by enabling the fusion of the swizzle operation directly into the MOE kernel. The default setting maintains existing behavior, ensuring a smooth transition for current users while offering new optimization opportunities.

Highlights

  • New Parameter Introduction: A new boolean parameter, swizzled_input_sf, has been introduced to the CUTLASS fused MOE kernel, allowing explicit control over whether the input scaling factor is swizzled.
  • Performance Optimization: This parameter enables the fusion of the swizzle operation directly into the MOE kernel, specifically after FP4 allgather/alltoall operations, which can lead to performance improvements.
  • Backward Compatibility: The swizzled_input_sf parameter defaults to True to ensure backward compatibility with existing implementations and workflows.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (1)

710-931: Critical: swizzled_input_sf parameter missing from public API function.

The new swizzled_input_sf parameter was added to the inner cutlass_fused_moe function (line 510) and passed to run_moe (line 627), but it's missing from:

  1. The public cutlass_fused_moe function signature (lines 710-740)
  2. The call to the inner function (lines 901-931)

This means users cannot actually control the swizzle behavior through the public API, defeating the purpose of this PR.

🐛 Proposed fix

Add the parameter to the public function signature (after swiglu_limit):

     swiglu_alpha: Optional[torch.Tensor] = None,
     swiglu_beta: Optional[torch.Tensor] = None,
     swiglu_limit: Optional[torch.Tensor] = None,
+    swizzled_input_sf: bool = True,
     tp_size: int = 1,

And pass it in the call (around line 916):

         swiglu_alpha,
         swiglu_beta,
         swiglu_limit,
+        swizzled_input_sf,
         tp_size,

Also add documentation for the parameter in the docstring.

csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (1)

420-433: Pipeline failure: clang-format check failed.

The CI indicates a formatting issue around line 424. Based on the code structure, ensure consistent formatting of the function signature.

Run clang-format on this file to fix the formatting:

clang-format -i csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2062dec and 1a50f5a.

📒 Files selected for processing (2)
  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
  • flashinfer/fused_moe/core.py
🧰 Additional context used
📓 Path-based instructions (2)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/fused_moe/core.py
csrc/**/*.cu

📄 CodeRabbit inference engine (CLAUDE.md)

Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers

Files:

  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
🧬 Code graph analysis (1)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (2)
flashinfer/comm/mapping.py (1)
  • tp_rank (325-326)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1)
  • enable_pdl (220-220)
🪛 GitHub Actions: pre-commit
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu

[error] 424-424: clang-format check failed. Files were modified by this hook during pre-commit.

🪛 Ruff (0.14.10)
flashinfer/fused_moe/core.py

669-669: Unused function argument: swizzled_input_sf

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Deploy Docs
  • GitHub Check: claude-review
🔇 Additional comments (5)
flashinfer/fused_moe/core.py (2)

653-701: The swizzled_input_sf parameter is correctly added to the fake op.

The static analysis warning about the unused parameter is expected for fake ops, as they only define output shapes and dtypes without executing actual logic.


510-510: Parameter correctly threaded through inner function to runtime call.

The swizzled_input_sf parameter is properly added with default True for backward compatibility, and correctly passed to the underlying run_moe call.

Also applies to: 627-627

csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (3)

246-250: Parameter correctly integrated into runMoe method.

The swizzled_input_sf parameter is properly:

  1. Added to the function signature after swiglu_limit
  2. Passed through to mKernelRunner->runMoe in both USING_OSS_CUTLASS_MOE_GEMM and non-OSS paths

Also applies to: 385-400


426-433: Parameter correctly integrated into runMoeMinLantency method.

The swizzled_input_sf is properly threaded through the min-latency execution path in both OSS and non-OSS code paths.

Also applies to: 569-602


718-758: TVM-FFI bindings correctly updated.

Both run_moe and run_moe_min_latency function bindings properly include the new swizzled_input_sf parameter and pass it through to the respective runner methods. This aligns with the coding guidelines for framework bindings in csrc/.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request exposes a new swizzled_input_sf parameter for the CUTLASS fused MoE implementation, allowing control over whether the input scaling factor is swizzled. The changes are well-implemented, consistently propagating the new parameter from the Python API down to the C++ kernel calls. The default value is set to True to maintain backward compatibility, as described. The code looks good, but I have one suggestion to improve documentation by adding the new parameter to the function's docstring.

swiglu_alpha: Optional[torch.Tensor] = None,
swiglu_beta: Optional[torch.Tensor] = None,
swiglu_limit: Optional[torch.Tensor] = None,
swizzled_input_sf: bool = True,

medium

The docstring for cutlass_fused_moe (starting on line 741) is missing documentation for the new swizzled_input_sf parameter. Please add it for completeness and to inform users about this new option.

@claude

claude Bot commented Jan 11, 2026

Code Review

I've reviewed PR #2330 and overall the implementation looks good. The change cleanly exposes the swizzled_input_sf parameter that was previously hardcoded, enabling the fusion optimization mentioned in issue #2200.

✅ Strengths

  1. Backward Compatibility: The default value of True maintains existing behavior, preventing any breaking changes for current users.

  2. Clean Implementation: The parameter is properly threaded through all layers:

    • Python API (flashinfer/fused_moe/core.py)
    • TVM-FFI bindings (csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu)
    • Both run_moe and run_moe_min_latency code paths
  3. Consistent Placement: The parameter is placed logically in the parameter list, after swiglu_limit and before the parallelism parameters (tp_size, etc.).

  4. Proper Scope: The change removes the hardcoded swizzled_input_sf = true values in the C++ binding layer, as they should be controlled by the user.

🔍 Potential Issues & Suggestions

1. Missing Test Coverage (High Priority)

The PR adds a new user-facing parameter but doesn't include any tests exercising it. This is risky because:

  • We can't verify the parameter actually works as intended
  • Future refactoring might break this functionality without detection
  • The issue mentions a specific use case (FP4 allgather/alltoall), which should be validated

Recommendation: Add at least one test case that calls cutlass_fused_moe with swizzled_input_sf=False. The existing test_moe_nvfp4 test in tests/moe/test_trtllm_cutlass_fused_moe.py (lines 484-632) would be a good template since it already uses input_sf.

Example:

@pytest.mark.parametrize("swizzled_input_sf", [True, False])
def test_moe_nvfp4_swizzled_input_sf(batch_size, ..., swizzled_input_sf):
    # ... existing setup ...
    _ = fused_moe.cutlass_fused_moe(
        hidden_states,
        selected_experts.to(torch.int),
        routing_weights,
        w1_q.contiguous().view(torch.long),
        w2_q.contiguous().view(torch.long),
        otype,
        quant_scales=quant_scales,
        input_sf=input_sf,
        swizzled_input_sf=swizzled_input_sf,  # Test the new parameter
        output=flash_output,
        activation_type=activation_type,
    )

2. Missing Documentation (Medium Priority)

The parameter lacks a docstring explaining:

  • What "swizzled" means in this context
  • When to set it to False vs True
  • The performance implications

Recommendation: Add parameter documentation to the cutlass_fused_moe function (around line 495 in core.py). Reference the TensorRT-LLM PR (#6231) for technical details if needed.

Example docstring addition:

def cutlass_fused_moe(
    ...
    swizzled_input_sf: bool = True,  # Whether the input scaling factor is swizzled
    ...
):
    """
    ...
    Args:
        ...
        swizzled_input_sf: Whether the input scaling factor (input_sf) is already 
            swizzled. Set to False when the swizzle operation should be fused into 
            the MOE kernel (e.g., after FP4 allgather/alltoall operations). 
            Default: True (maintains backward compatibility).
        ...
    """

3. Fake Op Signature (Low Priority)

The _fake_cutlass_fused_moe function at line 654 correctly includes the new parameter, maintaining consistency with the real op. Good job on this detail!

🔒 Security & Performance

  • No security concerns: This is a boolean configuration parameter with no injection risks
  • No performance regression: Default behavior unchanged; new behavior is opt-in

📋 Code Quality

  • Formatting: Clean, follows project conventions
  • Consistency: Parameter properly added to all necessary locations
  • Type Safety: Correct type annotation (bool)

Summary

This is a solid implementation that enables an important performance optimization. The main gap is test coverage—adding even a basic test would significantly increase confidence in this change.

Recommendation: Request the author to add test coverage before merging. Otherwise, the implementation is ready.


Review generated with assistance from Claude Code

@samuellees samuellees self-assigned this Mar 27, 2026
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 1a50f5a to ff12b88 on March 27, 2026 09:55
Add swizzled_input_sf parameter to allow users to control whether
the input scaling factor is swizzled. This enables fusion of the
swizzle operation into the MOE kernel after FP4 allgather/alltoall
operations.
Changes:
- Add swizzled_input_sf parameter to cutlass_fused_moe Python API
  with default value of True (maintaining backward compatibility)
- Update C++ TVM-FFI bindings to accept and pass through the parameter
- Remove hardcoded swizzled_input_sf = true in binding layer
Closes #2200
Co-authored-by: Zihao Ye <yzh119@users.noreply.github.com>
Co-authored-by: Kesen Li <lsam@nvidia.com>
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from ff12b88 to 50dc605 on March 27, 2026 09:58
@claude

claude Bot commented Mar 27, 2026

Code Review

Summary: Clean, focused change that correctly promotes swizzled_input_sf from a hardcoded internal default to an explicit caller-controlled parameter. The threading from Python API through C++ bindings is consistent and correct.

Issues

  1. Missing docstring for the new parameter

The public cutlass_fused_moe docstring (core.py around line 672) documents input_sf, swiglu_alpha, etc. but has no entry for swizzled_input_sf. Since this is a user-visible parameter with non-obvious semantics, it deserves documentation covering: what swizzling means, when to pass False (e.g. after FP4 allgather/alltoall), and that it only matters when input_sf is provided.

  2. No validation that swizzled_input_sf=False is meaningful

A caller can pass swizzled_input_sf=False with a non-quantized input where input_sf=None. This is likely a silent misconfiguration. Consider adding a guard analogous to the existing checks around line 497-500 in the binding:

TVM_FFI_ICHECK(swizzled_input_sf || input_sf.has_value())
    << "swizzled_input_sf=false is only meaningful when input_sf is provided";
  3. No test coverage for swizzled_input_sf=False

The change introduces a new code path but no tests exercise it. A test calling with swizzled_input_sf=False on a pre-swizzled input_sf would validate the new use case and guard against regressions.

Minor observations

  • The C++ variable rename from swizzled_input_sf_ml to swizzled_input_sf in runMoeMinLatency is a good cleanup.
  • _fake_cutlass_fused_moe (torch.compile abstract impl) correctly accepts but ignores the parameter -- right pattern for fake impls.
  • Pre-existing: docstring at line 670 says "GEMM1 biases" for fc2_expert_biases (copy-paste typo, unrelated to this PR).

Overall the change is correct and the default of True properly maintains backward compatibility. Main gaps are documentation and a test for the False path.

The original commit only added the parameter to the internal
@register_custom_op function and fake op, but missed the public
cutlass_fused_moe() function signature and its call to the internal
function. This caused the parameter to be silently ignored.

Co-authored-by: Zihao Ye <yzh119@users.noreply.github.com>
Co-authored-by: Kesen Li <lsam@nvidia.com>
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 50dc605 to 2bc2a26 on March 27, 2026 10:05
@claude

claude Bot commented Mar 27, 2026

Code Review

This is a clean, focused change that exposes swizzled_input_sf to callers rather than hard-coding it to true. Overall it looks correct. A few observations:

Correctness

Argument ordering is consistent across both paths. In the Python dispatcher (core.py:491-518), swizzled_input_sf is placed after swiglu_limit and before *min_latency_output, which maps correctly to both the run_moe and run_moe_min_latency C++ signatures (the min-latency tensors only appear when *min_latency_output expands). Good.

The #else branch (flashinfer_cutlass_fused_moe_binding.cu:418) hardcodes enable_alltoall as false -- pre-existing issue, not introduced here. In the non-USING_OSS_CUTLASS_MOE_GEMM build, alltoall is always disabled regardless of what the caller requests. Consider adding a TODO comment near this site so future maintainers are aware.

Documentation gap

swizzled_input_sf is added to the public cutlass_fused_moe signature (line 602) and the inner registered op (line 388), but it is missing from the docstring (lines 666-746) which documents every other parameter. Please add an entry explaining what swizzled vs linear layout means for the input scaling factor and when callers should pass False (e.g., post-allgather/alltoall FP4 scenario described in the PR summary).

No input validation

There is no guard that swizzled_input_sf=False is only meaningful when input_sf is provided. Passing swizzled_input_sf=False with a non-FP4 dtype propagates silently to the kernel. A small defensive check would help catch misuse:

if not swizzled_input_sf and input_sf is None:
    raise ValueError("swizzled_input_sf=False requires input_sf to be provided")

Test coverage

The new test test_moe_nvfp4_unswizzled_input_sf is well-structured. Using rtol=0, atol=0 is the right call since swizzling is a deterministic layout transformation.

One gap: the test only exercises the Blackwell (SM100/110/120) code path. The SM90 path through FusedMoeRunner::runMoe (non-min-latency) is not exercised with swizzled_input_sf=False. A test covering SM90 would increase confidence if that hardware is available in CI.

Minor nit

The existing typo runMoeMinLantency ("Lantency" vs "Latency") in the binding file is pre-existing and unrelated to this PR.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (2)

588-619: ⚠️ Potential issue | 🟠 Major

Don’t insert the new public flag ahead of existing positional args.

swizzled_input_sf=True does not preserve backward compatibility here: existing positional callers will silently shift tp_size, tp_rank, ep_size, and the rest into the wrong slots. Please append the new parameter at the end of the public API surface and forward it by keyword; while you’re here, the docstring below should explain what False means.

Suggested fix
-    swizzled_input_sf: bool = True,
     tp_size: int = 1,
     tp_rank: int = 0,
     ep_size: int = 1,
     ep_rank: int = 0,
     cluster_size: int = 1,
@@
     use_packed_weights: bool = False,
     tune_max_num_tokens: int = 8192,
     enable_pdl: Optional[bool] = None,
     activation_type: ActivationType = ActivationType.Swiglu,
+    swizzled_input_sf: bool = True,
 ) -> torch.Tensor:
@@
-        swizzled_input_sf,
         tp_size,
         tp_rank,
         ep_size,
         ep_rank,
         cluster_size,
         cluster_rank,
+        swizzled_input_sf=swizzled_input_sf,
         use_packed_weights=use_packed_weights,

Also applies to: 780-811

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/core.py` around lines 588 - 619, The new public boolean
parameter swizzled_input_sf was inserted before existing positional arguments in
cutlass_fused_moe which breaks backward compatibility; move swizzled_input_sf to
the end of the parameter list (after activation_type) so existing positional
callers keep the same argument mapping, update the function docstring to
describe what swizzled_input_sf=False means, and ensure any internal calls or
forwards pass it by keyword (swizzled_input_sf=...) rather than position; make
the same relocation and docstring/forwarding change for the other
overloaded/duplicate function occurrence mentioned in the review (the second
cutlass_fused_moe-like signature around the later block).

531-562: ⚠️ Potential issue | 🟠 Major

Add activation_type to _fake_cutlass_fused_moe.

The real custom op still accepts activation_type, and the wrapper forwards it at Line 810. Leaving it out here makes the fake schema diverge from the real one, which can break torch.compile / fake-tensor execution.

Suggested fix
-        enable_pdl: Optional[bool] = None,
-        use_packed_weights: bool = False,
+        enable_pdl: Optional[bool] = None,
+        activation_type: ActivationType = ActivationType.Swiglu,
+        use_packed_weights: bool = False,

Based on learnings: fake ops decorated with register_fake_op in flashinfer/fused_moe must exactly mirror the corresponding real op signatures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/core.py` around lines 531 - 562, The fake op function
_fake_cutlass_fused_moe is missing the activation_type parameter, causing its
signature to diverge from the real custom op and breaking
torch.compile/fake-tensor paths; add an activation_type: Optional[int] (or the
same type used by the real op) parameter to _fake_cutlass_fused_moe and ensure
it is accepted and forwarded just like the wrapper does (the wrapper already
forwards activation_type), so the fake op signature exactly mirrors the real op.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu`:
- Around line 430-431: The function signature/call formatting around the
parameters Optional<TensorView> swiglu_limit, bool swizzled_input_sf, TensorView
num_active_experts_per_node, and the similar block at the other occurrence must
be normalized by running clang-format on the file so pre-commit stops rewriting
it; run clang-format with the project's config (or run the repo's pre-commit
hook) to reflow the parameter lists and call sites, save and stage the updated
file so CI no longer flags the formatting differences.

---

Outside diff comments:
In `@flashinfer/fused_moe/core.py`:
- Around line 588-619: The new public boolean parameter swizzled_input_sf was
inserted before existing positional arguments in cutlass_fused_moe which breaks
backward compatibility; move swizzled_input_sf to the end of the parameter list
(after activation_type) so existing positional callers keep the same argument
mapping, update the function docstring to describe what swizzled_input_sf=False
means, and ensure any internal calls or forwards pass it by keyword
(swizzled_input_sf=...) rather than position; make the same relocation and
docstring/forwarding change for the other overloaded/duplicate function
occurrence mentioned in the review (the second cutlass_fused_moe-like signature
around the later block).
- Around line 531-562: The fake op function _fake_cutlass_fused_moe is missing
the activation_type parameter, causing its signature to diverge from the real
custom op and breaking torch.compile/fake-tensor paths; add an activation_type:
Optional[int] (or the same type used by the real op) parameter to
_fake_cutlass_fused_moe and ensure it is accepted and forwarded just like the
wrapper does (the wrapper already forwards activation_type), so the fake op
signature exactly mirrors the real op.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: acd62a87-2bbf-4eab-9aa8-33ade59826cb

📥 Commits

Reviewing files that changed from the base of the PR and between 1a50f5a and ff12b88.

📒 Files selected for processing (2)
  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
  • flashinfer/fused_moe/core.py

@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 2bc2a26 to 92d0912 on March 27, 2026 10:12
Verifies that passing a linear-layout input_sf with swizzled_input_sf=False
produces identical output to a swizzled-layout input_sf with the default
swizzled_input_sf=True. This covers the FP4 allgather/alltoall use case
where input_sf is in linear layout after communication.

Co-authored-by: Kesen Li <lsam@nvidia.com>
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 92d0912 to 267869a Compare March 27, 2026 10:13
@claude

claude Bot commented Mar 27, 2026

Code Review

This PR cleanly exposes swizzled_input_sf to support post-allgather/alltoall FP4 scenarios where scaling factors arrive in linear layout. The change is well-motivated and the default (True) correctly preserves backward compatibility. A few things to address:


1. Missing docstring entry for swizzled_input_sf

The parameter is added to the public cutlass_fused_moe() signature at line 602 but is not documented in the docstring. Every other parameter in that function is documented. This should be added between the swiglu_limit and tp_size entries:

swizzled_input_sf : bool, optional
    Whether the input scaling factor tensor (``input_sf``) is in swizzled layout.
    Set to ``False`` when ``input_sf`` is received in linear (row-major) layout,
    e.g. after FP4 allgather/alltoall operations. Defaults to ``True`` to maintain
    backward compatibility.

2. Bit-exact test assertion may be fragile

The test asserts:

torch.testing.assert_close(output_swizzled, output_linear, rtol=0, atol=0)

Bit-exact equality (rtol=0, atol=0) between two different execution paths is a strong claim. If both paths go through identical arithmetic after swizzle handling, this is correct. But if the kernel has any path divergence (e.g., different memory access patterns causing different rounding) this could produce spurious CI failures on some hardware.

Consider using a small tolerance (rtol=1e-5, atol=0) or leaving a comment explaining why bit-exact output is expected here (e.g., "swizzle is purely a memory reordering with no arithmetic impact, so outputs must be identical").
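For reference, the closeness criterion behind `torch.testing.assert_close` is `|actual - expected| <= atol + rtol * |expected|`, which degenerates to exact equality at `rtol=0, atol=0`. A plain-Python sketch of that criterion (not the torch implementation) makes the trade-off concrete:

```python
def close(actual: float, expected: float, rtol: float = 0.0, atol: float = 0.0) -> bool:
    # Standard closeness criterion: |actual - expected| <= atol + rtol * |expected|.
    # With rtol=0 and atol=0 this is exact equality; any rounding drift fails.
    return abs(actual - expected) <= atol + rtol * abs(expected)

assert close(1.0, 1.0)                    # bit-exact output passes
assert not close(1.0, 1.0 + 1e-6)        # rtol=0, atol=0 rejects any drift
assert close(1.0, 1.0 + 1e-6, rtol=1e-5)  # a small rtol absorbs rounding noise
```

Under this formula, a pure memory-reordering path should satisfy the zero-tolerance check, while any divergent rounding in the kernel would only pass the relaxed one.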


3. swizzled_input_sf is silently ignored for non-FP4 inputs

When input_sf is None (e.g., FP16/BF16 inputs), swizzled_input_sf is passed to the kernel runner but has no effect. This is functionally correct, but passing swizzled_input_sf=False in that case is misleading and could confuse callers. A lightweight guard or at least a docstring note would help:

# In cutlass_fused_moe():
if swizzled_input_sf is False and input_sf is None:
    # swizzled_input_sf has no effect when input_sf is not provided
    pass

Or just document it: "This parameter has no effect when input_sf is None."
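A lightweight version of such a guard could look like the following sketch; `check_input_sf_args` is a hypothetical helper for illustration, not part of the flashinfer API:

```python
import warnings

def check_input_sf_args(input_sf, swizzled_input_sf: bool) -> None:
    # Hypothetical guard: the layout flag is meaningless without a tensor,
    # so surface a warning instead of silently ignoring the argument.
    if input_sf is None and not swizzled_input_sf:
        warnings.warn(
            "swizzled_input_sf=False has no effect when input_sf is None",
            stacklevel=2,
        )
```

Calling it with `(None, False)` emits a `UserWarning`; all other combinations are silent, so existing callers are unaffected.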


4. Test only covers standard mode, not min_latency_mode

The swizzled_input_sf parameter flows into both run_moe and run_moe_min_latency in the C++ binding, but the new test only exercises the standard path. A comment noting this gap, or a separate test for min_latency_mode=True, would improve coverage confidence (this is less critical since min_latency_mode raises NotImplementedError for Blackwell currently).


Minor: Consistent skip pattern

The new test uses the same torch.cuda.get_device_capability()[0] not in [10, 11, 12] skip condition as existing NVFP4 tests (e.g. line 481) — good consistency.


Overall the change is correct and well-structured. The C++ parameter threading through both runMoe/runMoeMinLatency is clean, and removing the hardcoded swizzled_input_sf = true HACK comment is the right move. The docstring gap and test assertion strictness are the main things to address before merge.

@claude

claude Bot commented Mar 27, 2026

Code Review

Overall this is a clean, well-scoped change. The parameter threads correctly from Python through TVM-FFI lambdas to the C++ kernel. A few observations:

C++ binding: local-variable shadowing removal

In runMoe, the removed local bool const swizzled_input_sf = true was shadowing the new parameter of the same name. Removing it makes the parameter flow through silently - correct, but subtle. The same situation is handled differently in runMoeMinLatency (where the local was named swizzled_input_sf_ml, requiring an explicit substitution at the call site). A brief comment in runMoe noting that swizzled_input_sf now comes from the function parameter would improve clarity.

_fake_cutlass_fused_moe receives but never uses the parameter

The shape-inference stub accepts swizzled_input_sf: bool = True but does not use it (layout flags do not affect output shape, which is expected). A short comment explaining this intentional omission would prevent reviewers from wondering if it was accidentally dropped.

Docstring caveat with no runtime guard

The docstring correctly notes the parameter is Only relevant when input_sf is not None, but there is no assert or warning for the case where input_sf is None and swizzled_input_sf is False. Silent no-ops are fine here, but a matching note in the C++ binding would help downstream callers debug misconfiguration.

Test: zero-tolerance exact-match assertion

The test uses rtol=0, atol=0 (bitwise exact). This is appropriate since both paths process identical quantized values. However, if the underlying TRT-LLM kernel ever introduces non-determinism (parallel reductions), this could become flaky. Using a small tolerance matching the rest of the test file (most tests use rtol=1e-2, atol=1e-1) would be more robust.

Minor: pre-existing docstring typo on touched lines

fc2_expert_biases is documented as GEMM1 biases for each expert - should be GEMM2. Worth fixing while the docstring is already being updated.

Summary

Backward-compatible default, correct parameter threading, and a clear test scenario for the allgather/alltoall use case. The changes are minimal and well-targeted. The main actionable items are: (1) add a comment in runMoe clarifying the shadowing removal, (2) annotate the unused parameter in the fake stub, (3) optionally relax the test tolerance.

Inserting a new positional parameter between existing ones breaks backward
compatibility for callers that use positional arguments. Moving
swizzled_input_sf to the end (after activation_type) ensures no existing
call sites are affected.

The internal @register_custom_op function signature is unchanged since it is
not part of the public API and its positional order must match the TVM-FFI
lambda.
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 5b4dcf3 to 3e04aa9 Compare March 27, 2026 10:20
@claude

claude Bot commented Mar 27, 2026

Code Review

Summary: This PR exposes swizzled_input_sf to allow callers to indicate whether the input scaling factor tensor is in swizzled or linear layout — enabling fusion of the de-swizzle step into the MOE kernel for FP4 allgather/alltoall scenarios. Overall this is a clean, well-scoped change.


What's done well

  • Backward compatible default: swizzled_input_sf=True matches the previous hardcoded behaviour, so existing callers are unaffected.
  • Doc string: The new parameter's documentation clearly explains the motivation (post-allgather linear layout) and when to use each value.
  • Test: test_moe_nvfp4_unswizzled_input_sf directly validates the equivalence of both paths and uses rtol=0, atol=0 since the two paths should produce bit-identical results. The device-capability guard is correct.

Observations / suggestions

1. Parameter position inconsistency between the public API and the internal call-through

In the public cutlass_fused_moe() signature (line 615), swizzled_input_sf sits after activation_type, at the very end of the argument list. In the internal _fake_cutlass_fused_moe() / TVM dispatch (lines 544, 502), it sits after swiglu_limit, before tp_size. These two orderings make it easy to accidentally mismatch argument positions if someone uses positional arguments on the public API or calls the internal helper directly.

It would be slightly cleaner to keep the position consistent, but given that all call sites currently use keyword arguments this is low priority.

2. swizzled_input_sf is silently ignored when input_sf is None

The parameter only has meaning when input_sf is provided, but there is no guard or warning when a caller passes swizzled_input_sf=False with input_sf=None. This is harmless today, but a brief note in the docstring — "ignored when input_sf is None" — would avoid confusion.

3. min_latency_mode path not covered by the new test

The runMoeMinLantency path receives swizzled_input_sf too (lines 427–442), but the new test only exercises the regular runMoe path. A brief call with min_latency_mode=True (or a pytest.mark.parametrize over both modes) would give confidence that the wiring is correct end-to-end for both code paths.

4. Minor nit: lambda in test body

round_up = lambda x, y: (x + y - 1) // y * y

pre-commit / ruff will flag a lambda assigned to a variable (E731). A def or an inline expression would be cleaner, matching the style of the rest of the test file.
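The E731-compliant replacement is a one-line `def`. A sketch, assuming a `ceil_div` helper like the one the review says already exists in the test module:

```python
def ceil_div(x: int, y: int) -> int:
    # Integer ceiling division: smallest n with n * y >= x.
    return (x + y - 1) // y

def round_up(x: int, y: int) -> int:
    # Round x up to the next multiple of y.
    return ceil_div(x, y) * y
```

Same behavior as the lambda, but named functions get proper tracebacks and satisfy Ruff E731.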


Verdict

The change is correct, well-motivated, and backward compatible. The suggestions above are mostly minor polish items. Happy to approve once the min-latency test coverage gap is addressed (or confirmed to be intentionally deferred).

… test

Both paths compute equivalent values; allow for minor floating-point
non-determinism while still catching real divergence (well below BF16
resolution of ~0.008).
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/moe/test_trtllm_cutlass_fused_moe.py (1)

1869-1879: Assert that the SF buffers actually differ before comparing outputs.

hidden_states_* are expected to match regardless of layout. Without a precondition on input_sf_swizzled vs input_sf_linear, this can still pass if the quantizer regresses and returns the same SF layout for both branches, so the new flag is never really exercised.

Suggested hardening
     # Both quantizations should produce the same quantized values
     assert torch.equal(hidden_states_swizzled, hidden_states_linear)
+    assert input_sf_swizzled.shape != input_sf_linear.shape or not torch.equal(
+        input_sf_swizzled, input_sf_linear
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_trtllm_cutlass_fused_moe.py` around lines 1869 - 1879, Before
asserting outputs equal, add a precondition check that the scale-factor buffers
differ: call fp4_quantize to obtain input_sf_swizzled and input_sf_linear and
assert they are not equal (e.g., torch.any(input_sf_swizzled !=
input_sf_linear)) to ensure the swizzled vs linear branch was exercised; only
then assert torch.equal(hidden_states_swizzled, hidden_states_linear). This
ensures fp4_quantize's is_sf_swizzled_layout flag actually changes SF layout
before comparing hidden_states_swizzled and hidden_states_linear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/moe/test_trtllm_cutlass_fused_moe.py`:
- Around line 1799-1802: Replace the direct torch.cuda.get_device_capability()
check in the pytest.skipif decorator with the repo helper that centralizes GPU
capability logic: import and use is_sm100a_supported() (or
get_compute_capability()/is_sm90a_supported() if more appropriate) instead of
torch.cuda.get_device_capability()[0] not in [10, 11, 12]; update the decorator
to pytest.mark.skipif(not is_sm100a_supported(), reason=...) and add the import
for is_sm100a_supported() at the top of the test module so the skip logic stays
consistent with the rest of the CUDA test suite.
- Line 1819: Replace the lambda assignment to round_up with a regular function
that calls the existing ceil_div helper: implement round_up as a normal def (not
a lambda) that returns the result of ceil_div(x, y) multiplied by y; reference
the existing ceil_div function in the module and remove the lambda assignment to
fix the Ruff E731 violation.

---

Nitpick comments:
In `@tests/moe/test_trtllm_cutlass_fused_moe.py`:
- Around line 1869-1879: Before asserting outputs equal, add a precondition
check that the scale-factor buffers differ: call fp4_quantize to obtain
input_sf_swizzled and input_sf_linear and assert they are not equal (e.g.,
torch.any(input_sf_swizzled != input_sf_linear)) to ensure the swizzled vs
linear branch was exercised; only then assert
torch.equal(hidden_states_swizzled, hidden_states_linear). This ensures
fp4_quantize's is_sf_swizzled_layout flag actually changes SF layout before
comparing hidden_states_swizzled and hidden_states_linear.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 621b187e-d44d-472d-8f45-baa2c317e257

📥 Commits

Reviewing files that changed from the base of the PR and between ff12b88 and 267869a.

📒 Files selected for processing (3)
  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
  • flashinfer/fused_moe/core.py
  • tests/moe/test_trtllm_cutlass_fused_moe.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • flashinfer/fused_moe/core.py

Comment thread tests/moe/test_trtllm_cutlass_fused_moe.py
Comment thread tests/moe/test_trtllm_cutlass_fused_moe.py Outdated
@claude

claude Bot commented Mar 27, 2026

test comment - please ignore

@claude

claude Bot commented Mar 27, 2026

test edit

@claude

claude Bot commented Mar 27, 2026

Code Review

Critical Bug: Positional Argument Mismatch in Outer Call Site

File: flashinfer/fused_moe/core.py, lines 786-817

The inner cutlass_fused_moe custom op (line 373) had swizzled_input_sf: bool = True inserted at positional slot 14 (between swiglu_limit and tp_size), but the outer public cutlass_fused_moe call site (lines 786-817) was NOT updated to match. It still passes tp_size, tp_rank, ep_size, ep_rank, cluster_size, cluster_rank positionally after swiglu_limit, then passes swizzled_input_sf as a keyword argument. Every positional argument after swiglu_limit therefore shifts by one slot (tp_size lands in swizzled_input_sf, tp_rank in tp_size, and so on through cluster_rank), and the explicit swizzled_input_sf=swizzled_input_sf keyword then collides with the positionally bound value, causing TypeError: got multiple values for argument on any call path reaching this code.

The fix is to insert swizzled_input_sf positionally between swiglu_limit and tp_size at line ~800, or switch tp_size through cluster_rank to keyword arguments.
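This failure mode is easy to reproduce standalone; the names below are illustrative, not the flashinfer signatures:

```python
def inner_op(a, b, new_flag=True, c=0):
    # Stand-in for the custom op after a new parameter was inserted
    # between existing positional slots.
    return (a, b, new_flag, c)

# A caller written against the old signature still passes c positionally
# and the new flag by keyword; both now target the same slot:
try:
    inner_op(1, 2, 3, new_flag=False)  # positional 3 binds to new_flag first
except TypeError as e:
    assert "multiple values" in str(e)

# Passing the trailing arguments by keyword sidesteps the collision:
assert inner_op(1, 2, c=3, new_flag=False) == (1, 2, False, 3)
```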

Why the New Test Does Not Catch This

test_moe_nvfp4_unswizzled_input_sf calls the public outer API and would hit this bug. However, it is guarded by an SM100/110/120 skip condition, so it will not run in CI environments without those GPUs. The skip guard is correct, but it means the bug is not exercised in broader CI.

Other Observations (Non-blocking)

C++ binding changes look correct. swizzled_input_sf is consistently threaded through runMoe, runMoeMinLatency, and both TVM GetFunction lambdas (run_moe and run_moe_min_latency). Removing the internal bool const swizzled_input_sf = true hardcodes is the right approach.

Backward compatibility is properly handled. Default value True in all signatures ensures no existing callers are broken.

Test design is sound. Using rtol=0, atol=0 is appropriate - swizzling is purely a memory layout transformation, so the kernel output should be bit-exact regardless of which layout is used. The assertion torch.equal(hidden_states_swizzled, hidden_states_linear) verifying the quantized values are identical is a good sanity check.

Minor nit: The round_up lambda in the test is used 6 times; a named function would be slightly more readable, but this is low priority for a test file.

- Replace torch.cuda.get_device_capability() check with is_sm100a_supported()
  helper for consistent GPU capability detection across the test suite
- Replace round_up lambda with def to fix Ruff E731
- Add assertion that input_sf buffers differ between swizzled/linear layouts
  so the test cannot pass trivially if fp4_quantize ignores is_sf_swizzled_layout
@claude

claude Bot commented Mar 27, 2026

Code Review

This PR cleanly exposes swizzled_input_sf as a user-controllable parameter instead of a hardcoded true, enabling fusion of the swizzle operation into the MoE kernel after FP4 allgather/alltoall. The implementation is sound overall. A few observations:

Correctness

Parameter threading is correct. The boolean propagates through all four call sites:

  • runMoe (non-min-latency)
  • runMoeMinLantency (min-latency)
  • GetFunction("run_moe") lambda
  • GetFunction("run_moe_min_latency") lambda

The trick of reusing the same variable name (swizzled_input_sf) in runMoe is subtle but correct — the diff only shows removal of the hardcoded assignment; the downstream mKernelRunner->runMoe(..., swizzled_input_sf, ...) call is unchanged because the variable name is identical. Worth a comment noting this was intentional.

input_sf=None with swizzled_input_sf=False is handled correctly — the C++ side nulls out the pointer, making the flag irrelevant.

API Design

Parameter placement inconsistency: swizzled_input_sf appears in different positions between the inner custom op (core.py:388, between swiglu_limit and tp_size) and the public API (core.py:618, at the end). This isn't a bug (the call site at line 816 uses swizzled_input_sf=swizzled_input_sf), but it's worth noting this divergence for future maintainers who may be confused when comparing the two signatures.

Backward compatibility is maintained — default True preserves existing behavior.

Test

The test structure is solid:

# Guard against trivial pass
assert not torch.equal(input_sf_swizzled, input_sf_linear), ...

This sanity check is important — good practice.

Concern: test tolerances. rtol=1e-3, atol=1e-3 for FP4 quantized output could be tight for some inputs. FP4 has limited precision and non-linear quantization errors can accumulate across tokens. Consider using slightly looser tolerances (e.g. rtol=1e-2, atol=1e-2) that other NVFP4 tests in this file use, or document why these tighter tolerances are appropriate here.

Missing coverage: The runMoeMinLantency code path is updated but not exercised by any test. The public API raises NotImplementedError for min-latency mode on SM100a (Blackwell), so this is unavoidable for the current test environment — but it'd be worth a comment explaining why min-latency coverage is absent.

Minor nit: quant_blocksize = 16 is defined but only used for sf_w1_k and sf_w2_n calculations. The 128 in sf_w1_2n = round_up(w1_n, 128) and sf_w2_k = round_up(k, 128) are hardcoded — are these the same blocksize concept (just the other dimension), or different? A brief comment would clarify.

Pre-existing Issue (not this PR)

fc2_expert_biases docstring (core.py:670) says "GEMM1 biases" — typo, should be "GEMM2 biases". Out of scope here, but worth fixing in a follow-up.


Overall: the change is clean, the default maintains backward compatibility, and the approach is correct. The main asks are around test tolerances and coverage of the min-latency path in the docstring/comment.

@samuellees
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !465 has been created, and the CI pipeline #47094985 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47094985: 6/20 passed

…_moe signature

The inner cutlass_fused_moe (registered via torch.library.custom_op) has
swizzled_input_sf at position 15 (after swiglu_limit), but the outer wrapper
was passing tp_size at that position and then also passing swizzled_input_sf
as a keyword argument, causing:

  TypeError: cutlass_fused_moe() got multiple values for argument 'swizzled_input_sf'

Fix: insert swizzled_input_sf into the positional argument list between
swiglu_limit and tp_size, and remove the duplicate keyword argument.

AI-assisted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@samuellees
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !465 has been updated with latest changes, and the CI pipeline #47128436 is currently running. I'll report back once the pipeline job completes.

@claude

claude Bot commented Mar 28, 2026

Code Review

Overall this is a clean, focused change that exposes a previously hardcoded parameter. Backward-compatible and well-documented. A few observations below.

Correctness

Both execution paths are covered. swizzled_input_sf is correctly threaded through both runMoe and runMoeMinLantency, and both the regular and min-latency Python call sites pass the new parameter. No path was missed.

_fake_cutlass_fused_moe is updated. Important for torch.compile compatibility -- the fake kernel signature must stay in sync with the real one.

Minor Issues

Parameter position inconsistency (non-blocking). In the internal helpers (_fake_cutlass_fused_moe and the lower-level overload), swizzled_input_sf is inserted between swiglu_limit and tp_size, matching the C++ signature order. In the public cutlass_fused_moe it is placed after activation_type. Not wrong since all are kwargs, but slightly surprising when reading both definitions side-by-side.

Test import ordering. from flashinfer.utils import is_sm100a_supported is inserted before the from flashinfer import (...) block. Per isort convention all flashinfer imports should be grouped together.

Pre-existing typo (not introduced here). runMoeMinLantency should be runMoeMinLatency -- could be fixed in a follow-up.

Test Coverage

The test structure is solid: it runs both swizzled and linear layouts and asserts numerical closeness. The explicit sanity-check that the two SF buffers actually differ is a nice touch -- prevents the test passing trivially if fp4_quantize ignored is_sf_swizzled_layout.

Optional enhancement: add a min_latency_mode=True variant to also exercise the runMoeMinLantency path with swizzled_input_sf=False, since that is an independent C++ code path. Not a blocker.

Summary

The change is correct and well-tested. Only actionable items are the import ordering nit and optional min-latency coverage. LGTM with the minor import fix.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47128436: 12/20 passed

Collaborator

@jiahanc jiahanc left a comment

LGTM, thanks for the contribution!

@jiahanc jiahanc added the run-ci label Mar 30, 2026
@jiahanc jiahanc enabled auto-merge (squash) March 30, 2026 00:09
@jiahanc jiahanc merged commit 4941606 into main Mar 30, 2026
54 checks passed
@jiahanc jiahanc deleted the claude/issue-2200-20260111-0748 branch March 30, 2026 04:18

Development

Successfully merging this pull request may close these issues.

Feature Request: Support swizzled_input_sf for cutlass fused moe.

5 participants