
feat: expose swizzled_input_sf parameter for CUTLASS fused MOE#2330

Merged
jiahanc merged 7 commits into main from claude/issue-2200-20260111-0748 on Mar 30, 2026

Conversation

@yzh119
Collaborator

@yzh119 yzh119 commented Jan 11, 2026

Summary

  • Add swizzled_input_sf parameter to allow users to control whether the input scaling factor is swizzled
  • Enables fusion of the swizzle operation into the MOE kernel after FP4 allgather/alltoall operations
  • Default value is True to maintain backward compatibility

Closes #2200
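The two call conventions this parameter enables can be illustrated with a toy model (pure Python; `swizzle` and `toy_moe` are hypothetical stand-ins, not the flashinfer API): the consumer either receives a pre-swizzled scale-factor buffer, or receives the linear buffer plus a flag and fuses the reordering itself, and both paths must produce identical output.

```python
# Toy illustration of the swizzled_input_sf contract; all names here are
# hypothetical and unrelated to the real flashinfer/CUTLASS code.

def swizzle(sf, tile=4):
    """Reorder a flat scale-factor list into an interleaved layout --
    a stand-in for the hardware-friendly swizzled layout."""
    n = len(sf)
    return [sf[(i % tile) * (n // tile) + i // tile] for i in range(n)]

def toy_moe(sf, swizzled_input_sf=True):
    """Consume scale factors; if they arrive linear, fuse the swizzle."""
    if not swizzled_input_sf:
        sf = swizzle(sf)  # fused reordering inside the "kernel"
    # Position-dependent compute, so the layout actually matters.
    return [v * (i + 1) for i, v in enumerate(sf)]

linear_sf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Path A: caller pre-swizzles (the old behavior, default True).
out_a = toy_moe(swizzle(linear_sf), swizzled_input_sf=True)
# Path B: caller passes the linear layout (e.g. fresh off an allgather)
# and lets the kernel swizzle.
out_b = toy_moe(linear_sf, swizzled_input_sf=False)
assert out_a == out_b
```

Path B is the new opt-in behavior; path A remains the default, which is why the flag defaults to True.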

Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added explicit control over input swizzling for Mixture-of-Experts ops via a new boolean parameter (defaults to enabled).
  • API Changes

    • Public MoE operation signatures now accept the swizzling parameter and have adjusted argument ordering; wrappers and docs updated accordingly.
  • Tests

    • Added a CUDA-gated test validating FP4/NVFP4 behavior with swizzled vs linear input layouts.

@coderabbitai
Contributor

coderabbitai Bot commented Jan 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new swizzled_input_sf: bool parameter propagated from Python through TVM/C++ bindings into the CUDA fused MoE kernel; removes internal hard-coded swizzle defaults and adds a GPU test validating swizzled vs unswizzled input_sf handling.

Changes

Cohort / File(s) Summary
CUDA C++ Binding
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
Added bool swizzled_input_sf to FusedMoeRunner::runMoe and runMoeMinLatency signatures; removed internal default swizzle flags and forward caller-provided swizzled_input_sf into kernel runner calls and TVM GetFunction lambdas for "run_moe" / "run_moe_min_latency".
Python API
flashinfer/fused_moe/core.py
Added swizzled_input_sf: bool = True to cutlass_fused_moe, _fake_cutlass_fused_moe, and the exported flashinfer_api wrapper; threaded the parameter into the TVM/C++ invocation and updated docstring to document input_sf layout semantics.
Tests
tests/moe/test_trtllm_cutlass_fused_moe.py
Added test_moe_nvfp4_unswizzled_input_sf() (gated to SM100+) that quantizes with swizzled vs. linear input_sf, calls cutlass_fused_moe with the matching swizzled_input_sf flag, and asserts the outputs match.

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant PyAPI as Python API\n(cutlass_fused_moe)
    participant TVM as TVM Binding\n(GetFunction)
    participant Cpp as C++ Runner\n(FusedMoeRunner)
    participant Kernel as CUDA Kernel

    User->>PyAPI: call cutlass_fused_moe(..., swizzled_input_sf)
    PyAPI->>TVM: forward args including swizzled_input_sf
    TVM->>Cpp: invoke lambda with swizzled_input_sf
    Cpp->>Kernel: launch kernel using swizzled_input_sf
    Kernel-->>Cpp: kernel completes
    Cpp-->>TVM: return results
    TVM-->>PyAPI: return outputs
    PyAPI-->>User: return outputs

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run-ci

Suggested reviewers

  • djmmoss
  • cyx-6
  • aleozlx
  • wenscarl
  • jiahanc
  • jimmyzho
  • nv-yunzheq
  • IwakuraRein

Poem

🐇 I tunneled a flag from Py to C,

swizzled or straight — now both roam free.
From quant to kernel my whiskers trace,
matching layouts, outputs in place.
A nibble of code, a carrot of grace.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 30.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check ✅ — The title accurately summarizes the main feature addition: exposing a swizzled_input_sf parameter for CUTLASS fused MOE.
  • Description check ✅ — The description covers the summary, related issues, and the template checklist items, though some checklist items lack explicit confirmation.
  • Linked Issues check ✅ — The implementation fully addresses issue #2200 by exposing the swizzled_input_sf parameter across the C++ binding, Python API, and tests, with a backward-compatible default of True.
  • Out of Scope Changes check ✅ — All changes are scoped to the swizzled_input_sf feature; no unrelated modifications detected.


@gemini-code-assist
Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the CUTLASS fused Mixture-of-Experts (MOE) implementation by exposing a new configuration parameter, swizzled_input_sf. This parameter provides users with explicit control over the swizzling behavior of the input scaling factor, which is crucial for optimizing performance by enabling the fusion of the swizzle operation directly into the MOE kernel. The default setting maintains existing behavior, ensuring a smooth transition for current users while offering new optimization opportunities.

Highlights

  • New Parameter Introduction: A new boolean parameter, swizzled_input_sf, has been introduced to the CUTLASS fused MOE kernel, allowing explicit control over whether the input scaling factor is swizzled.
  • Performance Optimization: This parameter enables the fusion of the swizzle operation directly into the MOE kernel, specifically after FP4 allgather/alltoall operations, which can lead to performance improvements.
  • Backward Compatibility: The swizzled_input_sf parameter defaults to True to ensure backward compatibility with existing implementations and workflows.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (1)

710-931: Critical: swizzled_input_sf parameter missing from public API function.

The new swizzled_input_sf parameter was added to the inner cutlass_fused_moe function (line 510) and passed to run_moe (line 627), but it's missing from:

  1. The public cutlass_fused_moe function signature (lines 710-740)
  2. The call to the inner function (lines 901-931)

This means users cannot actually control the swizzle behavior through the public API, defeating the purpose of this PR.

🐛 Proposed fix

Add the parameter to the public function signature (after swiglu_limit):

     swiglu_alpha: Optional[torch.Tensor] = None,
     swiglu_beta: Optional[torch.Tensor] = None,
     swiglu_limit: Optional[torch.Tensor] = None,
+    swizzled_input_sf: bool = True,
     tp_size: int = 1,

And pass it in the call (around line 916):

         swiglu_alpha,
         swiglu_beta,
         swiglu_limit,
+        swizzled_input_sf,
         tp_size,

Also add documentation for the parameter in the docstring.

csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (1)

420-433: Pipeline failure: clang-format check failed.

The CI indicates a formatting issue around line 424. Based on the code structure, ensure consistent formatting of the function signature.

Run clang-format on this file to fix the formatting:

clang-format -i csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2062dec and 1a50f5a.

📒 Files selected for processing (2)
  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
  • flashinfer/fused_moe/core.py
🧰 Additional context used
📓 Path-based instructions (2)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/fused_moe/core.py
csrc/**/*.cu

📄 CodeRabbit inference engine (CLAUDE.md)

Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers

Files:

  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
🧬 Code graph analysis (1)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (2)
flashinfer/comm/mapping.py (1)
  • tp_rank (325-326)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1)
  • enable_pdl (220-220)
🪛 GitHub Actions: pre-commit
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu

[error] 424-424: clang-format check failed. Files were modified by this hook during pre-commit.

🪛 Ruff (0.14.10)
flashinfer/fused_moe/core.py

669-669: Unused function argument: swizzled_input_sf

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Deploy Docs
  • GitHub Check: claude-review
🔇 Additional comments (5)
flashinfer/fused_moe/core.py (2)

653-701: The swizzled_input_sf parameter is correctly added to the fake op.

The static analysis warning about the unused parameter is expected for fake ops, as they only define output shapes and dtypes without executing actual logic.


510-510: Parameter correctly threaded through inner function to runtime call.

The swizzled_input_sf parameter is properly added with default True for backward compatibility, and correctly passed to the underlying run_moe call.

Also applies to: 627-627

csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (3)

246-250: Parameter correctly integrated into runMoe method.

The swizzled_input_sf parameter is properly:

  1. Added to the function signature after swiglu_limit
  2. Passed through to mKernelRunner->runMoe in both USING_OSS_CUTLASS_MOE_GEMM and non-OSS paths

Also applies to: 385-400


426-433: Parameter correctly integrated into runMoeMinLantency method.

The swizzled_input_sf is properly threaded through the min-latency execution path in both OSS and non-OSS code paths.

Also applies to: 569-602


718-758: TVM-FFI bindings correctly updated.

Both run_moe and run_moe_min_latency function bindings properly include the new swizzled_input_sf parameter and pass it through to the respective runner methods. This aligns with the coding guidelines for framework bindings in csrc/.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request exposes a new swizzled_input_sf parameter for the CUTLASS fused MoE implementation, allowing control over whether the input scaling factor is swizzled. The changes are well-implemented, consistently propagating the new parameter from the Python API down to the C++ kernel calls. The default value is set to True to maintain backward compatibility, as described. The code looks good, but I have one suggestion to improve documentation by adding the new parameter to the function's docstring.

swiglu_alpha: Optional[torch.Tensor] = None,
swiglu_beta: Optional[torch.Tensor] = None,
swiglu_limit: Optional[torch.Tensor] = None,
swizzled_input_sf: bool = True,

medium

The docstring for cutlass_fused_moe (starting on line 741) is missing documentation for the new swizzled_input_sf parameter. Please add it for completeness and to inform users about this new option.

@claude

claude Bot commented Jan 11, 2026

Code Review

I've reviewed PR #2330 and overall the implementation looks good. The change cleanly exposes the swizzled_input_sf parameter that was previously hardcoded, enabling the fusion optimization mentioned in issue #2200.

✅ Strengths

  1. Backward Compatibility: The default value of True maintains existing behavior, preventing any breaking changes for current users.

  2. Clean Implementation: The parameter is properly threaded through all layers:

    • Python API (flashinfer/fused_moe/core.py)
    • TVM-FFI bindings (csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu)
    • Both run_moe and run_moe_min_latency code paths
  3. Consistent Placement: The parameter is placed logically in the parameter list, after swiglu_limit and before the parallelism parameters (tp_size, etc.).

  4. Proper Scope: The change removes the hardcoded swizzled_input_sf = true values in the C++ binding layer, as they should be controlled by the user.

🔍 Potential Issues & Suggestions

1. Missing Test Coverage (High Priority)

The PR adds a new user-facing parameter but doesn't include any tests exercising it. This is risky because:

  • We can't verify the parameter actually works as intended
  • Future refactoring might break this functionality without detection
  • The issue mentions a specific use case (FP4 allgather/alltoall), which should be validated

Recommendation: Add at least one test case that calls cutlass_fused_moe with swizzled_input_sf=False. The existing test_moe_nvfp4 test in tests/moe/test_trtllm_cutlass_fused_moe.py (lines 484-632) would be a good template since it already uses input_sf.

Example:

@pytest.mark.parametrize("swizzled_input_sf", [True, False])
def test_moe_nvfp4_swizzled_input_sf(batch_size, ..., swizzled_input_sf):
    # ... existing setup ...
    _ = fused_moe.cutlass_fused_moe(
        hidden_states,
        selected_experts.to(torch.int),
        routing_weights,
        w1_q.contiguous().view(torch.long),
        w2_q.contiguous().view(torch.long),
        otype,
        quant_scales=quant_scales,
        input_sf=input_sf,
        swizzled_input_sf=swizzled_input_sf,  # Test the new parameter
        output=flash_output,
        activation_type=activation_type,
    )

2. Missing Documentation (Medium Priority)

The parameter lacks a docstring explaining:

  • What "swizzled" means in this context
  • When to set it to False vs True
  • The performance implications

Recommendation: Add parameter documentation to the cutlass_fused_moe function (around line 495 in core.py). Reference the TensorRT-LLM PR (#6231) for technical details if needed.

Example docstring addition:

def cutlass_fused_moe(
    ...
    swizzled_input_sf: bool = True,  # Whether the input scaling factor is swizzled
    ...
):
    """
    ...
    Args:
        ...
        swizzled_input_sf: Whether the input scaling factor (input_sf) is already 
            swizzled. Set to False when the swizzle operation should be fused into 
            the MOE kernel (e.g., after FP4 allgather/alltoall operations). 
            Default: True (maintains backward compatibility).
        ...
    """

3. Fake Op Signature (Low Priority)

The _fake_cutlass_fused_moe function at line 654 correctly includes the new parameter, maintaining consistency with the real op. Good job on this detail!

🔒 Security & Performance

  • No security concerns: This is a boolean configuration parameter with no injection risks
  • No performance regression: Default behavior unchanged; new behavior is opt-in

📋 Code Quality

  • Formatting: Clean, follows project conventions
  • Consistency: Parameter properly added to all necessary locations
  • Type Safety: Correct type annotation (bool)

Summary

This is a solid implementation that enables an important performance optimization. The main gap is test coverage—adding even a basic test would significantly increase confidence in this change.

Recommendation: Request the author to add test coverage before merging. Otherwise, the implementation is ready.


Review generated with assistance from Claude Code

@samuellees samuellees self-assigned this Mar 27, 2026
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 1a50f5a to ff12b88 on March 27, 2026 09:55
Add swizzled_input_sf parameter to allow users to control whether
the input scaling factor is swizzled. This enables fusion of the
swizzle operation into the MOE kernel after FP4 allgather/alltoall
operations.
Changes:
- Add swizzled_input_sf parameter to cutlass_fused_moe Python API
  with default value of True (maintaining backward compatibility)
- Update C++ TVM-FFI bindings to accept and pass through the parameter
- Remove hardcoded swizzled_input_sf = true in binding layer
Closes #2200
Co-authored-by: Zihao Ye <yzh119@users.noreply.github.com>
Co-authored-by: Kesen Li <lsam@nvidia.com>
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from ff12b88 to 50dc605 on March 27, 2026 09:58
@claude

claude Bot commented Mar 27, 2026

Code Review

Summary: Clean, focused change that correctly promotes swizzled_input_sf from a hardcoded internal default to an explicit caller-controlled parameter. The threading from Python API through C++ bindings is consistent and correct.

Issues

  1. Missing docstring for the new parameter

The public cutlass_fused_moe docstring (core.py around line 672) documents input_sf, swiglu_alpha, etc. but has no entry for swizzled_input_sf. Since this is a user-visible parameter with non-obvious semantics, it deserves documentation covering: what swizzling means, when to pass False (e.g. after FP4 allgather/alltoall), and that it only matters when input_sf is provided.

  2. No validation that swizzled_input_sf=False is meaningful

A caller can pass swizzled_input_sf=False with a non-quantized input where input_sf=None. This is likely a silent misconfiguration. Consider adding a guard analogous to the existing checks around line 497-500 in the binding:

TVM_FFI_ICHECK(swizzled_input_sf || input_sf.has_value())
    << "swizzled_input_sf=false is only meaningful when input_sf is provided";
  3. No test coverage for swizzled_input_sf=False

The change introduces a new code path but no tests exercise it. A test calling with swizzled_input_sf=False on a pre-swizzled input_sf would validate the new use case and guard against regressions.

Minor observations

  • The C++ variable rename from swizzled_input_sf_ml to swizzled_input_sf in runMoeMinLatency is a good cleanup.
  • _fake_cutlass_fused_moe (torch.compile abstract impl) correctly accepts but ignores the parameter -- right pattern for fake impls.
  • Pre-existing: docstring at line 670 says "GEMM1 biases" for fc2_expert_biases (copy-paste typo, unrelated to this PR).

Overall the change is correct and the default of True properly maintains backward compatibility. Main gaps are documentation and a test for the False path.

The original commit only added the parameter to the internal
@register_custom_op function and fake op, but missed the public
cutlass_fused_moe() function signature and its call to the internal
function. This caused the parameter to be silently ignored.

Co-authored-by: Zihao Ye <yzh119@users.noreply.github.com>
Co-authored-by: Kesen Li <lsam@nvidia.com>
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 50dc605 to 2bc2a26 on March 27, 2026 10:05
@claude

claude Bot commented Mar 27, 2026

Code Review

This is a clean, focused change that exposes swizzled_input_sf to callers rather than hard-coding it to true. Overall it looks correct. A few observations:

Correctness

Argument ordering is consistent across both paths. In the Python dispatcher (core.py:491-518), swizzled_input_sf is placed after swiglu_limit and before *min_latency_output, which maps correctly to both the run_moe and run_moe_min_latency C++ signatures (the min-latency tensors only appear when *min_latency_output expands). Good.

The #else branch (flashinfer_cutlass_fused_moe_binding.cu:418) hardcodes enable_alltoall as false -- pre-existing issue, not introduced here. In the non-USING_OSS_CUTLASS_MOE_GEMM build, alltoall is always disabled regardless of what the caller requests. Consider adding a TODO comment near this site so future maintainers are aware.

Documentation gap

swizzled_input_sf is added to the public cutlass_fused_moe signature (line 602) and the inner registered op (line 388), but it is missing from the docstring (lines 666-746) which documents every other parameter. Please add an entry explaining what swizzled vs linear layout means for the input scaling factor and when callers should pass False (e.g., post-allgather/alltoall FP4 scenario described in the PR summary).

No input validation

There is no guard that swizzled_input_sf=False is only meaningful when input_sf is provided. Passing swizzled_input_sf=False with a non-FP4 dtype propagates silently to the kernel. A small defensive check would help catch misuse:

if not swizzled_input_sf and input_sf is None:
    raise ValueError("swizzled_input_sf=False requires input_sf to be provided")

Test coverage

The new test test_moe_nvfp4_unswizzled_input_sf is well-structured. Using rtol=0, atol=0 is the right call since swizzling is a deterministic layout transformation.

One gap: the test only exercises the Blackwell (SM100/110/120) code path. The SM90 path through FusedMoeRunner::runMoe (non-min-latency) is not exercised with swizzled_input_sf=False. A test covering SM90 would increase confidence if that hardware is available in CI.

Minor nit

The existing typo runMoeMinLantency ("Lantency" vs "Latency") in the binding file is pre-existing and unrelated to this PR.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (2)

588-619: ⚠️ Potential issue | 🟠 Major

Don’t insert the new public flag ahead of existing positional args.

swizzled_input_sf=True does not preserve backward compatibility here: existing positional callers will silently shift tp_size, tp_rank, ep_size, and the rest into the wrong slots. Please append the new parameter at the end of the public API surface and forward it by keyword; while you’re here, the docstring below should explain what False means.

Suggested fix
-    swizzled_input_sf: bool = True,
     tp_size: int = 1,
     tp_rank: int = 0,
     ep_size: int = 1,
     ep_rank: int = 0,
     cluster_size: int = 1,
@@
     use_packed_weights: bool = False,
     tune_max_num_tokens: int = 8192,
     enable_pdl: Optional[bool] = None,
     activation_type: ActivationType = ActivationType.Swiglu,
+    swizzled_input_sf: bool = True,
 ) -> torch.Tensor:
@@
-        swizzled_input_sf,
         tp_size,
         tp_rank,
         ep_size,
         ep_rank,
         cluster_size,
         cluster_rank,
+        swizzled_input_sf=swizzled_input_sf,
         use_packed_weights=use_packed_weights,

Also applies to: 780-811

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/core.py` around lines 588 - 619, The new public boolean
parameter swizzled_input_sf was inserted before existing positional arguments in
cutlass_fused_moe which breaks backward compatibility; move swizzled_input_sf to
the end of the parameter list (after activation_type) so existing positional
callers keep the same argument mapping, update the function docstring to
describe what swizzled_input_sf=False means, and ensure any internal calls or
forwards pass it by keyword (swizzled_input_sf=...) rather than position; make
the same relocation and docstring/forwarding change for the other
overloaded/duplicate function occurrence mentioned in the review (the second
cutlass_fused_moe-like signature around the later block).

531-562: ⚠️ Potential issue | 🟠 Major

Add activation_type to _fake_cutlass_fused_moe.

The real custom op still accepts activation_type, and the wrapper forwards it at Line 810. Leaving it out here makes the fake schema diverge from the real one, which can break torch.compile / fake-tensor execution.

Suggested fix
-        enable_pdl: Optional[bool] = None,
-        use_packed_weights: bool = False,
+        enable_pdl: Optional[bool] = None,
+        activation_type: ActivationType = ActivationType.Swiglu,
+        use_packed_weights: bool = False,

Based on learnings: fake ops decorated with register_fake_op in flashinfer/fused_moe must exactly mirror the corresponding real op signatures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/core.py` around lines 531 - 562, The fake op function
_fake_cutlass_fused_moe is missing the activation_type parameter, causing its
signature to diverge from the real custom op and breaking
torch.compile/fake-tensor paths; add an activation_type: Optional[int] (or the
same type used by the real op) parameter to _fake_cutlass_fused_moe and ensure
it is accepted and forwarded just like the wrapper does (the wrapper already
forwards activation_type), so the fake op signature exactly mirrors the real op.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu`:
- Around line 430-431: The function signature/call formatting around the
parameters Optional<TensorView> swiglu_limit, bool swizzled_input_sf, TensorView
num_active_experts_per_node, and the similar block at the other occurrence must
be normalized by running clang-format on the file so pre-commit stops rewriting
it; run clang-format with the project's config (or run the repo's pre-commit
hook) to reflow the parameter lists and call sites, save and stage the updated
file so CI no longer flags the formatting differences.

---

Outside diff comments:
In `@flashinfer/fused_moe/core.py`:
- Around line 588-619: The new public boolean parameter swizzled_input_sf was
inserted before existing positional arguments in cutlass_fused_moe which breaks
backward compatibility; move swizzled_input_sf to the end of the parameter list
(after activation_type) so existing positional callers keep the same argument
mapping, update the function docstring to describe what swizzled_input_sf=False
means, and ensure any internal calls or forwards pass it by keyword
(swizzled_input_sf=...) rather than position; make the same relocation and
docstring/forwarding change for the other overloaded/duplicate function
occurrence mentioned in the review (the second cutlass_fused_moe-like signature
around the later block).
- Around line 531-562: The fake op function _fake_cutlass_fused_moe is missing
the activation_type parameter, causing its signature to diverge from the real
custom op and breaking torch.compile/fake-tensor paths; add an activation_type:
Optional[int] (or the same type used by the real op) parameter to
_fake_cutlass_fused_moe and ensure it is accepted and forwarded just like the
wrapper does (the wrapper already forwards activation_type), so the fake op
signature exactly mirrors the real op.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: acd62a87-2bbf-4eab-9aa8-33ade59826cb

📥 Commits

Reviewing files that changed from the base of the PR and between 1a50f5a and ff12b88.

📒 Files selected for processing (2)
  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
  • flashinfer/fused_moe/core.py

@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 2bc2a26 to 92d0912 on March 27, 2026 10:12
Verifies that passing a linear-layout input_sf with swizzled_input_sf=False
produces identical output to a swizzled-layout input_sf with the default
swizzled_input_sf=True. This covers the FP4 allgather/alltoall use case
where input_sf is in linear layout after communication.

Co-authored-by: Kesen Li <lsam@nvidia.com>
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 92d0912 to 267869a Compare March 27, 2026 10:13
@claude

claude Bot commented Mar 27, 2026

Code Review

This PR cleanly exposes swizzled_input_sf to support post-allgather/alltoall FP4 scenarios where scaling factors arrive in linear layout. The change is well-motivated and the default (True) correctly preserves backward compatibility. A few things to address:


1. Missing docstring entry for swizzled_input_sf

The parameter is added to the public cutlass_fused_moe() signature at line 602 but is not documented in the docstring. Every other parameter in that function is documented. This should be added between the swiglu_limit and tp_size entries:

swizzled_input_sf : bool, optional
    Whether the input scaling factor tensor (``input_sf``) is in swizzled layout.
    Set to ``False`` when ``input_sf`` is received in linear (row-major) layout,
    e.g. after FP4 allgather/alltoall operations. Defaults to ``True`` to maintain
    backward compatibility.

2. Bit-exact test assertion may be fragile

The test asserts:

torch.testing.assert_close(output_swizzled, output_linear, rtol=0, atol=0)

Bit-exact equality (rtol=0, atol=0) between two different execution paths is a strong claim. If both paths go through identical arithmetic after swizzle handling, this is correct. But if the kernel has any path divergence (e.g., different memory access patterns causing different rounding) this could produce spurious CI failures on some hardware.

Consider using a small tolerance (rtol=1e-5, atol=0) or leaving a comment explaining why bit-exact output is expected here (e.g., "swizzle is purely a memory reordering with no arithmetic impact, so outputs must be identical").
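For reference, the closeness criterion behind `torch.testing.assert_close` is `|actual - expected| <= atol + rtol * |expected|`, which degenerates to exact equality at `rtol=0, atol=0`. A plain-Python sketch of that criterion (not the torch implementation) makes the trade-off concrete:

```python
def close(actual: float, expected: float, rtol: float = 0.0, atol: float = 0.0) -> bool:
    # Standard closeness criterion: |actual - expected| <= atol + rtol * |expected|.
    # With rtol=0 and atol=0 this is exact equality; any rounding drift fails.
    return abs(actual - expected) <= atol + rtol * abs(expected)

assert close(1.0, 1.0)                    # bit-exact output passes
assert not close(1.0, 1.0 + 1e-6)        # rtol=0, atol=0 rejects any drift
assert close(1.0, 1.0 + 1e-6, rtol=1e-5)  # a small rtol absorbs rounding noise
```

Under this formula, a pure memory-reordering path should satisfy the zero-tolerance check, while any divergent rounding in the kernel would only pass the relaxed one.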


3. swizzled_input_sf is silently ignored for non-FP4 inputs

When input_sf is None (e.g., FP16/BF16 inputs), swizzled_input_sf is passed to the kernel runner but has no effect. This is functionally correct, but passing swizzled_input_sf=False in that case is misleading and could confuse callers. A lightweight guard or at least a docstring note would help:

# In cutlass_fused_moe():
if swizzled_input_sf is False and input_sf is None:
    # swizzled_input_sf has no effect when input_sf is not provided
    pass

Or just document it: "This parameter has no effect when input_sf is None."
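A lightweight version of such a guard could look like the following sketch; `check_input_sf_args` is a hypothetical helper for illustration, not part of the flashinfer API:

```python
import warnings

def check_input_sf_args(input_sf, swizzled_input_sf: bool) -> None:
    # Hypothetical guard: the layout flag is meaningless without a tensor,
    # so surface a warning instead of silently ignoring the argument.
    if input_sf is None and not swizzled_input_sf:
        warnings.warn(
            "swizzled_input_sf=False has no effect when input_sf is None",
            stacklevel=2,
        )
```

Calling it with `(None, False)` emits a `UserWarning`; all other combinations are silent, so existing callers are unaffected.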


4. Test only covers standard mode, not min_latency_mode

The swizzled_input_sf parameter flows into both run_moe and run_moe_min_latency in the C++ binding, but the new test only exercises the standard path. A comment noting this gap, or a separate test for min_latency_mode=True, would improve coverage confidence (this is less critical since min_latency_mode raises NotImplementedError for Blackwell currently).


Minor: Consistent skip pattern

The new test uses the same torch.cuda.get_device_capability()[0] not in [10, 11, 12] skip condition as existing NVFP4 tests (e.g. line 481) — good consistency.


Overall the change is correct and well-structured. The C++ parameter threading through both runMoe/runMoeMinLatency is clean, and removing the hardcoded swizzled_input_sf = true HACK comment is the right move. The docstring gap and test assertion strictness are the main things to address before merge.

@claude

claude Bot commented Mar 27, 2026

Code Review

Overall this is a clean, well-scoped change. The parameter threads correctly from Python through TVM-FFI lambdas to the C++ kernel. A few observations:

C++ binding: local-variable shadowing removal

In runMoe, the removed local bool const swizzled_input_sf = true was shadowing the new parameter of the same name. Removing it makes the parameter flow through silently - correct, but subtle. The same situation is handled differently in runMoeMinLatency (where the local was named swizzled_input_sf_ml, requiring an explicit substitution at the call site). A brief comment in runMoe noting that swizzled_input_sf now comes from the function parameter would improve clarity.

_fake_cutlass_fused_moe receives but never uses the parameter

The shape-inference stub accepts swizzled_input_sf: bool = True but does not use it (layout flags do not affect output shape, which is expected). A short comment explaining this intentional omission would prevent reviewers from wondering if it was accidentally dropped.

Docstring caveat with no runtime guard

The docstring correctly notes the parameter is Only relevant when input_sf is not None, but there is no assert or warning for the case where input_sf is None and swizzled_input_sf is False. Silent no-ops are fine here, but a matching note in the C++ binding would help downstream callers debug misconfiguration.

Test: zero-tolerance exact-match assertion

The test uses rtol=0, atol=0 (bitwise exact). This is appropriate since both paths process identical quantized values. However, if the underlying TRT-LLM kernel ever introduces non-determinism (parallel reductions), this could become flaky. Using a small tolerance matching the rest of the test file (most tests use rtol=1e-2, atol=1e-1) would be more robust.

Minor: pre-existing docstring typo on touched lines

fc2_expert_biases is documented as GEMM1 biases for each expert - should be GEMM2. Worth fixing while the docstring is already being updated.

Summary

Backward-compatible default, correct parameter threading, and a clear test scenario for the allgather/alltoall use case. The changes are minimal and well-targeted. The main actionable items are: (1) add a comment in runMoe clarifying the shadowing removal, (2) annotate the unused parameter in the fake stub, (3) optionally relax the test tolerance.

Inserting a new positional parameter between existing ones breaks backward
compatibility for callers that use positional arguments. Moving
swizzled_input_sf to the end (after activation_type) ensures no existing
call sites are affected.

The internal @register_custom_op function signature is unchanged since it is
not part of the public API and its positional order must match the TVM-FFI
lambda.
@samuellees samuellees force-pushed the claude/issue-2200-20260111-0748 branch from 5b4dcf3 to 3e04aa9 Compare March 27, 2026 10:20
@claude

claude Bot commented Mar 27, 2026

Code Review

Summary: This PR exposes swizzled_input_sf to allow callers to indicate whether the input scaling factor tensor is in swizzled or linear layout — enabling fusion of the de-swizzle step into the MOE kernel for FP4 allgather/alltoall scenarios. Overall this is a clean, well-scoped change.


What's done well

  • Backward compatible default: swizzled_input_sf=True matches the previous hardcoded behaviour, so existing callers are unaffected.
  • Doc string: The new parameter's documentation clearly explains the motivation (post-allgather linear layout) and when to use each value.
  • Test: test_moe_nvfp4_unswizzled_input_sf directly validates the equivalence of both paths and uses rtol=0, atol=0 since the two paths should produce bit-identical results. The device-capability guard is correct.

Observations / suggestions

1. Parameter position inconsistency between the public API and the internal call-through

In the public cutlass_fused_moe() signature (line 615), swizzled_input_sf sits after activation_type, at the very end of the argument list. In the internal _fake_cutlass_fused_moe() / TVM dispatch (lines 544, 502), it sits after swiglu_limit, before tp_size. These two orderings make it easy to accidentally mismatch argument positions if someone uses positional arguments on the public API or calls the internal helper directly.

It would be slightly cleaner to keep the position consistent, but given that all call sites currently use keyword arguments this is low priority.

2. swizzled_input_sf is silently ignored when input_sf is None

The parameter only has meaning when input_sf is provided, but there is no guard or warning when a caller passes swizzled_input_sf=False with input_sf=None. This is harmless today, but a brief note in the docstring — "ignored when input_sf is None" — would avoid confusion.

3. min_latency_mode path not covered by the new test

The runMoeMinLantency path receives swizzled_input_sf too (lines 427–442), but the new test only exercises the regular runMoe path. A brief call with min_latency_mode=True (or a pytest.mark.parametrize over both modes) would give confidence that the wiring is correct end-to-end for both code paths.

4. Minor nit: lambda in test body

round_up = lambda x, y: (x + y - 1) // y * y

pre-commit / ruff will flag a lambda assigned to a variable (E731). A def or an inline expression would be cleaner, matching the style of the rest of the test file.
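The E731-compliant replacement is a one-line `def`. A sketch, assuming a `ceil_div` helper like the one the review says already exists in the test module:

```python
def ceil_div(x: int, y: int) -> int:
    # Integer ceiling division: smallest n with n * y >= x.
    return (x + y - 1) // y

def round_up(x: int, y: int) -> int:
    # Round x up to the next multiple of y.
    return ceil_div(x, y) * y
```

Same behavior as the lambda, but named functions get proper tracebacks and satisfy Ruff E731.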


Verdict

The change is correct, well-motivated, and backward compatible. The suggestions above are mostly minor polish items. Happy to approve once the min-latency test coverage gap is addressed (or confirmed to be intentionally deferred).

… test

Both paths compute equivalent values; allow for minor floating-point
non-determinism while still catching real divergence (well below BF16
resolution of ~0.008).
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/moe/test_trtllm_cutlass_fused_moe.py (1)

1869-1879: Assert that the SF buffers actually differ before comparing outputs.

hidden_states_* are expected to match regardless of layout. Without a precondition on input_sf_swizzled vs input_sf_linear, this can still pass if the quantizer regresses and returns the same SF layout for both branches, so the new flag is never really exercised.

Suggested hardening
     # Both quantizations should produce the same quantized values
     assert torch.equal(hidden_states_swizzled, hidden_states_linear)
+    assert input_sf_swizzled.shape != input_sf_linear.shape or not torch.equal(
+        input_sf_swizzled, input_sf_linear
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_trtllm_cutlass_fused_moe.py` around lines 1869 - 1879, Before
asserting outputs equal, add a precondition check that the scale-factor buffers
differ: call fp4_quantize to obtain input_sf_swizzled and input_sf_linear and
assert they are not equal (e.g., torch.any(input_sf_swizzled !=
input_sf_linear)) to ensure the swizzled vs linear branch was exercised; only
then assert torch.equal(hidden_states_swizzled, hidden_states_linear). This
ensures fp4_quantize's is_sf_swizzled_layout flag actually changes SF layout
before comparing hidden_states_swizzled and hidden_states_linear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/moe/test_trtllm_cutlass_fused_moe.py`:
- Around line 1799-1802: Replace the direct torch.cuda.get_device_capability()
check in the pytest.skipif decorator with the repo helper that centralizes GPU
capability logic: import and use is_sm100a_supported() (or
get_compute_capability()/is_sm90a_supported() if more appropriate) instead of
torch.cuda.get_device_capability()[0] not in [10, 11, 12]; update the decorator
to pytest.mark.skipif(not is_sm100a_supported(), reason=...) and add the import
for is_sm100a_supported() at the top of the test module so the skip logic stays
consistent with the rest of the CUDA test suite.
- Line 1819: Replace the lambda assignment to round_up with a regular function
that calls the existing ceil_div helper: implement round_up as a normal def (not
a lambda) that returns the result of ceil_div(x, y) multiplied by y; reference
the existing ceil_div function in the module and remove the lambda assignment to
fix the Ruff E731 violation.

---

Nitpick comments:
In `@tests/moe/test_trtllm_cutlass_fused_moe.py`:
- Around line 1869-1879: Before asserting outputs equal, add a precondition
check that the scale-factor buffers differ: call fp4_quantize to obtain
input_sf_swizzled and input_sf_linear and assert they are not equal (e.g.,
torch.any(input_sf_swizzled != input_sf_linear)) to ensure the swizzled vs
linear branch was exercised; only then assert
torch.equal(hidden_states_swizzled, hidden_states_linear). This ensures
fp4_quantize's is_sf_swizzled_layout flag actually changes SF layout before
comparing hidden_states_swizzled and hidden_states_linear.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 621b187e-d44d-472d-8f45-baa2c317e257

📥 Commits

Reviewing files that changed from the base of the PR and between ff12b88 and 267869a.

📒 Files selected for processing (3)
  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
  • flashinfer/fused_moe/core.py
  • tests/moe/test_trtllm_cutlass_fused_moe.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • flashinfer/fused_moe/core.py

Comment thread tests/moe/test_trtllm_cutlass_fused_moe.py
Comment thread tests/moe/test_trtllm_cutlass_fused_moe.py Outdated
@claude

claude Bot commented Mar 27, 2026

test comment - please ignore

@claude

claude Bot commented Mar 27, 2026

test edit

@claude

claude Bot commented Mar 27, 2026

Code Review

Critical Bug: Positional Argument Mismatch in Outer Call Site

File: flashinfer/fused_moe/core.py, lines 786-817

The inner cutlass_fused_moe custom op (line 373) had swizzled_input_sf: bool = True inserted at positional slot 14 (between swiglu_limit and tp_size), but the outer public cutlass_fused_moe call site (lines 786-817) was NOT updated to match. It still passes tp_size, tp_rank, ep_size, ep_rank, cluster_size, cluster_rank positionally after swiglu_limit, then passes swizzled_input_sf as a keyword argument. Every positional argument after swiglu_limit therefore shifts by one slot (tp_size lands in swizzled_input_sf, tp_rank in tp_size, and so on through cluster_rank), and the explicit swizzled_input_sf=swizzled_input_sf keyword then collides with the positionally bound value, causing TypeError: got multiple values for argument on any call path reaching this code.

The fix is to insert swizzled_input_sf positionally between swiglu_limit and tp_size at line ~800, or switch tp_size through cluster_rank to keyword arguments.
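This failure mode is easy to reproduce standalone; the names below are illustrative, not the flashinfer signatures:

```python
def inner_op(a, b, new_flag=True, c=0):
    # Stand-in for the custom op after a new parameter was inserted
    # between existing positional slots.
    return (a, b, new_flag, c)

# A caller written against the old signature still passes c positionally
# and the new flag by keyword; both now target the same slot:
try:
    inner_op(1, 2, 3, new_flag=False)  # positional 3 binds to new_flag first
except TypeError as e:
    assert "multiple values" in str(e)

# Passing the trailing arguments by keyword sidesteps the collision:
assert inner_op(1, 2, c=3, new_flag=False) == (1, 2, False, 3)
```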

Why the New Test Does Not Catch This

test_moe_nvfp4_unswizzled_input_sf calls the public outer API and would hit this bug. However, it is guarded by an SM100/110/120 skip condition, so it will not run in CI environments without those GPUs. The skip guard is correct, but it means the bug is not exercised in broader CI.

Other Observations (Non-blocking)

C++ binding changes look correct. swizzled_input_sf is consistently threaded through runMoe, runMoeMinLatency, and both TVM GetFunction lambdas (run_moe and run_moe_min_latency). Removing the internal bool const swizzled_input_sf = true hardcodes is the right approach.

Backward compatibility is properly handled. Default value True in all signatures ensures no existing callers are broken.

Test design is sound. Using rtol=0, atol=0 is appropriate - swizzling is purely a memory layout transformation, so the kernel output should be bit-exact regardless of which layout is used. The assertion torch.equal(hidden_states_swizzled, hidden_states_linear) verifying the quantized values are identical is a good sanity check.

Minor nit: The round_up lambda in the test is used 6 times; a named function would be slightly more readable, but this is low priority for a test file.

- Replace torch.cuda.get_device_capability() check with is_sm100a_supported()
  helper for consistent GPU capability detection across the test suite
- Replace round_up lambda with def to fix Ruff E731
- Add assertion that input_sf buffers differ between swizzled/linear layouts
  so the test cannot pass trivially if fp4_quantize ignores is_sf_swizzled_layout
@claude

claude Bot commented Mar 27, 2026

Code Review

This PR cleanly exposes swizzled_input_sf as a user-controllable parameter instead of a hardcoded true, enabling fusion of the swizzle operation into the MoE kernel after FP4 allgather/alltoall. The implementation is sound overall. A few observations:

Correctness

Parameter threading is correct. The boolean propagates through all four call sites:

  • runMoe (non-min-latency)
  • runMoeMinLantency (min-latency)
  • GetFunction("run_moe") lambda
  • GetFunction("run_moe_min_latency") lambda

The trick of reusing the same variable name (swizzled_input_sf) in runMoe is subtle but correct — the diff only shows removal of the hardcoded assignment; the downstream mKernelRunner->runMoe(..., swizzled_input_sf, ...) call is unchanged because the variable name is identical. Worth a comment noting this was intentional.

input_sf=None with swizzled_input_sf=False is handled correctly — the C++ side nulls out the pointer, making the flag irrelevant.

API Design

Parameter placement inconsistency: swizzled_input_sf appears in different positions between the inner custom op (core.py:388, between swiglu_limit and tp_size) and the public API (core.py:618, at the end). This isn't a bug (the call site at line 816 uses swizzled_input_sf=swizzled_input_sf), but it's worth noting this divergence for future maintainers who may be confused when comparing the two signatures.

Backward compatibility is maintained — default True preserves existing behavior.

Test

The test structure is solid:

# Guard against trivial pass
assert not torch.equal(input_sf_swizzled, input_sf_linear), ...

This sanity check is important — good practice.

Concern: test tolerances. rtol=1e-3, atol=1e-3 for FP4 quantized output could be tight for some inputs. FP4 has limited precision and non-linear quantization errors can accumulate across tokens. Consider using slightly looser tolerances (e.g. rtol=1e-2, atol=1e-2) that other NVFP4 tests in this file use, or document why these tighter tolerances are appropriate here.

Missing coverage: The runMoeMinLantency code path is updated but not exercised by any test. The public API raises NotImplementedError for min-latency mode on SM100a (Blackwell), so this is unavoidable for the current test environment — but it'd be worth a comment explaining why min-latency coverage is absent.

Minor nit: quant_blocksize = 16 is defined but only used for sf_w1_k and sf_w2_n calculations. The 128 in sf_w1_2n = round_up(w1_n, 128) and sf_w2_k = round_up(k, 128) are hardcoded — are these the same blocksize concept (just the other dimension), or different? A brief comment would clarify.

Pre-existing Issue (not this PR)

fc2_expert_biases docstring (core.py:670) says "GEMM1 biases" — typo, should be "GEMM2 biases". Out of scope here, but worth fixing in a follow-up.


Overall: the change is clean, the default maintains backward compatibility, and the approach is correct. The main asks are around test tolerances and coverage of the min-latency path in the docstring/comment.

@samuellees
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !465 has been created, and the CI pipeline #47094985 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47094985: 6/20 passed

…_moe signature

The inner cutlass_fused_moe (registered via torch.library.custom_op) has
swizzled_input_sf at position 15 (after swiglu_limit), but the outer wrapper
was passing tp_size at that position and then also passing swizzled_input_sf
as a keyword argument, causing:

  TypeError: cutlass_fused_moe() got multiple values for argument 'swizzled_input_sf'

Fix: insert swizzled_input_sf into the positional argument list between
swiglu_limit and tp_size, and remove the duplicate keyword argument.

AI-assisted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@samuellees
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !465 has been updated with latest changes, and the CI pipeline #47128436 is currently running. I'll report back once the pipeline job completes.

@claude

claude Bot commented Mar 28, 2026

Code Review

Overall this is a clean, focused change that exposes a previously hardcoded parameter. Backward-compatible and well-documented. A few observations below.

Correctness

Both execution paths are covered. swizzled_input_sf is correctly threaded through both runMoe and runMoeMinLantency, and both the regular and min-latency Python call sites pass the new parameter. No path was missed.

_fake_cutlass_fused_moe is updated. Important for torch.compile compatibility -- the fake kernel signature must stay in sync with the real one.

Minor Issues

Parameter position inconsistency (non-blocking). In the internal helpers (_fake_cutlass_fused_moe and the lower-level overload), swizzled_input_sf is inserted between swiglu_limit and tp_size, matching the C++ signature order. In the public cutlass_fused_moe it is placed after activation_type. Not wrong since all are kwargs, but slightly surprising when reading both definitions side-by-side.

Test import ordering. from flashinfer.utils import is_sm100a_supported is inserted before the from flashinfer import (...) block. Per isort convention all flashinfer imports should be grouped together.

Pre-existing typo (not introduced here). runMoeMinLantency should be runMoeMinLatency -- could be fixed in a follow-up.

Test Coverage

The test structure is solid: it runs both swizzled and linear layouts and asserts numerical closeness. The explicit sanity-check that the two SF buffers actually differ is a nice touch -- prevents the test passing trivially if fp4_quantize ignored is_sf_swizzled_layout.

Optional enhancement: add a min_latency_mode=True variant to also exercise the runMoeMinLantency path with swizzled_input_sf=False, since that is an independent C++ code path. Not a blocker.

Summary

The change is correct and well-tested. Only actionable items are the import ordering nit and optional min-latency coverage. LGTM with the minor import fix.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47128436: 12/20 passed

Collaborator

@jiahanc jiahanc left a comment

LGTM, thanks for the contribution!

@jiahanc jiahanc added the run-ci label Mar 30, 2026
@jiahanc jiahanc enabled auto-merge (squash) March 30, 2026 00:09
@jiahanc jiahanc merged commit 4941606 into main Mar 30, 2026
54 checks passed
@jiahanc jiahanc deleted the claude/issue-2200-20260111-0748 branch March 30, 2026 04:18

Development

Successfully merging this pull request may close these issues.

Feature Request: Support swizzled_input_sf for cutlass fused moe.

5 participants