feat: expose swizzled_input_sf parameter for CUTLASS fused MOE (#2330)
Conversation
> **Note** Reviews paused: it looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds a new `swizzled_input_sf` parameter to the CUTLASS fused MOE path, threaded from the Python API through the TVM bindings to the C++ runner.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User Code
    participant PyAPI as Python API (cutlass_fused_moe)
    participant TVM as TVM Binding (GetFunction)
    participant Cpp as C++ Runner (FusedMoeRunner)
    participant Kernel as CUDA Kernel
    User->>PyAPI: call cutlass_fused_moe(..., swizzled_input_sf)
    PyAPI->>TVM: forward args including swizzled_input_sf
    TVM->>Cpp: invoke lambda with swizzled_input_sf
    Cpp->>Kernel: launch kernel using swizzled_input_sf
    Kernel-->>Cpp: kernel completes
    Cpp-->>TVM: return results
    TVM-->>PyAPI: return outputs
    PyAPI-->>User: return outputs
```
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request enhances the CUTLASS fused Mixture-of-Experts (MOE) implementation by exposing a new configuration parameter, `swizzled_input_sf`.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (1)
710-931: **Critical:** `swizzled_input_sf` parameter missing from public API function.

The new `swizzled_input_sf` parameter was added to the inner `cutlass_fused_moe` function (line 510) and passed to `run_moe` (line 627), but it's missing from:

- The public `cutlass_fused_moe` function signature (lines 710-740)
- The call to the inner function (lines 901-931)

This means users cannot actually control the swizzle behavior through the public API, defeating the purpose of this PR.
🐛 Proposed fix
Add the parameter to the public function signature (after `swiglu_limit`):

```diff
     swiglu_alpha: Optional[torch.Tensor] = None,
     swiglu_beta: Optional[torch.Tensor] = None,
     swiglu_limit: Optional[torch.Tensor] = None,
+    swizzled_input_sf: bool = True,
     tp_size: int = 1,
```

And pass it in the call (around line 916):

```diff
     swiglu_alpha,
     swiglu_beta,
     swiglu_limit,
+    swizzled_input_sf,
     tp_size,
```

Also add documentation for the parameter in the docstring.
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (1)
420-433: Pipeline failure: clang-format check failed.

The CI indicates a formatting issue around line 424. Based on the code structure, ensure consistent formatting of the function signature.

Run `clang-format` on this file to fix the formatting:

```shell
clang-format -i csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
- flashinfer/fused_moe/core.py
🧰 Additional context used
📓 Path-based instructions (2)
flashinfer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`flashinfer/**/*.py`: Use the `@functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation.
Use the `@flashinfer_api` decorator for debugging API calls; enable via the `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats).
Files:
flashinfer/fused_moe/core.py
csrc/**/*.cu
📄 CodeRabbit inference engine (CLAUDE.md)
Framework bindings and PyTorch tensor handling should be implemented in `csrc/` via TVM-FFI, not in `include/` headers.
Files:
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
🧬 Code graph analysis (1)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (2)
flashinfer/comm/mapping.py (1)
`tp_rank` (325-326)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1)

`enable_pdl` (220-220)
🪛 GitHub Actions: pre-commit
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
[error] 424-424: clang-format check failed. Files were modified by this hook during pre-commit.
🪛 Ruff (0.14.10)
flashinfer/fused_moe/core.py
669-669: Unused function argument: swizzled_input_sf
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Deploy Docs
- GitHub Check: claude-review
🔇 Additional comments (5)
flashinfer/fused_moe/core.py (2)
653-701: The `swizzled_input_sf` parameter is correctly added to the fake op.

The static analysis warning about the unused parameter is expected for fake ops, as they only define output shapes and dtypes without executing actual logic.
510-510: Parameter correctly threaded through inner function to runtime call.

The `swizzled_input_sf` parameter is properly added with default `True` for backward compatibility, and correctly passed to the underlying `run_moe` call.

Also applies to: 627-627
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (3)
246-250: Parameter correctly integrated into `runMoe` method.

The `swizzled_input_sf` parameter is properly:

- Added to the function signature after `swiglu_limit`
- Passed through to `mKernelRunner->runMoe` in both `USING_OSS_CUTLASS_MOE_GEMM` and non-OSS paths

Also applies to: 385-400
426-433: Parameter correctly integrated into `runMoeMinLantency` method.

The `swizzled_input_sf` is properly threaded through the min-latency execution path in both OSS and non-OSS code paths.

Also applies to: 569-602
718-758: TVM-FFI bindings correctly updated.

Both `run_moe` and `run_moe_min_latency` function bindings properly include the new `swizzled_input_sf` parameter and pass it through to the respective runner methods. This aligns with the coding guidelines for framework bindings in `csrc/`.
Code Review
This pull request exposes a new swizzled_input_sf parameter for the CUTLASS fused MoE implementation, allowing control over whether the input scaling factor is swizzled. The changes are well-implemented, consistently propagating the new parameter from the Python API down to the C++ kernel calls. The default value is set to True to maintain backward compatibility, as described. The code looks good, but I have one suggestion to improve documentation by adding the new parameter to the function's docstring.
```python
    swiglu_alpha: Optional[torch.Tensor] = None,
    swiglu_beta: Optional[torch.Tensor] = None,
    swiglu_limit: Optional[torch.Tensor] = None,
    swizzled_input_sf: bool = True,
```
Code Review

I've reviewed PR #2330 and overall the implementation looks good. The change cleanly exposes the `swizzled_input_sf` parameter.

✅ Strengths
🔍 Potential Issues & Suggestions

1. Missing Test Coverage (High Priority)

The PR adds a new user-facing parameter but doesn't include any tests exercising it. This is risky because:
Recommendation: Add at least one test case that calls the public API with the new parameter. Example:

```python
@pytest.mark.parametrize("swizzled_input_sf", [True, False])
def test_moe_nvfp4_swizzled_input_sf(batch_size, ..., swizzled_input_sf):
    # ... existing setup ...
    _ = fused_moe.cutlass_fused_moe(
        hidden_states,
        selected_experts.to(torch.int),
        routing_weights,
        w1_q.contiguous().view(torch.long),
        w2_q.contiguous().view(torch.long),
        otype,
        quant_scales=quant_scales,
        input_sf=input_sf,
        swizzled_input_sf=swizzled_input_sf,  # Test the new parameter
        output=flash_output,
        activation_type=activation_type,
    )
```

2. Missing Documentation (Medium Priority)

The parameter lacks a docstring explaining:
Recommendation: Add parameter documentation to the docstring. Example docstring addition:

```python
def cutlass_fused_moe(
    ...
    swizzled_input_sf: bool = True,  # Whether the input scaling factor is swizzled
    ...
):
    """
    ...
    Args:
        ...
        swizzled_input_sf: Whether the input scaling factor (input_sf) is already
            swizzled. Set to False when the swizzle operation should be fused into
            the MOE kernel (e.g., after FP4 allgather/alltoall operations).
            Default: True (maintains backward compatibility).
        ...
    """
```

3. Fake Op Signature (Low Priority)

The fake op accepts the parameter without using it, which is expected since fake ops only define output shapes and dtypes.

🔒 Security & Performance
📋 Code Quality
Summary

This is a solid implementation that enables an important performance optimization. The main gap is test coverage; adding even a basic test would significantly increase confidence in this change.

Recommendation: Request the author to add test coverage before merging. Otherwise, the implementation is ready.

Review generated with assistance from Claude Code
Force-pushed 1a50f5a to ff12b88.
Add swizzled_input_sf parameter to allow users to control whether the input scaling factor is swizzled. This enables fusion of the swizzle operation into the MOE kernel after FP4 allgather/alltoall operations.

Changes:
- Add swizzled_input_sf parameter to cutlass_fused_moe Python API with default value of True (maintaining backward compatibility)
- Update C++ TVM-FFI bindings to accept and pass through the parameter
- Remove hardcoded swizzled_input_sf = true in binding layer

Closes #2200

Co-authored-by: Zihao Ye <yzh119@users.noreply.github.com>
Co-authored-by: Kesen Li <lsam@nvidia.com>
ff12b88 to
50dc605
Compare
Code Review Summary: Clean, focused change that correctly promotes swizzled_input_sf from a hardcoded internal default to an explicit caller-controlled parameter. The threading from the Python API through the C++ bindings is consistent and correct.

Issues
The public cutlass_fused_moe docstring (core.py around line 672) documents input_sf, swiglu_alpha, etc. but has no entry for swizzled_input_sf. Since this is a user-visible parameter with non-obvious semantics, it deserves documentation covering: what swizzling means, when to pass False (e.g. after FP4 allgather/alltoall), and that it only matters when input_sf is provided.
A caller can pass swizzled_input_sf=False with a non-quantized input where input_sf=None. This is likely a silent misconfiguration. Consider adding a guard analogous to the existing checks around line 497-500 in the binding.
The change introduces a new code path but no tests exercise it. A test calling with swizzled_input_sf=False on a linear-layout input_sf would validate the new use case and guard against regressions.

Minor observations
Overall the change is correct and the default of True properly maintains backward compatibility. Main gaps are documentation and a test for the False path.
The original commit only added the parameter to the internal @register_custom_op function and fake op, but missed the public cutlass_fused_moe() function signature and its call to the internal function. This caused the parameter to be silently ignored.

Co-authored-by: Zihao Ye <yzh119@users.noreply.github.com>
Co-authored-by: Kesen Li <lsam@nvidia.com>
Force-pushed 50dc605 to 2bc2a26.
Code Review

This is a clean, focused change that exposes `swizzled_input_sf`.

Correctness

Argument ordering is consistent across both paths, in the Python dispatcher and the C++ binding alike.

Documentation gap

The new parameter is not yet documented in the public docstring.

No input validation

There is no guard against passing `swizzled_input_sf=False` without a quantized input:

```python
if not swizzled_input_sf and input_sf is None:
    raise ValueError("swizzled_input_sf=False requires input_sf to be provided")
```

Test coverage

One gap: the test only exercises the Blackwell (SM100/110/120) code path; the SM90 path is not covered.

Minor nit

The existing typo (`runMoeMinLantency`) is pre-existing and not introduced by this PR.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (2)
588-619: ⚠️ Potential issue | 🟠 Major

Don't insert the new public flag ahead of existing positional args.

`swizzled_input_sf=True` does not preserve backward compatibility here: existing positional callers will silently shift `tp_size`, `tp_rank`, `ep_size`, and the rest into the wrong slots. Please append the new parameter at the end of the public API surface and forward it by keyword; while you're here, the docstring below should explain what `False` means.

Suggested fix

```diff
-    swizzled_input_sf: bool = True,
     tp_size: int = 1,
     tp_rank: int = 0,
     ep_size: int = 1,
     ep_rank: int = 0,
     cluster_size: int = 1,
@@
     use_packed_weights: bool = False,
     tune_max_num_tokens: int = 8192,
     enable_pdl: Optional[bool] = None,
     activation_type: ActivationType = ActivationType.Swiglu,
+    swizzled_input_sf: bool = True,
 ) -> torch.Tensor:
@@
-        swizzled_input_sf,
         tp_size,
         tp_rank,
         ep_size,
         ep_rank,
         cluster_size,
         cluster_rank,
+        swizzled_input_sf=swizzled_input_sf,
         use_packed_weights=use_packed_weights,
```

Also applies to: 780-811
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/core.py` around lines 588 - 619, The new public boolean parameter swizzled_input_sf was inserted before existing positional arguments in cutlass_fused_moe which breaks backward compatibility; move swizzled_input_sf to the end of the parameter list (after activation_type) so existing positional callers keep the same argument mapping, update the function docstring to describe what swizzled_input_sf=False means, and ensure any internal calls or forwards pass it by keyword (swizzled_input_sf=...) rather than position; make the same relocation and docstring/forwarding change for the other overloaded/duplicate function occurrence mentioned in the review (the second cutlass_fused_moe-like signature around the later block).
531-562: ⚠️ Potential issue | 🟠 Major

Add `activation_type` to `_fake_cutlass_fused_moe`.

The real custom op still accepts `activation_type`, and the wrapper forwards it at line 810. Leaving it out here makes the fake schema diverge from the real one, which can break `torch.compile` / fake-tensor execution.

Suggested fix

```diff
-    enable_pdl: Optional[bool] = None,
-    use_packed_weights: bool = False,
+    enable_pdl: Optional[bool] = None,
+    activation_type: ActivationType = ActivationType.Swiglu,
+    use_packed_weights: bool = False,
```

Based on learnings: fake ops decorated with `register_fake_op` in `flashinfer/fused_moe` must exactly mirror the corresponding real op signatures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/core.py` around lines 531 - 562, The fake op function _fake_cutlass_fused_moe is missing the activation_type parameter, causing its signature to diverge from the real custom op and breaking torch.compile/fake-tensor paths; add an activation_type: Optional[int] (or the same type used by the real op) parameter to _fake_cutlass_fused_moe and ensure it is accepted and forwarded just like the wrapper does (the wrapper already forwards activation_type), so the fake op signature exactly mirrors the real op.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu`:
- Around line 430-431: The function signature/call formatting around the
parameters Optional<TensorView> swiglu_limit, bool swizzled_input_sf, TensorView
num_active_experts_per_node, and the similar block at the other occurrence must
be normalized by running clang-format on the file so pre-commit stops rewriting
it; run clang-format with the project's config (or run the repo's pre-commit
hook) to reflow the parameter lists and call sites, save and stage the updated
file so CI no longer flags the formatting differences.
---
Outside diff comments:
In `@flashinfer/fused_moe/core.py`:
- Around line 588-619: The new public boolean parameter swizzled_input_sf was
inserted before existing positional arguments in cutlass_fused_moe which breaks
backward compatibility; move swizzled_input_sf to the end of the parameter list
(after activation_type) so existing positional callers keep the same argument
mapping, update the function docstring to describe what swizzled_input_sf=False
means, and ensure any internal calls or forwards pass it by keyword
(swizzled_input_sf=...) rather than position; make the same relocation and
docstring/forwarding change for the other overloaded/duplicate function
occurrence mentioned in the review (the second cutlass_fused_moe-like signature
around the later block).
- Around line 531-562: The fake op function _fake_cutlass_fused_moe is missing
the activation_type parameter, causing its signature to diverge from the real
custom op and breaking torch.compile/fake-tensor paths; add an activation_type:
Optional[int] (or the same type used by the real op) parameter to
_fake_cutlass_fused_moe and ensure it is accepted and forwarded just like the
wrapper does (the wrapper already forwards activation_type), so the fake op
signature exactly mirrors the real op.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: acd62a87-2bbf-4eab-9aa8-33ade59826cb
📒 Files selected for processing (2)
- csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
- flashinfer/fused_moe/core.py
Force-pushed 2bc2a26 to 92d0912.
Verifies that passing a linear-layout input_sf with swizzled_input_sf=False produces identical output to a swizzled-layout input_sf with the default swizzled_input_sf=True. This covers the FP4 allgather/alltoall use case where input_sf is in linear layout after communication.

Co-authored-by: Kesen Li <lsam@nvidia.com>
Force-pushed 92d0912 to 267869a.
Code Review

This PR cleanly exposes `swizzled_input_sf`.

1. Missing docstring entry for `swizzled_input_sf`.
Code Review

Overall this is a clean, well-scoped change. The parameter threads correctly from Python through the TVM-FFI lambdas to the C++ kernel. A few observations:

C++ binding: local-variable shadowing removal. In runMoe, the removed hardcoded local (`swizzled_input_sf = true`) is replaced by the caller-supplied parameter; a brief comment would make the change clearer.

Fake op: `_fake_cutlass_fused_moe` receives but never uses the parameter. The shape-inference stub accepts `swizzled_input_sf` without consuming it, which is expected, though an annotation would silence the lint warning.

Docstring caveat with no runtime guard: the docstring correctly notes the parameter only matters when input_sf is provided, but there is no corresponding runtime check.

Test: zero-tolerance exact-match assertion. The test uses exact equality; consider whether a small tolerance is more robust.

Minor: pre-existing docstring typo on touched lines.

Summary

Backward-compatible default, correct parameter threading, and a clear test scenario for the allgather/alltoall use case. The changes are minimal and well-targeted. The main actionable items are: (1) add a comment in runMoe clarifying the shadowing removal, (2) annotate the unused parameter in the fake stub, (3) optionally relax the test tolerance.
Inserting a new positional parameter between existing ones breaks backward compatibility for callers that use positional arguments. Moving swizzled_input_sf to the end (after activation_type) ensures no existing call sites are affected. The internal @register_custom_op function signature is unchanged since it is not part of the public API and its positional order must match the TVM-FFI lambda.
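A design note on the commit above: another way to rule out positional-shift bugs entirely is to make new flags keyword-only with a bare `*`. This is a hedged sketch with an illustrative name, not the actual flashinfer signature:

```python
def cutlass_fused_moe_like(hidden_states, *, swizzled_input_sf=True, tp_size=1):
    # Everything after the bare * must be passed by keyword, so a new flag
    # can never silently capture an old caller's positional argument.
    return (swizzled_input_sf, tp_size)

# Keyword calls work as before.
assert cutlass_fused_moe_like("x", tp_size=4) == (True, 4)

# A positional flag is rejected outright instead of being misinterpreted.
try:
    cutlass_fused_moe_like("x", False)
    raise AssertionError("expected TypeError")
except TypeError:
    pass
```

The trade-off is that keyword-only parameters are a hard break for any caller that was already passing those slots positionally, which is why appending at the end was chosen here.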
Force-pushed 5b4dcf3 to 3e04aa9.
Code Review

Summary: This PR exposes `swizzled_input_sf` through the public API.

What's done well

Observations / suggestions

1. Parameter position inconsistency between the public API and the internal call-through. In the public function the parameter sits at the end of the signature, while the internal op takes it mid-signature. It would be slightly cleaner to keep the position consistent, but given that all call sites currently use keyword arguments this is low priority.

2. The parameter only has meaning when input_sf is provided.

3. The min-latency path is not exercised by the new test.

4. Minor nit: lambda in test body:

```python
round_up = lambda x, y: (x + y - 1) // y * y
```

Verdict

The change is correct, well-motivated, and backward compatible. The suggestions above are mostly minor polish items. Happy to approve once the min-latency test coverage gap is addressed (or confirmed to be intentionally deferred).
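The lambda nit above (Ruff E731) is typically resolved by a named helper; a sketch building on a `ceil_div` helper (a name assumed here, matching the suggestion to reuse the test module's existing helper):

```python
def ceil_div(x: int, y: int) -> int:
    # Integer ceiling division: smallest n with n * y >= x.
    return (x + y - 1) // y

def round_up(x: int, y: int) -> int:
    # Named replacement for: round_up = lambda x, y: (x + y - 1) // y * y
    return ceil_div(x, y) * y
```

A named `def` keeps the traceback and `repr` readable and satisfies the linter without changing behavior.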
… test

Both paths compute equivalent values; allow for minor floating-point non-determinism while still catching real divergence (well below BF16 resolution of ~0.008).
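The "~0.008" figure in the commit message is the BF16 spacing near 1.0: BF16 keeps 7 explicit significand bits, so adjacent representable values around magnitude 1 differ by 2**-7. A small self-contained sketch (helper name is illustrative):

```python
import math

def bf16_spacing_near(x: float) -> float:
    # BF16 has 7 explicit significand bits, so the gap between adjacent
    # representable values near |x| is 2**(floor(log2|x|) - 7).
    exp = math.floor(math.log2(abs(x)))
    return 2.0 ** (exp - 7)

# Near 1.0 the spacing is 2**-7 = 0.0078125, i.e. the "~0.008" in the commit.
assert bf16_spacing_near(1.0) == 0.0078125
```

Any tolerance well below this spacing still distinguishes genuinely divergent BF16 outputs while tolerating benign non-determinism.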
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/moe/test_trtllm_cutlass_fused_moe.py (1)
1869-1879: Assert that the SF buffers actually differ before comparing outputs.

`hidden_states_*` are expected to match regardless of layout. Without a precondition on `input_sf_swizzled` vs `input_sf_linear`, this can still pass if the quantizer regresses and returns the same SF layout for both branches, so the new flag is never really exercised.

Suggested hardening

```diff
 # Both quantizations should produce the same quantized values
 assert torch.equal(hidden_states_swizzled, hidden_states_linear)
+assert input_sf_swizzled.shape != input_sf_linear.shape or not torch.equal(
+    input_sf_swizzled, input_sf_linear
+)
```
Verify each finding against the current code and only fix it if needed. In `@tests/moe/test_trtllm_cutlass_fused_moe.py` around lines 1869 - 1879, Before asserting outputs equal, add a precondition check that the scale-factor buffers differ: call fp4_quantize to obtain input_sf_swizzled and input_sf_linear and assert they are not equal (e.g., torch.any(input_sf_swizzled != input_sf_linear)) to ensure the swizzled vs linear branch was exercised; only then assert torch.equal(hidden_states_swizzled, hidden_states_linear). This ensures fp4_quantize's is_sf_swizzled_layout flag actually changes SF layout before comparing hidden_states_swizzled and hidden_states_linear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/moe/test_trtllm_cutlass_fused_moe.py`:
- Around line 1799-1802: Replace the direct torch.cuda.get_device_capability()
check in the pytest.skipif decorator with the repo helper that centralizes GPU
capability logic: import and use is_sm100a_supported() (or
get_compute_capability()/is_sm90a_supported() if more appropriate) instead of
torch.cuda.get_device_capability()[0] not in [10, 11, 12]; update the decorator
to pytest.mark.skipif(not is_sm100a_supported(), reason=...) and add the import
for is_sm100a_supported() at the top of the test module so the skip logic stays
consistent with the rest of the CUDA test suite.
- Line 1819: Replace the lambda assignment to round_up with a regular function
that calls the existing ceil_div helper: implement round_up as a normal def (not
a lambda) that returns the result of ceil_div(x, y) multiplied by y; reference
the existing ceil_div function in the module and remove the lambda assignment to
fix the Ruff E731 violation.
---
Nitpick comments:
In `@tests/moe/test_trtllm_cutlass_fused_moe.py`:
- Around line 1869-1879: Before asserting outputs equal, add a precondition
check that the scale-factor buffers differ: call fp4_quantize to obtain
input_sf_swizzled and input_sf_linear and assert they are not equal (e.g.,
torch.any(input_sf_swizzled != input_sf_linear)) to ensure the swizzled vs
linear branch was exercised; only then assert
torch.equal(hidden_states_swizzled, hidden_states_linear). This ensures
fp4_quantize's is_sf_swizzled_layout flag actually changes SF layout before
comparing hidden_states_swizzled and hidden_states_linear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 621b187e-d44d-472d-8f45-baa2c317e257
📒 Files selected for processing (3)
- csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
- flashinfer/fused_moe/core.py
- tests/moe/test_trtllm_cutlass_fused_moe.py
🚧 Files skipped from review as they are similar to previous changes (1)
- flashinfer/fused_moe/core.py
Code Review

Critical Bug: Positional Argument Mismatch in Outer Call Site

File: flashinfer/fused_moe/core.py, lines 786-817

The inner cutlass_fused_moe custom op (line 373) had swizzled_input_sf: bool = True inserted at positional slot 14 (between swiglu_limit and tp_size). But the outer public cutlass_fused_moe call site (lines 786-817) was NOT updated to match. It still passes tp_size, tp_rank, ep_size, ep_rank, cluster_size, cluster_rank positionally after swiglu_limit, then passes swizzled_input_sf as a keyword argument. This means tp_size lands in the swizzled_input_sf slot, tp_rank lands in tp_size, ep_size lands in tp_rank, ep_rank lands in ep_size, cluster_size lands in ep_rank, cluster_rank lands in cluster_size, and swizzled_input_sf=swizzled_input_sf then causes a TypeError ("got multiple values for argument") on any call path reaching this code.

The fix is to insert swizzled_input_sf positionally between swiglu_limit and tp_size at line ~800, or switch tp_size through cluster_rank to keyword arguments.

Why the New Test Does Not Catch This

test_moe_nvfp4_unswizzled_input_sf calls the public outer API and would hit this bug. However, it is guarded by an SM100/110/120 skip condition, so it will not run in CI environments without those GPUs. The skip guard is correct, but it means the bug is not exercised in broader CI.

Other Observations (Non-blocking)

C++ binding changes look correct. swizzled_input_sf is consistently threaded through runMoe, runMoeMinLatency, and both TVM GetFunction lambdas (run_moe and run_moe_min_latency). Removing the internal bool const swizzled_input_sf = true hardcodes is the right approach.

Backward compatibility is properly handled. The default value of True in all signatures ensures no existing callers are broken.

Test design is sound. Using rtol=0, atol=0 is appropriate: swizzling is purely a memory layout transformation, so the kernel output should be bit-exact regardless of which layout is used. The assertion torch.equal(hidden_states_swizzled, hidden_states_linear) verifying the quantized values are identical is a good sanity check.

Minor nit: The round_up lambda in the test is used 6 times; a named function would be slightly more readable, but this is low priority for a test file.
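The positional-shift failure mode described in the review above can be reproduced in isolation. This is a minimal sketch with hypothetical parameter names that mirror the report, not flashinfer's real signature:

```python
# Inner-op signature after the change: swizzled_input_sf inserted mid-signature.
def run_v2(swiglu_limit=None, swizzled_input_sf=True, tp_size=1, tp_rank=0):
    return {"swizzled_input_sf": swizzled_input_sf,
            "tp_size": tp_size,
            "tp_rank": tp_rank}

# A caller written against the old order (swiglu_limit, tp_size, tp_rank):
shifted = run_v2(None, 4, 2)              # caller meant tp_size=4, tp_rank=2
assert shifted["swizzled_input_sf"] == 4  # tp_size silently lands in the flag slot
assert shifted["tp_size"] == 2            # every later argument shifts by one

# Adding swizzled_input_sf as a keyword on top of the shifted positionals
# produces the exact TypeError quoted in the fix commit.
try:
    run_v2(None, 4, 2, swizzled_input_sf=False)
    raise AssertionError("expected TypeError")
except TypeError as e:
    assert "multiple values" in str(e)
```

The silent-shift case is the dangerous one: it raises nothing and simply misconfigures the call, which is why the reviewers pushed for appending new parameters at the end or passing them keyword-only.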
- Replace torch.cuda.get_device_capability() check with is_sm100a_supported() helper for consistent GPU capability detection across the test suite
- Replace round_up lambda with def to fix Ruff E731
- Add assertion that input_sf buffers differ between swizzled/linear layouts so the test cannot pass trivially if fp4_quantize ignores is_sf_swizzled_layout
Code Review

This PR cleanly exposes `swizzled_input_sf`.

Correctness

Parameter threading is correct; the boolean propagates through all four call sites. Reusing the same variable name at each layer keeps the threading easy to audit.

API Design

Parameter placement is inconsistent between the public and internal signatures, but backward compatibility is maintained: the default is True throughout.

Test

The test structure is solid:

```python
# Guard against trivial pass
assert not torch.equal(input_sf_swizzled, input_sf_linear), ...
```

This sanity check is important and good practice. Concern: test tolerances. Missing coverage: the min-latency path. Minor nit: a pre-existing issue (not introduced by this PR).

Overall: the change is clean, the default maintains backward compatibility, and the approach is correct. The main asks are around test tolerances and coverage of the min-latency path in the docstring/comment.
/bot run
[FAILED] Pipeline #47094985: 6/20 passed
…_moe signature

The inner cutlass_fused_moe (registered via torch.library.custom_op) has swizzled_input_sf at position 15 (after swiglu_limit), but the outer wrapper was passing tp_size at that position and then also passing swizzled_input_sf as a keyword argument, causing:

TypeError: cutlass_fused_moe() got multiple values for argument 'swizzled_input_sf'

Fix: insert swizzled_input_sf into the positional argument list between swiglu_limit and tp_size, and remove the duplicate keyword argument.

AI-assisted
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/bot run
Code Review

Overall this is a clean, focused change that exposes a previously hardcoded parameter. Backward-compatible and well-documented. A few observations below.

Correctness

Both execution paths are covered.

Minor Issues

Parameter position inconsistency (non-blocking) between the internal helpers and the public API. Test import ordering should be fixed. A pre-existing typo (not introduced here) remains.

Test Coverage

The test structure is solid: it runs both swizzled and linear layouts and asserts numerical closeness. The explicit sanity check that the two SF buffers actually differ is a nice touch; it prevents the test from passing trivially if the quantizer ignores the layout flag.

Optional enhancement: add coverage for the min-latency path.

Summary

The change is correct and well-tested. The only actionable items are the import ordering nit and optional min-latency coverage. LGTM with the minor import fix.
[FAILED] Pipeline #47128436: 12/20 passed
jiahanc left a comment: LGTM, thanks for the contribution!
Summary
Closes #2200
Generated with Claude Code
Summary by CodeRabbit
New Features
API Changes
Tests