feat: Fuse shared experts into trtllm_gen moe (fp8) #2625
nv-yunzheq wants to merge 9 commits into flashinfer-ai:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request integrates fused shared experts into the existing TensorRT-LLM MoE framework, particularly for FP8 operations. The primary goal is to optimize the handling of experts that are shared across multiple tokens by incorporating them directly into the MoE kernel's routing and execution logic. This change affects how memory is allocated, how routing decisions are made, and how the overall MoE computation is performed, leading to more streamlined and potentially faster processing of MoE layers with shared components.
Activity
📝 Walkthrough

This PR adds support for fused shared experts by threading a `num_fused_shared_experts` parameter through the Python bindings, kernel launcher, routing kernels, and MoE runner.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client
    participant Py as Python Binding
    participant Launcher as Kernel Launcher
    participant Routing as Routing Kernel
    participant Runner as MoE Runner
    participant GEMM as GEMM Kernels

    Client->>Py: call trtllm_fp8_block_scale_moe(..., num_fused_shared_experts=N)
    Py->>Launcher: forward args including N
    Launcher->>Launcher: compute totalExpertsPerToken = top_k + N, totalLocalExperts = local_num_experts + N, allocate workspaces
    Launcher->>Routing: routing_runner.run(..., numFusedSharedExpert=N, ...)
    Routing->>Routing: emit routed experts + fused shared expert indices/weights (expanded topK)
    Routing-->>Launcher: routing outputs (expanded indices, counts)
    Launcher->>Runner: moe_runner.run(..., topK=top_k+N, localNumExperts=local+N, ...)
    Runner->>GEMM: launch PermuteGemm1/Gemm2 with fused-aware dimensions
    GEMM-->>Runner: results
    Runner-->>Launcher: aggregated MoE output
    Launcher-->>Py: return result
    Py-->>Client: deliver output
```
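The launcher-side bookkeeping in the diagram (totalExpertsPerToken = top_k + N, totalLocalExperts = local_num_experts + N) can be sketched in plain Python. The function and names below are illustrative stand-ins, not the actual flashinfer API:

```python
def expanded_moe_dims(num_tokens, top_k, local_num_experts, num_fused_shared_experts):
    """Sketch of the launcher-side size math described above.

    All names here are hypothetical; the real launcher is C++ code in
    csrc/trtllm_fused_moe_kernel_launcher.cu.
    """
    total_experts_per_token = top_k + num_fused_shared_experts
    total_local_experts = local_num_experts + num_fused_shared_experts
    # Routing outputs are widened so each token carries its routed picks
    # plus one slot per fused shared expert.
    expert_indices_shape = (num_tokens, total_experts_per_token)
    expert_weights_shape = (num_tokens, total_experts_per_token)
    return (total_experts_per_token, total_local_experts,
            expert_indices_shape, expert_weights_shape)
```

With N = 0 this degenerates to the original unfused shapes, which is why the feature is backward compatible.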
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
/bot run
Code Review
This pull request introduces the fusion of shared experts into the trtllm_gen MoE implementation, specifically for FP8. The changes cover the routing kernel, the launcher, and the Python API. While the integration logic for shared experts is mostly sound, there are a few critical issues regarding histogram initialization and template dispatching in the routing kernel that could lead to undefined behavior or incorrect results in multi-GPU or large-token scenarios.
```cpp
if (data.mNumFusedSharedExperts > 0) {
  data.mNumExperts += data.mNumFusedSharedExperts;
  data.mTopK += data.mNumFusedSharedExperts;
  data.mNumLocalExperts += data.mNumFusedSharedExperts;
}
```
Updating `data.mNumExperts` and `data.mTopK` after the first kernel launch (line 656 or 662) leads to several issues:

- `numThreadsMain` (line 655) and the histogram initialization inside `routingMainKernel` (line 85) use the original routed expert count, meaning the histogram entries for shared experts are never initialized to zero. This can cause garbage values to be used as offsets in subsequent permutation kernels.
- The dispatching macro `LAUNCH_ROUTING_DEEPSEEK` uses `data.mNumExperts` to select the `MaxNumExperts` template parameter. If the total expert count (routed + shared) crosses a threshold (e.g., 256 to 257), the first and second launches will use different template instantiations, which is inconsistent.

You should calculate the total expert count and top-k at the beginning of `runImpl` and ensure that initialization kernels use the total count, while `routingMainKernel` receives the routed count for its indexing logic.
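The ordering hazard can be modeled on the host side. The sketch below uses a Python list as a stand-in for uninitialized device memory; the names mirror the kernel but are hypothetical:

```python
GARBAGE = -999  # stands in for uninitialized device memory


def run_routing(num_experts, num_fused, mutate_before_init):
    """Model of the init-then-mutate ordering bug described above.

    Returns True iff every histogram slot later touched by atomicAdd
    was zero-initialized first.
    """
    # Device buffer sized for the fused total; starts uninitialized.
    hist = [GARBAGE] * (2 * (num_experts + num_fused))
    if mutate_before_init:
        num_experts += num_fused  # fixed order: grow counts first
    # routingInitExpertCounts analogue: zero 2 * num_experts entries.
    for i in range(2 * num_experts):
        hist[i] = 0
    if not mutate_before_init:
        num_experts += num_fused  # buggy order: grow counts after init
    # Later kernels atomicAdd into all 2 * num_experts slots.
    return all(h != GARBAGE for h in hist[: 2 * num_experts])
```

With the buggy ordering the fused-shared slots keep their garbage contents; with the mutation hoisted before the init launch every slot the permutation kernels read is zeroed.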
```cpp
FLASHINFER_CHECK(data.mNumFusedSharedExperts <= WarpSize,
                 "Number of fused shared experts (%d) must be less than warp size.",
                 data.mNumFusedSharedExperts);
```
The check for mNumFusedSharedExperts <= WarpSize is currently placed inside the if (data.mNumExpertGroups > 1) block. However, routingMainKernel always assumes that shared experts can be handled by a single warp (using laneIdx), regardless of whether expert groups are used. This check should be moved outside the conditional block to ensure it is always enforced.
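The single-warp constraint is easy to see with a small model of the lane-guarded write. `WARP_SIZE` and the function below are illustrative, not code from the PR:

```python
WARP_SIZE = 32  # CUDA warp size on current NVIDIA GPUs


def fused_shared_writes(num_fused_shared_experts):
    """Model of the lane-guarded write `if (laneIdx < mNumFusedSharedExperts)`.

    A warp has only WARP_SIZE lanes, so at most WARP_SIZE fused shared
    experts can ever be written; the rest are silently skipped.
    """
    written = [lane for lane in range(WARP_SIZE)
               if lane < num_fused_shared_experts]
    skipped = max(0, num_fused_shared_experts - WARP_SIZE)
    return len(written), skipped
```

This is why the `FLASHINFER_CHECK` must run on every code path that performs the lane-indexed write, not only when expert groups are in use.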
```python
weight_layout=weight_layout,
do_finalize=do_finalize,
enable_pdl=enable_pdl,
num_fused_shared_experts=num_fused_shared_experts,
```
The num_fused_shared_experts parameter should be included in the instance_key used by the MoERunner (around line 1045). Since the kernel's performance and configuration depend on the total number of experts (routed + shared), omitting this from the key might lead to the autotuner returning a suboptimal tactic if multiple calls with different shared expert counts are made.
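A sketch of the suggested keying, assuming the MoERunner caches tuned tactics in a dict keyed by a tuple (the exact key fields in flashinfer may differ):

```python
def make_instance_key(num_experts, top_k, intermediate_size,
                      num_fused_shared_experts):
    """Hypothetical autotuner cache key. Including num_fused_shared_experts
    keeps a tactic tuned for top_k + N from being reused for a different N."""
    return (num_experts, top_k, intermediate_size, num_fused_shared_experts)


# Two calls that differ only in the fused shared expert count now get
# distinct cache entries instead of sharing one possibly suboptimal tactic.
tactic_cache = {}
tactic_cache[make_instance_key(256, 8, 2048, 0)] = "tactic_a"
tactic_cache[make_instance_key(256, 8, 2048, 1)] = "tactic_b"
```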
Actionable comments posted: 3
⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_kernel_launcher.cu (1)
810-821: ⚠️ Potential issue | 🟠 Major: Shape validation for precomputed routing with fused shared experts is inconsistent.

The shape check at line 818 validates `expert_indices.size(1) == args->top_k`, but when fused shared experts are enabled, precomputed indices should account for the additional fused entries. At line 892, `totalExpertsPerToken` is calculated as `args->top_k + args->num_fused_shared_experts`, and the `expert_weights` tensor is allocated with this dimension (line 897). If precomputed routing is used alongside fused shared experts, the shape validation should check `expert_indices.size(1) == totalExpertsPerToken` instead of just `args->top_k` to ensure consistency with the routing output tensors.
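A minimal sketch of the corrected validation, using plain Python shape tuples in place of the launcher's tensor checks (names are illustrative):

```python
def check_precomputed_routing(expert_indices_shape, top_k,
                              num_fused_shared_experts):
    """Validate that precomputed routing indices are as wide as the
    expanded per-token expert count, not just top_k."""
    expected_per_token = top_k + num_fused_shared_experts
    if expert_indices_shape[1] != expected_per_token:
        raise ValueError(
            f"expert_indices must have {expected_per_token} columns "
            f"(top_k + num_fused_shared_experts), got {expert_indices_shape[1]}"
        )
    return expected_per_token
```

When `num_fused_shared_experts` is zero this reduces to the existing `top_k` check, so the stricter validation is backward compatible.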
📒 Files selected for processing (7)

- csrc/trtllm_fused_moe_kernel_launcher.cu
- csrc/trtllm_fused_moe_routing_deepseek.cu
- csrc/trtllm_fused_moe_runner.cu
- flashinfer/fused_moe/core.py
- include/flashinfer/trtllm/fused_moe/RoutingKernel.h
- include/flashinfer/trtllm/fused_moe/runner.h
- tests/moe/test_trtllm_gen_fused_moe.py
```cpp
FLASHINFER_CHECK(data.mNumFusedSharedExperts <= WarpSize,
                 "Number of fused shared experts (%d) must be less than warp size.",
                 data.mNumFusedSharedExperts);
```
fusedSharedExperts <= WarpSize check should be unconditional.
This validation is guarded by if (data.mNumExpertGroups > 1) (line 605), but the fused shared expert writes at lines 261-265 and 272-274 use laneIdx < mNumFusedSharedExperts regardless of expert groups. If mNumExpertGroups <= 1 and mNumFusedSharedExperts > WarpSize, the writes would silently skip some fused experts.
Suggested fix: move the check out of the `if (data.mNumExpertGroups > 1)` block:

```diff
+ FLASHINFER_CHECK(data.mNumFusedSharedExperts <= WarpSize,
+                  "Number of fused shared experts (%d) must be less than warp size.",
+                  data.mNumFusedSharedExperts);
+
  if (data.mNumExpertGroups > 1) {
    FLASHINFER_CHECK(data.mNumExpertGroups <= MaxNumGroups, ...);
    ...
-
-   FLASHINFER_CHECK(data.mNumFusedSharedExperts <= WarpSize,
-                    "Number of fused shared experts (%d) must be less than warp size.",
-                    data.mNumFusedSharedExperts);
  }
```
```cpp
/// For fused shared expert
int32_t mNumFusedSharedExperts;
int32_t mSharedExpertTokenOffset;
int32_t mSharedExpertNumTokens;
int32_t mTotalExpertsPerToken;
```
Initialize fused-shared expert metadata fields to safe defaults.
These new DataBase members are currently uninitialized, so any caller that forgets to set them will propagate garbage into kernel params and risk incorrect routing / OOB indexing. Mirror the KernelParamsBase defaults.
🛠️ Suggested fix

```diff
- int32_t mNumFusedSharedExperts;
- int32_t mSharedExpertTokenOffset;
- int32_t mSharedExpertNumTokens;
- int32_t mTotalExpertsPerToken;
+ int32_t mNumFusedSharedExperts{0};
+ int32_t mSharedExpertTokenOffset{0};
+ int32_t mSharedExpertNumTokens{0};
+ int32_t mTotalExpertsPerToken{0};
```
[FAILED] Pipeline #44669282: 13/20 passed
Force-pushed 259d279 to 2255bca
/bot run
Actionable comments posted: 1
⚠️ Outside diff range comments (3)
flashinfer/fused_moe/core.py (1)
1804-1836: ⚠️ Potential issue | 🟠 Major: Fake op signature mismatch: missing `num_fused_shared_experts` parameter.

The `_fake_trtllm_fp8_block_scale_moe` function signature must exactly mirror the real op `trtllm_fp8_block_scale_moe_op`. The real op has `num_fused_shared_experts: int = 0` at line 1661, but the fake op is missing this parameter. This will cause issues with torch.compile or other tracing scenarios.

Suggested fix

```diff
 @register_fake_op("flashinfer::trtllm_fp8_block_scale_moe")
 def _fake_trtllm_fp8_block_scale_moe(
     routing_logits: Optional[torch.Tensor],
     topk_ids: Optional[torch.Tensor],
     expert_weights: Optional[torch.Tensor],
     routing_bias: Optional[torch.Tensor],
     hidden_states: torch.Tensor,
     hidden_states_scale: torch.Tensor,
     gemm1_weights: torch.Tensor,
     gemm1_weights_scale: torch.Tensor,
     gemm2_weights: torch.Tensor,
     gemm2_weights_scale: torch.Tensor,
     output: torch.Tensor,
     num_experts: int,
     top_k: int,
     n_group: Optional[int],
     topk_group: Optional[int],
     intermediate_size: int,
     local_expert_offset: int,
     local_num_experts: int,
     routed_scaling_factor: Optional[float],
     routing_method_type: int = 0,
     use_shuffled_weight: bool = False,
     weight_layout: int = 0,
     do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
     tune_max_num_tokens: int = 8192,
     fp8_quantization_type: Fp8QuantizationType = Fp8QuantizationType.DeepSeekFp8,
+    num_fused_shared_experts: int = 0,
 ) -> List[torch.Tensor]:
```

Based on learnings: "When reviewing files that define fake ops decorated with register_fake_op (e.g., in flashinfer/fused_moe/*), ensure the function signatures exactly mirror the real op they stand in for."
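One way to guard against this class of drift is to compare the two signatures directly with `inspect.signature`. The toy ops below are stand-ins for the real flashinfer functions:

```python
import inspect


def real_op(hidden_states, top_k: int, num_fused_shared_experts: int = 0):
    """Stand-in for trtllm_fp8_block_scale_moe_op (hypothetical)."""
    return []


def fake_op(hidden_states, top_k: int, num_fused_shared_experts: int = 0):
    """Stand-in for the fake op; mirrors real_op exactly."""
    return []


def fake_op_missing_param(hidden_states, top_k: int):
    """A fake op that drifted out of sync with the real op."""
    return []


def signatures_match(a, b) -> bool:
    # inspect.signature equality compares parameter names, order,
    # defaults, and annotations.
    return inspect.signature(a) == inspect.signature(b)
```

A unit test asserting `signatures_match(real_op, fake_op)` would have caught this mismatch before tracing did.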
csrc/trtllm_fused_moe_kernel_launcher.cu (2)
1857-1874: ⚠️ Potential issue | 🟠 Major: Validate `num_fused_shared_experts` before using it in size math.

The new FFI parameter is folded directly into `totalExpertsPerToken` and `totalLocalExperts`. A negative value can drive those counts to zero or below and break tile selection/workspace sizing before any lower-layer routing checks run.
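A defensive check along these lines would reject bad values at the boundary. The real launcher is C++; this Python sketch with a hypothetical helper name just illustrates the clamp-and-reject pattern:

```python
def validate_num_fused_shared_experts(value):
    """Reject negative counts before they enter the
    totalExpertsPerToken / totalLocalExperts size math."""
    n = int(value) if value is not None else 0  # optional FFI arg defaults to 0
    if n < 0:
        raise ValueError(f"num_fused_shared_experts must be >= 0, got {n}")
    return n
```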
919-929: ⚠️ Potential issue | 🟠 Major: Precomputed routing tensors still use the old `top_k` width.

Only the internally allocated `expert_weights` buffer is widened to `top_k + num_fused_shared_experts`. If the caller provides precomputed `expert_indices`/`expert_weights`, this path still follows the old `top_k` contract elsewhere, so fused-shared precomputed routing will either reject correctly sized tensors or consume too few columns during finalize.
♻️ Duplicate comments (2)
csrc/trtllm_fused_moe_routing_deepseek.cu (2)
620-623: ⚠️ Potential issue | 🟡 Minor: `mNumFusedSharedExperts <= WarpSize` check should be unconditional.

This validation is guarded by `if (data.mNumExpertGroups > 1)` (line 609), but the fused shared expert writes at lines 261-265 and 272-274 use `laneIdx < mNumFusedSharedExperts` regardless of expert groups. If `mNumExpertGroups <= 1` and `mNumFusedSharedExperts > WarpSize`, the writes would silently skip some fused experts.

Suggested fix: move the check out of the `if (data.mNumExpertGroups > 1)` block so that it always runs.
666-676: ⚠️ Potential issue | 🟠 Major: Expert count histogram not initialized for fused shared expert indices.

The `routingInitExpertCounts` kernel (lines 666-669) initializes `2 * data.mNumExperts` elements using the pre-mutation value. After the kernel completes, `data.mNumExperts` is incremented at lines 673-675 to include `mNumFusedSharedExperts`. Subsequent kernels (lines 678+) use the mutated value but access uninitialized histogram slots for indices `[original_mNumExperts, original_mNumExperts + mNumFusedSharedExperts)`. This causes atomicAdd operations to accumulate into uninitialized values for fused shared expert slots.

Suggested fix: either move the mutation before the histogram initialization or expand the initialization range:

```diff
+ if (data.mNumFusedSharedExperts > 0) {
+   data.mNumExperts += data.mNumFusedSharedExperts;
+   data.mTopK += data.mNumFusedSharedExperts;
+   data.mNumLocalExperts += data.mNumFusedSharedExperts;
+ }
+
  if (data.mPtrTopKIds == nullptr) {
    ...
  } else {
    // Reset the global histograms.
    LAUNCH_ROUTING_DEEPSEEK(data, false, routingInitExpertCounts,
                            (2 * data.mNumExperts - 1) / numThreadsHist + 1,
                            numThreadsHist, /*smemSize=*/0, stream,
                            data.mNumExpertGroups > 1);
  }
- if (data.mNumFusedSharedExperts > 0) {
-   data.mNumExperts += data.mNumFusedSharedExperts;
-   data.mTopK += data.mNumFusedSharedExperts;
-   data.mNumLocalExperts += data.mNumFusedSharedExperts;
- }
```
🧹 Nitpick comments (1)
flashinfer/fused_moe/core.py (1)
1757-1759: Redundant None check for `num_fused_shared_experts`.

Since `num_fused_shared_experts` is typed as `int = 0` at line 1661 (not `Optional[int]`), the None check on line 1759 is unnecessary. The parameter can never be `None` at this point.

Suggested simplification

```diff
- _nfse = num_fused_shared_experts if num_fused_shared_experts is not None else 0
+ _nfse = num_fused_shared_experts
```
```cpp
int32_t const numDevices = (localNumExperts > 0) ? numExperts / localNumExperts : 1;
int32_t const deviceIndex = (localNumExperts > 0) ? localExpertOffset / localNumExperts : 0;
int32_t const baseTokensPerDevice = numTokens / numDevices;
int32_t const remainingTokens = numTokens % numDevices;

if (deviceIndex < remainingTokens) {
  routingData.mSharedExpertTokenOffset = (baseTokensPerDevice + 1) * deviceIndex;
  routingData.mSharedExpertNumTokens = baseTokensPerDevice + 1;
} else {
  routingData.mSharedExpertTokenOffset = remainingTokens + deviceIndex * baseTokensPerDevice;
  routingData.mSharedExpertNumTokens = baseTokensPerDevice;
}
```
Shared-expert token partition assumes uniform expert shards.

`numDevices = numExperts / localNumExperts` and `deviceIndex = localExpertOffset / localNumExperts` are only correct when every rank owns the same routed-expert count. The visible checks here only require `localExpertOffset + localNumExperts <= numExperts`, so uneven sharding will compute the wrong `mSharedExpertTokenOffset`/`mSharedExpertNumTokens` range and route fused shared experts against the wrong token slice. A fix would derive the device index from a cumulative prefix of the actual per-rank expert counts instead of a division.
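A prefix-sum version of the partition, sketched in Python under the assumption that per-rank expert counts are available (the real code is C++ and would need these counts plumbed in; all names are illustrative):

```python
def shared_expert_token_slice(per_rank_expert_counts, local_expert_offset,
                              num_tokens):
    """Partition tokens across ranks for shared-expert work, tolerating
    uneven expert sharding.

    per_rank_expert_counts[i] is the number of routed experts owned by
    rank i; local_expert_offset must fall on a rank boundary.
    Returns (token_offset, num_tokens_for_this_rank).
    """
    # Find this rank from the cumulative expert-count prefix instead of
    # dividing by a (possibly non-uniform) localNumExperts.
    prefix, device_index = 0, None
    for i, count in enumerate(per_rank_expert_counts):
        if prefix == local_expert_offset:
            device_index = i
            break
        prefix += count
    if device_index is None:
        raise ValueError("local_expert_offset is not a rank boundary")

    # Same remainder-spreading token split as the original code, but with
    # the correct device index.
    num_devices = len(per_rank_expert_counts)
    base, rem = divmod(num_tokens, num_devices)
    if device_index < rem:
        return (base + 1) * device_index, base + 1
    return rem + device_index * base, base
```

With uniform shards this matches the division-based math; with uneven shards such as `[8, 4, 8]`, the division `localExpertOffset / localNumExperts` would misidentify the rank while the prefix walk does not.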
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/trtllm_fused_moe_runner.cu` around lines 106 - 117, The partition math
wrongly assumes uniform shards — replace the division-based computation
(numDevices, deviceIndex derived from numExperts / localNumExperts and
localExpertOffset / localNumExperts) with logic that computes device boundaries
from actual per-rank expert counts: build the cumulative expert-count prefix
(using the actual localNumExperts for each device/rank) to find the device index
and the exact token-offset/length for routingData.mSharedExpertTokenOffset and
routingData.mSharedExpertNumTokens; ensure you use numTokens scaled by each
device's expert count slice (not simple baseTokensPerDevice/remainingTokens
across a uniform numDevices), and reference localExpertOffset, localNumExperts,
numExperts when mapping into the cumulative ranges so uneven sharding yields
correct offsets and lengths.
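For illustration only, a hedged Python sketch of the cumulative-prefix lookup the prompt describes (hypothetical helper, not the actual C++ fix): given each rank's real expert count, the owning device index is found from a prefix sum instead of a division, so uneven shards map correctly.

```python
import bisect

def device_index_from_offset(per_rank_expert_counts, local_expert_offset):
    # Build the cumulative expert-count prefix, then find the rank whose
    # [prefix[i], prefix[i+1]) range contains local_expert_offset.
    prefix = [0]
    for count in per_rank_expert_counts:
        prefix.append(prefix[-1] + count)
    return bisect.bisect_right(prefix, local_expert_offset) - 1

# Uneven sharding: ranks own 3, 2, and 3 experts (offsets 0, 3, 5).
print(device_index_from_offset([3, 2, 3], 4))  # 1
```

A uniform-shard division (`local_expert_offset // local_num_experts`) would misplace rank boundaries here, which is the failure mode the comment flags.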
|
[FAILED] Pipeline #45731067: 8/20 passed
Merged origin/main into nv-yunzheq/DSR1_shared_expert_fusion. Resolved conflicts:
- csrc/trtllm_fused_moe_kernel_launcher.cu: merged signature (num_fused_shared_experts + act_type/norm_topk_prob), combined expert_weights allocation logic
- csrc/trtllm_fused_moe_runner.cu: merged signature, adopted routingCustom framework with shared expert guard
- csrc/trtllm_fused_moe_routing_deepseek.cu: git rm (deleted on main, needs manual port)
- flashinfer/fused_moe/core.py: merged all new params in signatures and call sites
- tests/moe/test_trtllm_gen_fused_moe.py: merged test param additions

Co-Authored-By: Claude <noreply@anthropic.com>
The PR's shared expert fusion changes were made to the old csrc/trtllm_fused_moe_routing_deepseek.cu, which was deleted on main and relocated to csrc/fused_moe/trtllm_backend/. This commit ports those changes to the new location:
- routingMainKernel: use mTotalExpertsPerToken stride for TopK output, write shared expert indices (weight=1.0) after routed experts
- run(): compute adjusted topK/numExperts accounting for shared experts, add bounds check (mNumFusedSharedExperts <= WarpSize), bump data.mNumExperts/mTopK/mNumLocalExperts post-routing so the permutation pipeline sees the full expanded expert set
- Adjust single-cluster threshold, maxTokensCoop, and expandedIdxSize to use the shared-expert-adjusted topK

Co-Authored-By: Claude <noreply@anthropic.com>
/bot run
@nv-yunzheq would you check the merge conflict resolution if possible? Also, how about nvfp4? Do you have a pointer?
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (2)
2767-2796: ⚠️ Potential issue | 🔴 Critical: Missing `num_fused_shared_experts` argument causes parameter misalignment.

The `trtllm_fp8_block_scale_routed_moe` function call is missing the `num_fused_shared_experts` argument at position 27. This causes all subsequent arguments to be passed to the wrong parameters:
- `activation_type` (e.g., 3) is passed to `num_fused_shared_experts`
- `True` is passed to `activation_type` (coerced to 1)
- `norm_topk_prob` is not passed (uses default)

This will cause incorrect MoE computation when the `activation_type` value is interpreted as a shared expert count.

🐛 Proposed fix to add missing argument

 result = get_trtllm_moe_sm100_module().trtllm_fp8_block_scale_moe(
     None,  # routing_logits
     topk_ids,
     None,  # expert_weights
     routing_bias,
     hidden_states,
     hidden_states_scale,
     gemm1_weights,
     gemm1_weights_scale,
     gemm2_weights,
     gemm2_weights_scale,
     output,
     num_experts,
     top_k,
     n_group,
     topk_group,
     intermediate_size,
     local_expert_offset,
     local_num_experts,
     routed_scaling_factor,
     routing_method_type,
     use_shuffled_weight,
     weight_layout,
     do_finalize,
     enable_pdl,
     tune_max_num_tokens,
     fp8_quantization_type,
+    0,  # num_fused_shared_experts: not supported for pre-routed MoE
     activation_type,
     True,  # norm_topk_prob: not used for pre-computed routing
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/core.py` around lines 2767 - 2796, The call to get_trtllm_moe_sm100_module().trtllm_fp8_block_scale_moe is missing the num_fused_shared_experts argument, causing parameter misalignment (activation_type and norm_topk_prob are shifted); update the call to include the proper num_fused_shared_experts value (same type/variable used elsewhere for fused shared experts) inserted immediately before the activation_type argument so subsequent parameters (activation_type, norm_topk_prob) line up with the function signature.
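A toy Python sketch (hypothetical names, not the real binding) of the positional shift described above: dropping one positional argument silently slides every later argument one slot to the left, and a bool is happily accepted where an int was expected.

```python
def moe_call(top_k, num_fused_shared_experts=0, activation_type=1, norm_topk_prob=True):
    # Stand-in for the real op; returns the last three params so the
    # misalignment is visible.
    return (num_fused_shared_experts, activation_type, norm_topk_prob)

# Caller forgets num_fused_shared_experts and passes activation_type=3, True:
bad = moe_call(8, 3, True)
print(bad)   # (3, True, True): 3 lands in num_fused_shared_experts,
             # True is silently accepted as activation_type

# Inserting the missing 0 restores the intended mapping:
good = moe_call(8, 0, 3, True)
print(good)  # (0, 3, True)
```

This is why fake-op and call-site signatures must stay in lockstep with the C++ op ordering.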
1823-1853: ⚠️ Potential issue | 🔴 Critical: Fake op signature missing `num_fused_shared_experts` parameter.

The `_fake_trtllm_fp8_block_scale_moe` function is missing the `num_fused_shared_experts` parameter that was added to `trtllm_fp8_block_scale_moe_op` at line 1658. Fake ops must exactly mirror the real op signatures for torch.compile/tracing to work correctly.

🐛 Proposed fix to add missing parameter

 @register_fake_op("flashinfer::trtllm_fp8_block_scale_moe")
 def _fake_trtllm_fp8_block_scale_moe(
     routing_logits: Optional[torch.Tensor],
     topk_ids: Optional[torch.Tensor],
     expert_weights: Optional[torch.Tensor],
     routing_bias: Optional[torch.Tensor],
     hidden_states: torch.Tensor,
     hidden_states_scale: torch.Tensor,
     gemm1_weights: torch.Tensor,
     gemm1_weights_scale: torch.Tensor,
     gemm2_weights: torch.Tensor,
     gemm2_weights_scale: torch.Tensor,
     output: torch.Tensor,
     num_experts: int,
     top_k: int,
     n_group: Optional[int],
     topk_group: Optional[int],
     intermediate_size: int,
     local_expert_offset: int,
     local_num_experts: int,
     routed_scaling_factor: Optional[float],
     routing_method_type: int = 0,
     use_shuffled_weight: bool = False,
     weight_layout: int = 0,
     do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
     tune_max_num_tokens: int = 8192,
     fp8_quantization_type: Fp8QuantizationType = Fp8QuantizationType.DeepSeekFp8,
+    num_fused_shared_experts: int = 0,
     activation_type: int = ActivationType.Swiglu.value,
     norm_topk_prob: bool = True,
 ) -> List[torch.Tensor]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/core.py` around lines 1823 - 1853, The fake op _fake_trtllm_fp8_block_scale_moe must include the new parameter num_fused_shared_experts to exactly mirror the real op signature; update the function signature to add num_fused_shared_experts with the same type and default as trtllm_fp8_block_scale_moe_op (e.g., int or Optional[int] matching the real op) and propagate that new parameter in the fake-op declaration so torch.compile/tracing sees an identical signature (no additional logic changes needed inside the function).
🧹 Nitpick comments (2)
flashinfer/fused_moe/core.py (1)

1774-1792: Redundant None check for `num_fused_shared_experts`.

At line 1776, `_nfse = num_fused_shared_experts if num_fused_shared_experts is not None else 0` is redundant because `num_fused_shared_experts` is typed as `int = 0` at line 1658 and can never be None. You can simplify to use the parameter directly.

♻️ Suggested simplification

     num_fused_shared_experts=num_fused_shared_experts,
 )
-_nfse = num_fused_shared_experts if num_fused_shared_experts is not None else 0
 # Call the C++ function for block scale MoE
 intermediate_output = moe_op.trtllm_fp8_block_scale_moe(
     routing_logits,
     topk_ids,
     expert_weights,
     routing_bias,
     hidden_states,
     hidden_states_scale,
     gemm1_weights,
     gemm1_weights_scale,
     gemm2_weights,
     gemm2_weights_scale,
     output,
     num_experts,
     top_k,
-    _nfse,
+    num_fused_shared_experts,
     n_group,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/core.py` around lines 1774 - 1792, The local variable _nfse and its None-check are redundant because num_fused_shared_experts is declared as an int (default 0); remove the _nfse assignment and pass num_fused_shared_experts directly into the call to moe_op.trtllm_fp8_block_scale_moe (replace the _nfse argument with num_fused_shared_experts), and delete the unused _nfse variable to simplify core.py around the block that constructs intermediate_output.

csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu (1)
555-560: Explain why the shared-expert expansion mutates `Data` in place.

This hot-path mutation is easy to misread because `launchMainKernel()` consumes the pre-fusion counts and the remaining launches consume the expanded counts from the same object. A brief note on why this was chosen instead of staging a copied `Data` / separate post-pass would make the trade-off much clearer.

As per coding guidelines, "For performance-critical hot paths, leave comments with justification for special algorithmic choices and mention alternative approaches considered".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu` around lines 555 - 560, The in-place mutation of Data (incrementing data.mNumExperts, data.mTopK, data.mNumLocalExperts when data.mNumFusedSharedExperts > 0) is confusing because launchMainKernel() uses the pre-fusion counts while subsequent pipeline stages expect expanded counts; update the source by adding a concise comment immediately above this block explaining why we mutate Data in place for performance (avoid copying Data or making a separate post-pass), note that launchMainKernel() intentionally consumes the original counts and the permutation pipeline requires the expanded counts, and briefly mention the considered alternatives (cloning Data or staging a post-pass) and why they were rejected for this hot path.
📒 Files selected for processing (5)
csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
csrc/moe_utils_binding.cu
csrc/trtllm_fused_moe_kernel_launcher.cu
csrc/trtllm_fused_moe_runner.cu
flashinfer/fused_moe/core.py
🚧 Files skipped from review as they are similar to previous changes (2)
- csrc/trtllm_fused_moe_kernel_launcher.cu
- csrc/trtllm_fused_moe_runner.cu
if (laneIdx < params.mNumFusedSharedExperts && params.mPtrTopKPacked != nullptr) {
  PackedScoreIdx<OutputT> packedScore{static_cast<OutputT>(1.0F),
                                      static_cast<int16_t>(params.mNumExperts + laneIdx)};
  params.mPtrTopKPacked[idxShared] = packedScore;
}

if (laneIdx < params.mTopK && params.mPtrTopKWeights != nullptr &&
    params.mPtrTopKIds == nullptr) {
  params.mPtrTopKWeights[idxTopK] = finalScore;
}

if (laneIdx < params.mNumFusedSharedExperts && params.mPtrTopKWeights != nullptr) {
  params.mPtrTopKWeights[idxShared] = static_cast<OutputT>(1.0F);
}
Zero the extra mPtrExpertCounts slots before emitting shared experts.
These appended expert IDs flow through the same non-single-cluster permutation path as routed experts, but routingMainKernel still clears only 2 * params.mNumExperts entries in mPtrExpertCounts (Lines 165-170). The shared-expert count/offset slots therefore start from stale device memory and can corrupt the histogram/prefix-scan for large-token runs.
🐛 Companion fix in the earlier expert-count initialization
if (params.mPtrExpertCounts) {
int32_t globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
int32_t globalThreadStride = gridDim.x * blockDim.x;
- int32_t expertCountsNum = 2 * params.mNumExperts;
+ int32_t expertCountsNum = 2 * (params.mNumExperts + params.mNumFusedSharedExperts);
initArr(globalThreadIdx, expertCountsNum, globalThreadStride, params.mPtrExpertCounts, 0);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu` around
lines 345 - 358, Before emitting appended shared-expert IDs, clear the
corresponding mPtrExpertCounts slots so stale device memory can't corrupt later
histogram/prefix-scan; in the same kernel code path that writes packed/shared
entries (where laneIdx, idxShared, params.mNumFusedSharedExperts,
params.mNumExperts, params.mPtrTopKPacked, params.mPtrTopKWeights are used) set
params.mPtrExpertCounts[params.mNumExperts + laneIdx] = 0 (guarded by laneIdx <
params.mNumFusedSharedExperts) so the extra expert-count/offset slots are zeroed
on-device prior to writing the shared-expert outputs (this complements the
earlier routingMainKernel initialization that only clears 2 * params.mNumExperts
entries).
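To see why the stale slots matter, here is a small host-side Python analogue (hypothetical, not the device code) of the histogram plus exclusive prefix scan: if the appended shared-expert count slot is not zeroed first, the scan's running total is inflated by whatever garbage the buffer held.

```python
def expert_write_offsets(expert_ids, counts):
    # Histogram into a caller-provided counts buffer, as the device code
    # does with mPtrExpertCounts; slots must be zeroed beforehand.
    for e in expert_ids:
        counts[e] += 1
    # Exclusive prefix scan: per-expert write offsets plus the grand total.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets, total

num_experts, num_fused_shared = 4, 1
ids = [0, 2, 2, 4, 4]            # expert 4 is the appended shared expert
stale = [0, 0, 0, 0, 7]          # shared-expert slot left uninitialized
zeroed = [0] * (num_experts + num_fused_shared)

_, total_stale = expert_write_offsets(ids, stale)
_, total_ok = expert_write_offsets(ids, zeroed)
print(total_ok, total_stale)     # 5 12: stale memory inflates the scan
```

The proposed companion fix (clearing `2 * (mNumExperts + mNumFusedSharedExperts)` entries) is the on-device equivalent of constructing `zeroed` here.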
int const numExperts = data.mNumExperts + data.mNumFusedSharedExperts;
int const topK = data.mTopK + data.mNumFusedSharedExperts;
int const numThreadsHist = getMaxNumExperts(numExperts);

FLASHINFER_CHECK(topK <= MaxSupportedTopExperts,
                 "Routing kernel expects topK experts <= %d, got %d", MaxSupportedTopExperts,
                 topK);
Guard the fused total against the DeepSeek dispatch ceiling.
Only the pre-fusion data.mNumExperts is validated against MaxSupportedExpertCount. If data.mNumExperts + data.mNumFusedSharedExperts exceeds 512, getMaxNumExperts(numExperts) returns 0 here, which breaks the later useSingleCluster / maxTokensCoop math and leaves the post-topK launches on unsupported expert tiers.
🛡️ Minimal guard before using the fused-aware total
int const numExperts = data.mNumExperts + data.mNumFusedSharedExperts;
int const topK = data.mTopK + data.mNumFusedSharedExperts;
+ if (data.mPtrPermutedIdxSize != nullptr) {
+ FLASHINFER_CHECK(numExperts <= MaxSupportedExpertCount,
+ "Permutation pipeline supports at most %d total experts, got %d",
+ MaxSupportedExpertCount, numExperts);
+ }
int const numThreadsHist = getMaxNumExperts(numExperts);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu` around
lines 540 - 546, Compute and validate the fused expert total before using it:
after calculating numExperts = data.mNumExperts + data.mNumFusedSharedExperts,
add a guard (via FLASHINFER_CHECK or equivalent) that numExperts <=
MaxSupportedExpertCount (or handle getMaxNumExperts returning 0) so
getMaxNumExperts(numExperts) cannot return 0; update the block around
numExperts/topK/numThreadsHist to validate the fused total (referencing
data.mNumExperts, data.mNumFusedSharedExperts, numExperts, getMaxNumExperts, and
MaxSupportedExpertCount) and ensure downstream logic that uses useSingleCluster
/ maxTokensCoop only runs for supported expert counts.
Fix function signature line wrapping in trtllm_fp8_block_scale_moe() to conform to the project's clang-format configuration. Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 2
📒 Files selected for processing (1)
csrc/trtllm_fused_moe_kernel_launcher.cu
  use_precomputed ? nullptr : args->routing_logits, args->routing_bias, args->num_tokens,
- args->num_experts, args->top_k, args->n_group, args->topk_group, args->local_expert_offset,
- args->local_num_experts, args->routed_scaling_factor, workspace.routing_expert_indexes,
+ args->num_experts, args->top_k, args->num_fused_shared_experts, args->n_group,
+ args->topk_group, args->local_expert_offset, args->local_num_experts,
+ args->routed_scaling_factor, workspace.routing_expert_indexes,
Update the precomputed-routing contract for fused shared experts.
This call now tells the routing runner to consume top_k + num_fused_shared_experts, but the precomputed path still validates expert_indices against top_k and never validates precomputed expert_weights. With fused shared experts enabled, top-k-wide precomputed buffers will be misinterpreted and can be indexed past their last column. Either reject precomputed routing when num_fused_shared_experts > 0, or validate both tensors against the fused width before calling the runner.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/trtllm_fused_moe_kernel_launcher.cu` around lines 1195 - 1198, The
precomputed-routing path is unsafe with fused shared experts because the runner
expects top_k + num_fused_shared_experts columns but precomputed validation only
checks against top_k; update the call site that uses use_precomputed,
args->num_fused_shared_experts, args->routing_logits and
workspace.routing_expert_indexes to either (a) reject precomputed routing when
args->num_fused_shared_experts > 0 by returning an error/setting
use_precomputed=false, or (b) validate both workspace.routing_expert_indexes and
args->routing_logits widths against (args->top_k +
args->num_fused_shared_experts) before invoking the routing runner and abort if
they are smaller; implement one of these fixes where the call is made so
precomputed buffers cannot be indexed past their last column.
int64_t const nFusedShared = num_fused_shared_experts.value_or(0);
int64_t const totalExpertsPerToken = top_k + nFusedShared;
int64_t const totalLocalExperts = local_num_experts + nFusedShared;
Make tile/config discovery fused-aware as well.
You compute fused totals here and store args->num_fused_shared_experts, but fallback tile resolution still buckets on raw top_k and local_num_experts. That leaves resolveMoeTileAndConfig(...), Fp8BlockScaleLauncher::getValidConfigs(...), and the exported trtllm_get_valid_moe_configs(...) surface describing a different problem shape than prepare_moe_common() validates, so cached tactics or the [-1, -1] fallback can become invalid for fused-shared runs.
Possible direction
- auto const [tile_N, config] = resolveMoeTileAndConfig(config_index, supported_tile_nums,
-                                                       num_tokens, top_k, local_num_experts);
+ auto const [tile_N, config] = resolveMoeTileAndConfig(config_index, supported_tile_nums,
+                                                       num_tokens, totalExpertsPerToken,
+                                                       totalLocalExperts);

Fp8BlockScaleLauncher::getValidConfigs(...) and trtllm_get_valid_moe_configs(...) need the same fused totals threaded through, otherwise autotuned configs will still be generated against the unfused dimensions.
Also applies to: 1985-1985, 2010-2011
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/trtllm_fused_moe_kernel_launcher.cu` around lines 1970 - 1972, The
tile/config discovery is still using unfused dimensions; update
resolveMoeTileAndConfig, Fp8BlockScaleLauncher::getValidConfigs, and
trtllm_get_valid_moe_configs to use the fused totals (totalExpertsPerToken and
totalLocalExperts) and the stored args->num_fused_shared_experts instead of raw
top_k and local_num_experts; thread these fused totals through any calls that
compute or cache tiles/configs, update the fallback resolution path to bucket on
the fused totals, and ensure the exported trtllm_get_valid_moe_configs forwards
the fused totals so autotuned configs match prepare_moe_common() validation.
…eplay_out) Merged origin/main into nv-yunzheq/DSR1_shared_expert_fusion. New conflicts from PR flashinfer-ai#3024 (routing_replay_out support) resolved by keeping both PR's shared-expert fields and main's routing replay fields. Co-Authored-By: Claude <noreply@anthropic.com>
For #2551
Integrating NVIDIA/TensorRT-LLM#11143
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed pre-commit by running pip install pre-commit (or used your preferred method).
- I have installed the hooks with pre-commit install.
- I have run the hooks with pre-commit run --all-files and fixed any reported issues.

🧪 Tests
- Tests have been added or updated as needed (unittest, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation