[TRTLLM-12200][feat] WideEP FT: add active_rank_mask to NVLink AlltoAll kernels (1a.2)#13404
[TRTLLM-12200][feat] WideEP FT: add active_rank_mask to NVLink AlltoAll kernels (1a.2)#13404chienchunhung wants to merge 2 commits into
Conversation
Cross-link NVIDIA#13404 (the NVLink AlltoAll kernel-mask implementation) from the 1a.2 row of the implementation plan, mirroring the 1a.1 link added in 92af527. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor
|
/bot run --disable-fail-fast |
|
PR_Github #45437 [ run ] triggered by Bot. Commit: |
|
PR_Github #45437 [ run ] completed with state
|
Fifth batch. §8 has Phase 1 PR breakdown (1a kernel, 1b EPLB, 1c detection, 1d integration — 13 MVP PRs + 12 v1), Phase 1-DS (disagg), Phase 2 (3 sub-tracks + MNNVL audit as explicit prereq PR 2a.0), and Phase 3 work-track rough plan. Critical-path Gantt shows three MVP gating items: 1a.2 kernel (in flight as PR NVIDIA#13404), 1c.3 MPI FT subcomm (net-new L), 1d.4 fault-injection harness (net-new L). Timeline summary: MVP 6-7 weeks, full program 7-10 months (AI-assisted), with honest caveats about L-sized risks. Added PR 1d.0 (MPI signal handler replacement) to the MVP list — was implicit in the design but needed to be called out as named work. §9 names two audits as gating risks: MNNVL/NVSHMEM teardown capability (Phase 2 prereq, 1-2 week prototype with concrete scope) and Ray-path WideEP perf characterization (future-migration prereq, gated on Ray-path CI coverage existing at EP≥32 first). §9.2 has 14 technical risks with Severity × Probability × Residual per row (residual column is the newly added column per earlier reviewer feedback). §9.3 has 8 open questions including the Q8 framework for when to revisit the Ray pivot (three conditions all must hold). §9.4 summary matrix with all risks in one place; bolded rows are the active-tracking ones during MVP. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…move v1 files Final batch of the v2 rewrite. §0 executive summary — problem statement, approach, key MPI-vs-Ray decision, two failure modes named, the four TRT-LLM uniqueness properties, headline timeline numbers, and an explicit "what v2 changes vs v1" list for readers coming from the prior version. README — complete rewrite as navigation for v2. Section table with one-line summaries per section. In-flight PR table (NVIDIA#13302 and NVIDIA#13404 against this design). Consolidated terminology table including the rank/process/slot distinction that a reviewer flagged as missing. Scope & non-goals stated once, not repeated throughout. Removing v1 files: - 01-background.md → content folded into 01-user-journey-and-stack.md (§1.3) and 00-executive-summary.md - 02-current-state.md → folded into 01-user-journey-and-stack.md (§1.2) and 03-failure-modes-and-gaps.md - 03-competitive-landscape.md → folded into 02-stack-comparison-and-positioning.md - 04-two-phase-recovery.md → superseded by 04-architecture-overview.md (now three-phase) - 05-rank-masking.md → folded into 05-phase-1-immediate-survival.md §5.1 - 06-eplb-adaptation.md → folded into 05-phase-1-immediate-survival.md §5.2 - 07-failure-detection.md → folded into 05-phase-1-immediate-survival.md §5.3 and §5.4 - 08-mx-gms-integration.md → split between 05-phase-1 (PR NVIDIA#12718 integration in §5.3) and 06-phase-2 (MX-GMS in §6.3) - 09-implementation-plan.md → superseded by 08-implementation-plan.md - 10-risks.md → superseded by 09-risks-and-open-questions.md - COMBINED.md → superseded; single-file view can be regenerated from the split files if needed New v2 file set: README + §0-§9 (11 files) + 3 workflow artifacts (redesign-outline, redesign-research-pass, redesign-research-pass-report). Section count held at 10 numbered sections (§0-§9) but with cleaner phase boundaries — one section per phase. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…iggers Adds a "Status: paused (2026-05-19)" section to mvp-prototype-findings.md explaining that the prototype's primary mandate is empirically discharged (F1-F5 + OQ2 + OQ4 closed) and the remaining four pending items all hit diminishing returns vs. the production PRs they unblock. Documents five concrete resumption triggers with the action required for each: * PR NVIDIA#13404 (1a.2 kernel mask) lands -> seam-stressing + OQ1 * PR 1a.4 (production AlltoAllWatchdog) lands -> reproduce F3/F4/F5 under real MNNVL fabric memory * PR 1c.3 (MPI FT subcomm) lands -> swap stubs/mpi_ft_subcomm.py and tighten F2 mitigation against world_is_poisoned() * NVL72 access -> false-positive floor + scale validation (Audit 1b) * PR 1d.4 starts -> hand off driver + timeline JSONs as regression baseline Also documents the mechanical steps to resume so future-anyone (including future-me) doesn't have to re-derive how the worktree, branch, and cherry-pick chain hang together. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
f298ca6 to
1aaf884
Compare
|
/bot run |
|
PR_Github #54374 [ run ] triggered by Bot. Commit: |
|
PR_Github #54374 [ run ] completed with state
|
1aaf884 to
72eb5dd
Compare
|
/bot run |
|
PR_Github #54382 [ run ] triggered by Bot. Commit: |
|
PR_Github #54382 [ run ] completed with state
|
72eb5dd to
98d59fd
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #54485 [ run ] triggered by Bot. Commit: |
|
PR_Github #54485 [ run ] completed with state
|
|
/bot run --disable-fail-fast --stage-list "DGX_B200-PyTorch-2,DGX_B200-4_GPUs-PyTorch-Ray-1" |
|
/bot run --disable-fail-fast |
|
PR_Github #54640 [ run ] triggered by Bot. Commit: |
|
PR_Github #54641 [ run ] triggered by Bot. Commit: |
|
PR_Github #54640 [ run ] completed with state |
|
PR_Github #54641 [ run ] completed with state |
📝 WalkthroughWalkthroughThe PR introduces an ChangesMoE A2A Active-Rank Mask
Sequence Diagram(s)sequenceDiagram
participant Caller as Python/C++ caller
participant Host as moeA2ADispatchOp / moeA2ACombineOp
participant Resolver as resolveActiveRankMask
participant DispatchKernel as moeA2ADispatchKernel (GPU)
participant CombineKernel as moeA2ACombineKernel (GPU)
Caller->>Host: dispatch(inputs, active_rank_mask=T)
Host->>Resolver: validate dtype/shape/ep_rank bit
Resolver-->>Host: params.active_rank_mask[]
Host->>DispatchKernel: launch(DispatchKernelPointers{active_rank_mask})
DispatchKernel->>DispatchKernel: is_rank_active(target) → skip dead-rank routing
DispatchKernel->>DispatchKernel: skip recv_counters/EPLB for dead ranks
DispatchKernel->>DispatchKernel: release-store completion flags → active peers only
DispatchKernel->>DispatchKernel: wait/spin → active peers only
DispatchKernel-->>Host: routing table (dead slots = -1)
Caller->>Host: combine(inputs, active_rank_mask=T)
Host->>Resolver: validate dtype/shape/ep_rank bit
Resolver-->>Host: params.active_rank_mask[]
Host->>CombineKernel: launch(CombineKernelPointers{active_rank_mask})
CombineKernel->>CombineKernel: release-store readiness flags → active peers only
CombineKernel->>CombineKernel: wait/spin → active peers only
CombineKernel-->>Host: combined output (dead-targeted slots dropped)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)
427-473:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winUndefined behavior when
target_rank >= 64.The
already_copiedbitmask is a singleuint64_t, butkMaxRanksis now 128. Whentarget_rank >= 64(possible withep_size > 64, e.g., NVL72 with 72 ranks), the expression1ULL << target_rankis undefined behavior per the C++ standard (shifting by ≥ bit-width).This breaks duplicate detection for deployments exceeding 64 EP ranks.
🐛 Proposed fix: use a 2-word bitmask like active_rank_mask
- uint64_t already_copied = 0; + uint64_t already_copied[kRankMaskWords] = {0, 0}; // ... existing code ... for (int k = 0; k < TOP_K; k++) { // ... existing code ... - if ((already_copied & (1ULL << target_rank)) || target_dead) + int const word_idx = target_rank >> 6; + uint64_t const bit_mask = 1ULL << (target_rank & 63); + if ((already_copied[word_idx] & bit_mask) || target_dead) { // ... existing skip logic ... continue; } // ... existing send logic ... - already_copied |= 1ULL << target_rank; + already_copied[word_idx] |= bit_mask; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu` around lines 427 - 473, The `already_copied` bitmask variable uses a single `uint64_t` which can only represent 64 ranks, but `target_rank` can be up to 127 (since `kMaxRanks` is 128), causing undefined behavior when shifting by values >= 64. Replace the single `uint64_t already_copied` with a 2-word bitmask structure (similar to how `active_rank_mask` is implemented) to support up to 128 ranks. Update all bit operations on `already_copied` - specifically the check `(already_copied & (1ULL << target_rank))` and the assignment `already_copied |= 1ULL << target_rank` - to use helper functions or logic that operates across the 2-word structure based on whether `target_rank < 64` or `target_rank >= 64`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp`:
- Around line 51-77: Add a validation check to ensure that `epRank` is within
the bounds of the fixed-width rank mask and parameter arrays before any
rank-indexed array access occurs. The current validation `epRank < epSize` is
insufficient because if `epSize` exceeds the fixed array capacity (likely
`kMaxRanks`), the function `resolveActiveRankMask()` can index past the bounds
of the `out` array when accessing `out[epRank >> 6]`. Insert a host-side bounds
check that validates `epRank < kMaxRanks` (or the appropriate fixed array
capacity constant) before calling functions that access rank-indexed arrays with
rank values derived from `epRank`.
In `@tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py`:
- Around line 283-287: Replace the broad `except Exception` clause in both
locations with specific exception types that indicate unsupported hardware. In
the try block where MnnvlMemory.initialize() is called and
MnnvlMemory.supports_mnnvl() is checked (at lines 283-287 and also at lines
332-336), catch only the specific exceptions that indicate the system does not
support MNNVL hardware, rather than all exceptions. This allows unexpected
regressions in these method calls to propagate as test failures instead of being
silently skipped.
- Around line 299-304: Add `strict=True` parameter to the `zip()` call in the
mpi_pool_executor.map() function to enforce that the iterables passed to zip are
of equal length, making the invariant explicit as required by Ruff linting.
Apply the same fix to the other zip() call at lines 358-362.
---
Outside diff comments:
In `@cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu`:
- Around line 427-473: The `already_copied` bitmask variable uses a single
`uint64_t` which can only represent 64 ranks, but `target_rank` can be up to 127
(since `kMaxRanks` is 128), causing undefined behavior when shifting by values
>= 64. Replace the single `uint64_t already_copied` with a 2-word bitmask
structure (similar to how `active_rank_mask` is implemented) to support up to
128 ranks. Update all bit operations on `already_copied` - specifically the
check `(already_copied & (1ULL << target_rank))` and the assignment
`already_copied |= 1ULL << target_rank` - to use helper functions or logic that
operates across the 2-word structure based on whether `target_rank < 64` or
`target_rank >= 64`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bed8d775-11ff-4a97-9e7d-1177c057b071
📒 Files selected for processing (6)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cucpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.hcpp/tensorrt_llm/thop/moeAlltoAllOp.cpptensorrt_llm/_torch/custom_ops/cpp_custom_ops.pytensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.pytests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py
|
/bot run --disable-fail-fast |
👎 Promotion blocked, new vulnerability foundVulnerability report
|
5f7693b to
685ea79
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #54680 [ run ] triggered by Bot. Commit: |
|
PR_Github #54680 [ run ] completed with state
|
685ea79 to
eb7cfb3
Compare
|
/bot run |
|
PR_Github #54889 [ run ] triggered by Bot. Commit: |
|
PR_Github #54889 [ run ] completed with state
|
eb7cfb3 to
c21304f
Compare
Eliminates the infinite-spin AlltoAll hang that turns a single GPU failure in a Wide-EP group into a 5-minute HangDetector fire + full restart. The dispatch and combine kernels now take a uint64[2] bitmask of currently-alive EP ranks; dead ranks are skipped on every completion-flag write/wait, peer recv_counter store, EPLB stats write, and per-token routing decision (dead-targeted slots collapse to the same -1 sentinel combine already uses for duplicates). The mask is optional on both torch ops; omitting it (or passing all-ones) produces bit-identical output to the pre-change kernel. kMaxRanks is bumped 64 -> 128 to cover NVL72 with headroom; kRankMaskWords = 2 names the kernel ABI explicitly. Tests cover (a) all-ones mask matches no-mask bit-for-bit, and (b) one rank masked dead -> surviving ranks complete dispatch+combine without hang, dead-targeted topk slots dropped, in tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
c21304f to
985d64b
Compare
|
/bot run |
|
PR_Github #54994 [ run ] triggered by Bot. Commit: |
|
PR_Github #54994 [ run ] completed with state
|
Summary by CodeRabbit
Release Notes
active_rank_maskparameter for MoE dispatch and combine operations, with default behavior unchanged when not specified.Description
Adds an
active_rank_maskparameter (uint64[2] bitmask of currently-alive EP ranks) to the NVLink one-sided MoE AlltoAll dispatch and combine kernels. When the mask is omitted, behavior is bit-identical to before; when a bit is cleared, the kernel skips every interaction with that peer rank — most importantly, the completion-flag spin-wait loops that today hang forever when a peer dies. No new behavior is exposed at the Python wrapper layer in this PR — only the kernel and thetorch.ops.trtllm.moe_a2a_*op signatures.Background
When a single GPU dies in a Wide-EP group, the surviving ranks today spin forever on
completion_flagsin symmetric memory: the dispatch and combine kernels poll a flag word per peer and have no timeout, no abort, and no fallback. The system limps along until the host-sideHangDetectorfires after 5 minutes, then a full executor restart adds 2-3 more minutes of warmup — so a single GPU failure costs ~7-8 minutes of full downtime. This PR is the kernel-level half of the fix: the kernels themselves now have a way to know which peers are alive, so the spin loops can skip dead ranks instead of waiting for them. The Python wrapper that actually feeds the mask in (sourced fromEPGroupHealth) is a follow-up PR.What this PR does
The mask is plumbed through every layer that touches the kernels:
kRankMaskWords = 2anduint64_t active_rank_mask[kRankMaskWords]added toMoeA2ADispatchParams,MoeA2ACombineParams, and the kernel-side pointer structs.kMaxRanksis bumped 64 → 128 to cover NVL72 (72 ranks) with headroom.-1sentinel that combine already uses for duplicates, (b) therecv_countersstore loop, (c) the EPLB stats write loop, (d) the completion-flag write loop, and most importantly (e) the completion-flag wait loop where the hang lives.topk_send_indices[k] = -1for dead-targeted slots, and the existingdst_idx < 0guard handles them.moe_a2a_dispatchandmoe_a2a_combinegain an optionalTensor? active_rank_mask=Noneparameter (CPUuint64[2]). A smallresolveActiveRankMaskhelper validates dtype/device/shape and defaults to all-ones when omitted, so existing call sites (including the unchangedMoeAlltoAllPython wrapper) work bit-identically.Key design choices:
is_rank_active(mask, rank)is one word load + one shift + one mask + one branch. The dispatch and combine peer loops runO(ep_size)iterations once per kernel invocation, so the overhead at NVL72 is at most 72 extra branches per launch — well inside the <0.1% steady-state regression gate.MoeA2ADispatchParams params{};) get the safe default automatically;resolveActiveRankMaskthen overwrites with the user mask or all-ones.Test Coverage
tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py— MPI-driven multi-GPU test, follows the sameMnnvlMemory.supports_mnnvl()skip pattern as the existingtest_moe_a2a.py. Both tests bypass theMoeAlltoAllPython wrapper and calltorch.ops.trtllm.moe_a2a_dispatch/moe_a2a_combinedirectly so they exercise the new C++ op signature without depending on the follow-up wrapper PR.test_all_active_mask_matches_no_mask— regression guard. Two consecutive dispatch + combine rounds on identical input onep_size = 4, one withmask = Noneand one withmask = all-ones. Asserts bit-identical combined output and identicaltopk_target_ranksworkspace state. Parametrized over(local_num_tokens, top_k) ∈ {(16, 2), (32, 4)}.test_one_rank_masked_completes— parametrized overdead_rank ∈ {0, 2, 3}to cover the lowest-numbered, mid, and highest-numbered ranks (so the bit-mask logic is exercised at both ends of word 0). The "dead" rank participates in workspace init (which has its own MPI barrier), then sits atMPI.COMM_WORLD.barrier(); the surviving ranks call dispatch + combine with the dead rank's bit cleared. Asserts (a) the test reaches its assertions at all (no hang), (b) on every surviving rank, every top-k slot whose expert routed todead_rankwas dropped to-1, (c) all other slots match what the contiguous-partition routing rule predicts, and (d) the combined output has the expected(local_num_tokens, hidden_size)shape.Both tests need MNNVL hardware (GB200) and
ep_sizeGPUs to actually run; they skip cleanly elsewhere. Run:mpirun -np 4 pytest tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py -v.PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.