
[TRTLLM-12200][feat] WideEP FT: add active_rank_mask to NVLink AlltoAll kernels (1a.2) #13404

Open
chienchunhung wants to merge 2 commits into NVIDIA:main from chienchunhung:WideEP-FT/1a.2-nvlink-kernel-mask

@chienchunhung chienchunhung commented Apr 24, 2026

Collaborator

Summary by CodeRabbit

Release Notes

  • New Features
    • Added active rank mask capability to optimize MoE all-to-all communication by efficiently skipping inactive ranks during dispatch and combine operations.
    • Extended maximum supported Expert Parallel (EP) size from 64 to 128 ranks.
    • New optional active_rank_mask parameter for MoE dispatch and combine operations, with default behavior unchanged when not specified.

Description

Adds an active_rank_mask parameter (uint64[2] bitmask of currently-alive EP ranks) to the NVLink one-sided MoE AlltoAll dispatch and combine kernels. When the mask is omitted, behavior is bit-identical to before; when a bit is cleared, the kernel skips every interaction with that peer rank — most importantly, the completion-flag spin-wait loops that today hang forever when a peer dies. No new behavior is exposed at the Python wrapper layer in this PR — only the kernel and the torch.ops.trtllm.moe_a2a_* op signatures.

Background

When a single GPU dies in a Wide-EP group, the surviving ranks today spin forever on completion_flags in symmetric memory: the dispatch and combine kernels poll a flag word per peer and have no timeout, no abort, and no fallback. The system limps along until the host-side HangDetector fires after 5 minutes, then a full executor restart adds 2-3 more minutes of warmup — so a single GPU failure costs ~7-8 minutes of full downtime. This PR is the kernel-level half of the fix: the kernels themselves now have a way to know which peers are alive, so the spin loops can skip dead ranks instead of waiting for them. The Python wrapper that actually feeds the mask in (sourced from EPGroupHealth) is a follow-up PR.
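
The hang described above can be modelled on the host in a few lines. This is a minimal Python sketch, not the CUDA kernel: `wait_for_peers`, the flag list, and `max_polls` are hypothetical names introduced for illustration, and the poll limit exists only so the simulation terminates (the real spin loop has no such limit).

```python
from typing import List, Optional

WORD_BITS = 64  # each mask word covers 64 ranks


def is_rank_active(mask: List[int], rank: int) -> bool:
    """Test one bit of a multi-word rank mask (bit rank % 64 of word rank // 64)."""
    return ((mask[rank // WORD_BITS] >> (rank % WORD_BITS)) & 1) == 1


def wait_for_peers(flags: List[int], expected: int, mask: List[int],
                   max_polls: int = 1000) -> Optional[int]:
    """Poll one completion flag per peer.

    Returns the number of polls spent waiting, or None when the poll budget
    runs out. None stands in for "spins forever" in this simulation only.
    """
    polls = 0
    for peer in range(len(flags)):
        if not is_rank_active(mask, peer):
            continue  # masked peer: never wait on its flag
        while flags[peer] != expected:
            polls += 1
            if polls >= max_polls:
                return None
    return polls


# Rank 2 died before writing its completion flag.
flags = [7, 7, 0, 7]
hung = wait_for_peers(flags, expected=7, mask=[0b1111, 0])
survived = wait_for_peers(flags, expected=7, mask=[0b1011, 0])
```

With every bit set, the loop exhausts its budget on rank 2's flag; with bit 2 cleared, the same call returns without polling that peer at all.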

What this PR does

The mask is plumbed through every layer that touches the kernels:

  • Kernel ABI: kRankMaskWords = 2 and uint64_t active_rank_mask[kRankMaskWords] added to MoeA2ADispatchParams, MoeA2ACombineParams, and the kernel-side pointer structs. kMaxRanks is bumped 64 → 128 to cover NVL72 (72 ranks) with headroom.
  • Dispatch kernel: masked targets are skipped on (a) per-token routing — a token whose top-k expert lives on a dead rank collapses to the same -1 sentinel that combine already uses for duplicates, (b) the recv_counters store loop, (c) the EPLB stats write loop, (d) the completion-flag write loop, and most importantly (e) the completion-flag wait loop where the hang lives.
  • Combine kernel: masked peers are skipped on the completion-flag write and wait loops. The per-token reduction needs no explicit mask check because dispatch already set topk_send_indices[k] = -1 for dead-targeted slots, and the existing dst_idx < 0 guard handles them.
  • Torch op: moe_a2a_dispatch and moe_a2a_combine gain an optional Tensor? active_rank_mask=None parameter (CPU uint64[2]). A small resolveActiveRankMask helper validates dtype/device/shape and defaults to all-ones when omitted, so existing call sites (including the unchanged MoeAlltoAll Python wrapper) work bit-identically.
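
The per-token routing rule in the dispatch bullet can be sketched in plain Python. This is an illustrative model under stated assumptions: `route_token` is a hypothetical helper, the contiguous expert-to-rank partition (`expert // experts_per_rank`) is assumed from the test description, and a Python set stands in for the kernel's duplicate-tracking bitmask.

```python
from typing import List, Sequence

DROPPED = -1  # sentinel for a top-k slot that sends nothing


def is_rank_active(mask: Sequence[int], rank: int) -> bool:
    """Bit test against a two-word (uint64[2]-style) rank mask."""
    return ((mask[rank >> 6] >> (rank & 63)) & 1) == 1


def route_token(topk_experts: Sequence[int], experts_per_rank: int,
                mask: Sequence[int]) -> List[int]:
    """Return the target rank for each top-k slot, or DROPPED if skipped.

    A slot is skipped when the token was already sent to that rank
    (duplicate) or when the rank's mask bit is cleared (dead target).
    """
    already_copied = set()  # ranks this token has already been sent to
    targets = []
    for expert in topk_experts:
        rank = expert // experts_per_rank  # assumed contiguous partition
        if rank in already_copied or not is_rank_active(mask, rank):
            targets.append(DROPPED)
            continue
        already_copied.add(rank)
        targets.append(rank)
    return targets


# 8 experts over 4 ranks (2 per rank); the token selects experts 0, 1, 5, 6.
all_alive = route_token([0, 1, 5, 6], 2, [0b1111, 0])
rank2_dead = route_token([0, 1, 5, 6], 2, [0b1011, 0])
```

Because dead targets and duplicates produce the same sentinel, a downstream consumer that already ignores negative indices needs no extra branch, which is the property the combine bullet relies on.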

Key design choices:

  • Single bit-test per peer iteration. is_rank_active(mask, rank) is one word load + one shift + one mask + one branch. The dispatch and combine peer loops run O(ep_size) iterations once per kernel invocation, so the overhead at NVL72 is at most 72 extra branches per launch — well inside the <0.1% steady-state regression gate.
  • Mask lives in the launch-param struct, not as a separate kernel arg. Keeps the change localised and means launch sites that zero-initialise their params (MoeA2ADispatchParams params{};) get the safe default automatically; resolveActiveRankMask then overwrites with the user mask or all-ones.
  • Local rank's bit must always be set. Asserted at launch time and at the torch-op boundary; the kernel itself is running on the local rank, so a cleared self-bit indicates an upstream bug rather than a recoverable state.
  • Dead-target tokens are dropped rather than re-routed. The kernel-layer responsibility is "make AlltoAll survive a rank failure"; choosing where else to send a dead-targeted token is an EPLB-layer concern (separate follow-up PR). Dropping is the same code path as the existing duplicate handling, so combine needs no new logic.
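
The defaulting and validation behaviour described for the mask can be approximated as follows. This is a hedged Python model of what a helper like resolveActiveRankMask might check, not the C++ implementation: the function name, error types, and messages are assumptions, and plain integers stand in for the CPU uint64[2] tensor.

```python
from typing import List, Optional, Sequence

K_MAX_RANKS = 128       # mirrors kMaxRanks from the PR description
K_RANK_MASK_WORDS = 2   # mirrors kRankMaskWords
WORD_MASK = (1 << 64) - 1


def resolve_active_rank_mask(mask: Optional[Sequence[int]], ep_rank: int,
                             ep_size: int) -> List[int]:
    """Return a validated two-word mask, defaulting to all-ones."""
    if not (0 < ep_size <= K_MAX_RANKS):
        raise ValueError("ep_size must be in 1..K_MAX_RANKS")
    if not (0 <= ep_rank < ep_size):
        raise ValueError("ep_rank must be in 0..ep_size-1")
    if mask is None:
        resolved = [WORD_MASK] * K_RANK_MASK_WORDS  # every rank alive
    else:
        if len(mask) != K_RANK_MASK_WORDS:
            raise ValueError("mask must have exactly K_RANK_MASK_WORDS words")
        if any(not (0 <= word <= WORD_MASK) for word in mask):
            raise ValueError("each mask word must fit in 64 bits")
        resolved = list(mask)
    # The kernel runs on the local rank, so its own bit must be set.
    if ((resolved[ep_rank >> 6] >> (ep_rank & 63)) & 1) == 0:
        raise ValueError("local rank's bit must be set")
    return resolved


default_mask = resolve_active_rank_mask(None, ep_rank=0, ep_size=4)
user_mask = resolve_active_rank_mask([0b1011, 0], ep_rank=0, ep_size=4)
```

Checking `ep_rank < ep_size <= K_MAX_RANKS` before the bit test keeps the word index `ep_rank >> 6` inside the two-word array.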

Test Coverage

tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py — MPI-driven multi-GPU test, follows the same MnnvlMemory.supports_mnnvl() skip pattern as the existing test_moe_a2a.py. Both tests bypass the MoeAlltoAll Python wrapper and call torch.ops.trtllm.moe_a2a_dispatch / moe_a2a_combine directly so they exercise the new C++ op signature without depending on the follow-up wrapper PR.

  • test_all_active_mask_matches_no_mask: regression guard. Two consecutive dispatch + combine rounds on identical input on ep_size = 4, one with mask = None and one with mask = all-ones. Asserts bit-identical combined output and identical topk_target_ranks workspace state. Parametrized over (local_num_tokens, top_k) ∈ {(16, 2), (32, 4)}.
  • test_one_rank_masked_completes: parametrized over dead_rank ∈ {0, 2, 3} to cover the lowest-numbered, a middle, and the highest-numbered rank (so the bit-mask logic is exercised at both ends of the occupied bit range, bits 0 and 3 of word 0). The "dead" rank participates in workspace init (which has its own MPI barrier), then sits at MPI.COMM_WORLD.barrier(); the surviving ranks call dispatch + combine with the dead rank's bit cleared. Asserts (a) the test reaches its assertions at all (no hang), (b) on every surviving rank, every top-k slot whose expert routed to dead_rank was dropped to -1, (c) all other slots match what the contiguous-partition routing rule predicts, and (d) the combined output has the expected (local_num_tokens, hidden_size) shape.

Both tests need MNNVL hardware (GB200) and ep_size GPUs to actually run; they skip cleanly elsewhere. Run: mpirun -np 4 pytest tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py -v.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Apr 24, 2026
Cross-link NVIDIA#13404 (the NVLink AlltoAll kernel-mask
implementation) from the 1a.2 row of the implementation plan, mirroring
the 1a.1 link added in 92af527.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
@chienchunhung

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #45437 [ run ] triggered by Bot. Commit: f298ca6 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #45437 [ run ] completed with state FAILURE. Commit: f298ca6
/LLM/main/L0_MergeRequest_PR pipeline #35669 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Apr 25, 2026
Fifth batch. §8 has Phase 1 PR breakdown (1a kernel, 1b EPLB, 1c
detection, 1d integration — 13 MVP PRs + 12 v1), Phase 1-DS
(disagg), Phase 2 (3 sub-tracks + MNNVL audit as explicit prereq
PR 2a.0), and Phase 3 work-track rough plan. Critical-path Gantt
shows three MVP gating items: 1a.2 kernel (in flight as PR NVIDIA#13404),
1c.3 MPI FT subcomm (net-new L), 1d.4 fault-injection harness
(net-new L). Timeline summary: MVP 6-7 weeks, full program 7-10
months (AI-assisted), with honest caveats about L-sized risks.
Added PR 1d.0 (MPI signal handler replacement) to the MVP list —
was implicit in the design but needed to be called out as named
work.

§9 names two audits as gating risks: MNNVL/NVSHMEM teardown
capability (Phase 2 prereq, 1-2 week prototype with concrete scope)
and Ray-path WideEP perf characterization (future-migration prereq,
gated on Ray-path CI coverage existing at EP≥32 first). §9.2 has 14
technical risks with Severity × Probability × Residual per row
(residual column is the newly added column per earlier reviewer
feedback). §9.3 has 8 open questions including the Q8 framework
for when to revisit the Ray pivot (three conditions all must hold).
§9.4 summary matrix with all risks in one place; bolded rows are
the active-tracking ones during MVP.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Apr 25, 2026
…move v1 files

Final batch of the v2 rewrite.

§0 executive summary — problem statement, approach, key MPI-vs-Ray
decision, two failure modes named, the four TRT-LLM uniqueness
properties, headline timeline numbers, and an explicit "what v2
changes vs v1" list for readers coming from the prior version.

README — complete rewrite as navigation for v2. Section table with
one-line summaries per section. In-flight PR table (NVIDIA#13302 and
NVIDIA#13404 against this design). Consolidated terminology table
including the rank/process/slot distinction that a reviewer flagged
as missing. Scope & non-goals stated once, not repeated throughout.

Removing v1 files:
- 01-background.md → content folded into 01-user-journey-and-stack.md (§1.3)
  and 00-executive-summary.md
- 02-current-state.md → folded into 01-user-journey-and-stack.md (§1.2)
  and 03-failure-modes-and-gaps.md
- 03-competitive-landscape.md → folded into 02-stack-comparison-and-positioning.md
- 04-two-phase-recovery.md → superseded by 04-architecture-overview.md
  (now three-phase)
- 05-rank-masking.md → folded into 05-phase-1-immediate-survival.md §5.1
- 06-eplb-adaptation.md → folded into 05-phase-1-immediate-survival.md §5.2
- 07-failure-detection.md → folded into 05-phase-1-immediate-survival.md §5.3
  and §5.4
- 08-mx-gms-integration.md → split between 05-phase-1 (PR NVIDIA#12718 integration
  in §5.3) and 06-phase-2 (MX-GMS in §6.3)
- 09-implementation-plan.md → superseded by 08-implementation-plan.md
- 10-risks.md → superseded by 09-risks-and-open-questions.md
- COMBINED.md → superseded; single-file view can be regenerated from the
  split files if needed

New v2 file set: README + §0-§9 (11 files) + 3 workflow artifacts
(redesign-outline, redesign-research-pass, redesign-research-pass-report).
Section count held at 10 numbered sections (§0-§9) but with cleaner
phase boundaries — one section per phase.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 19, 2026
…iggers

Adds a "Status: paused (2026-05-19)" section to mvp-prototype-findings.md
explaining that the prototype's primary mandate is empirically discharged
(F1-F5 + OQ2 + OQ4 closed) and the remaining four pending items all hit
diminishing returns vs. the production PRs they unblock.

Documents five concrete resumption triggers with the action required for
each:
  * PR NVIDIA#13404 (1a.2 kernel mask) lands -> seam-stressing + OQ1
  * PR 1a.4 (production AlltoAllWatchdog) lands -> reproduce F3/F4/F5
    under real MNNVL fabric memory
  * PR 1c.3 (MPI FT subcomm) lands -> swap stubs/mpi_ft_subcomm.py and
    tighten F2 mitigation against world_is_poisoned()
  * NVL72 access -> false-positive floor + scale validation (Audit 1b)
  * PR 1d.4 starts -> hand off driver + timeline JSONs as regression
    baseline

Also documents the mechanical steps to resume so future-anyone (including
future-me) doesn't have to re-derive how the worktree, branch, and
cherry-pick chain hang together.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from f298ca6 to 1aaf884 Compare June 15, 2026 23:28

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54374 [ run ] triggered by Bot. Commit: 1aaf884 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54374 [ run ] completed with state FAILURE. Commit: 1aaf884
/LLM/main/L0_MergeRequest_PR pipeline #43444 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from 1aaf884 to 72eb5dd Compare June 16, 2026 00:07

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54382 [ run ] triggered by Bot. Commit: 72eb5dd Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54382 [ run ] completed with state FAILURE. Commit: 72eb5dd
/LLM/main/L0_MergeRequest_PR pipeline #43451 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from 72eb5dd to 98d59fd Compare June 16, 2026 05:44

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54485 [ run ] triggered by Bot. Commit: 98d59fd Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54485 [ run ] completed with state SUCCESS. Commit: 98d59fd
/LLM/main/L0_MergeRequest_PR pipeline #43548 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Collaborator Author

/bot run --disable-fail-fast --stage-list "DGX_B200-PyTorch-2,DGX_B200-4_GPUs-PyTorch-Ray-1"

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54640 [ run ] triggered by Bot. Commit: 98d59fd Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54641 [ run ] triggered by Bot. Commit: 98d59fd Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54640 [ run ] completed with state ABORTED. Commit: 98d59fd

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54641 [ run ] completed with state SUCCESS. Commit: 98d59fd
/LLM/main/L0_MergeRequest_PR pipeline #43672 completed with status: 'SUCCESS'

CI Report

Link to invocation

@chienchunhung chienchunhung marked this pull request as ready for review June 16, 2026 20:37
@chienchunhung chienchunhung requested review from a team as code owners June 16, 2026 20:37
@chienchunhung chienchunhung requested a review from yizhang-nv June 16, 2026 20:37
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Contributor

Review Change Stack

📝 Walkthrough

The PR introduces an active_rank_mask bitmask mechanism to MoE all-to-all dispatch and combine CUDA kernels, enabling inactive ("dead") EP ranks to be skipped during routing, counter writes, flag signaling, and peer spin-wait loops. kMaxRanks is increased from 64 to 128. The mask is threaded through the PyTorch op boundary via new optional parameters in C++ ops and Python fake implementations, with a new test module validating correctness under masked and unmasked conditions across multiple GPUs.

Changes

MoE A2A Active-Rank Mask

| Layer / File(s) | Summary |
|---|---|
| Header constants and struct fields: cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h | Increases kMaxRanks to 128, adds kRankMaskWords with a static_assert, and adds active_rank_mask[kRankMaskWords] to DispatchKernelPointers, CombineKernelPointers, MoeA2ADispatchParams, and MoeA2ACombineParams. |
| Dispatch kernel rank-mask enforcement: cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (lines 213–587, 635–681) | Adds is_rank_active device helper; marks inactive target-rank destinations dead in routing; guards recv_counters/EPLB writes, completion-flag release-stores, and peer wait loops to active ranks only; adds host-side ep_rank validation and mask copy into DispatchKernelPointers. |
| Combine kernel rank-mask enforcement: cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (lines 1193–1363) | Restricts combine-kernel completion-flag release-stores and readiness wait/spin loops to active peers; adds host-side ep_rank validation and mask copy into CombineKernelPointers. |
| PyTorch op wiring and Python-side constants: cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py, tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py | Adds resolveActiveRankMask C++ helper; updates dispatch/combine op signatures and PyTorch schemas with Tensor? active_rank_mask=None; updates fake op signatures; increases NVLinkOneSided.MAX_RANKS to 128. |
| Multi-GPU rank-mask tests: tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py | New test module with mask construction, payload generation, routing-table readback helpers, _run_dispatch_combine harness, and two parameterized pytest tests (all-active-mask-matches-no-mask and one-rank-masked-completes) requiring MNNVL hardware. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Python/C++ caller
    participant Host as moeA2ADispatchOp / moeA2ACombineOp
    participant Resolver as resolveActiveRankMask
    participant DispatchKernel as moeA2ADispatchKernel (GPU)
    participant CombineKernel as moeA2ACombineKernel (GPU)

    Caller->>Host: dispatch(inputs, active_rank_mask=T)
    Host->>Resolver: validate dtype/shape/ep_rank bit
    Resolver-->>Host: params.active_rank_mask[]
    Host->>DispatchKernel: launch(DispatchKernelPointers{active_rank_mask})
    DispatchKernel->>DispatchKernel: is_rank_active(target) → skip dead-rank routing
    DispatchKernel->>DispatchKernel: skip recv_counters/EPLB for dead ranks
    DispatchKernel->>DispatchKernel: release-store completion flags → active peers only
    DispatchKernel->>DispatchKernel: wait/spin → active peers only
    DispatchKernel-->>Host: routing table (dead slots = -1)

    Caller->>Host: combine(inputs, active_rank_mask=T)
    Host->>Resolver: validate dtype/shape/ep_rank bit
    Resolver-->>Host: params.active_rank_mask[]
    Host->>CombineKernel: launch(CombineKernelPointers{active_rank_mask})
    CombineKernel->>CombineKernel: release-store readiness flags → active peers only
    CombineKernel->>CombineKernel: wait/spin → active peers only
    CombineKernel-->>Host: combined output (dead-targeted slots dropped)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • leslie-fang25
  • hyukn
  • bobboli
  • liji-nv
  • chang-l
  • xxi-nv
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 59.26% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding an active_rank_mask parameter to NVLink AlltoAll kernels as part of the WideEP FT feature set (phase 1a.2). |
| Description check | ✅ Passed | The PR description comprehensively covers the background, implementation details, design choices, test coverage, and includes a completed checklist. All required sections are present and well-documented. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)

427-473: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Undefined behavior when target_rank >= 64.

The already_copied bitmask is a single uint64_t, but kMaxRanks is now 128. When target_rank >= 64 (possible with ep_size > 64, e.g., NVL72 with 72 ranks), the expression 1ULL << target_rank is undefined behavior per the C++ standard (shifting by ≥ bit-width).

This breaks duplicate detection for deployments exceeding 64 EP ranks.

🐛 Proposed fix: use a 2-word bitmask like active_rank_mask
-        uint64_t already_copied = 0;
+        uint64_t already_copied[kRankMaskWords] = {0, 0};
         // ... existing code ...
         for (int k = 0; k < TOP_K; k++)
         {
             // ... existing code ...
-            if ((already_copied & (1ULL << target_rank)) || target_dead)
+            int const word_idx = target_rank >> 6;
+            uint64_t const bit_mask = 1ULL << (target_rank & 63);
+            if ((already_copied[word_idx] & bit_mask) || target_dead)
             {
                 // ... existing skip logic ...
                 continue;
             }
             // ... existing send logic ...
-            already_copied |= 1ULL << target_rank;
+            already_copied[word_idx] |= bit_mask;
         }
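
The word/bit split used in the proposed fix can be checked with a small host-side model. This is a Python sketch only: `RankBitset` is a hypothetical class, not code from the PR. Python integers are arbitrary precision, so the sketch does not reproduce the C++ undefined behavior; it only shows how a rank in 0..127 maps to a word index and a bit inside that word.

```python
WORD_BITS = 64


class RankBitset:
    """Fixed-width bitset over ranks, stored as 64-bit words."""

    def __init__(self, num_words: int = 2) -> None:
        self.words = [0] * num_words

    def set(self, rank: int) -> None:
        # rank >> 6 selects the word, rank & 63 selects the bit inside it.
        self.words[rank >> 6] |= 1 << (rank & 63)

    def test(self, rank: int) -> bool:
        return bool(self.words[rank >> 6] & (1 << (rank & 63)))


already_copied = RankBitset()
already_copied.set(71)  # a rank that only exists when ep_size > 64
```

Rank 71 lands in word 1, bit 7, and stays distinct from rank 7 in word 0; a single 64-bit word has no position that could hold it.
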
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu` around
lines 427 - 473, The `already_copied` bitmask variable uses a single `uint64_t`
which can only represent 64 ranks, but `target_rank` can be up to 127 (since
`kMaxRanks` is 128), causing undefined behavior when shifting by values >= 64.
Replace the single `uint64_t already_copied` with a 2-word bitmask structure
(similar to how `active_rank_mask` is implemented) to support up to 128 ranks.
Update all bit operations on `already_copied` - specifically the check
`(already_copied & (1ULL << target_rank))` and the assignment `already_copied |=
1ULL << target_rank` - to use helper functions or logic that operates across the
2-word structure based on whether `target_rank < 64` or `target_rank >= 64`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp`:
- Around line 51-77: Add a validation check to ensure that `epRank` is within
the bounds of the fixed-width rank mask and parameter arrays before any
rank-indexed array access occurs. The current validation `epRank < epSize` is
insufficient because if `epSize` exceeds the fixed array capacity (likely
`kMaxRanks`), the function `resolveActiveRankMask()` can index past the bounds
of the `out` array when accessing `out[epRank >> 6]`. Insert a host-side bounds
check that validates `epRank < kMaxRanks` (or the appropriate fixed array
capacity constant) before calling functions that access rank-indexed arrays with
rank values derived from `epRank`.

In `@tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py`:
- Around line 283-287: Replace the broad `except Exception` clause in both
locations with specific exception types that indicate unsupported hardware. In
the try block where MnnvlMemory.initialize() is called and
MnnvlMemory.supports_mnnvl() is checked (at lines 283-287 and also at lines
332-336), catch only the specific exceptions that indicate the system does not
support MNNVL hardware, rather than all exceptions. This allows unexpected
regressions in these method calls to propagate as test failures instead of being
silently skipped.
- Around line 299-304: Add `strict=True` parameter to the `zip()` call in the
mpi_pool_executor.map() function to enforce that the iterables passed to zip are
of equal length, making the invariant explicit as required by Ruff linting.
Apply the same fix to the other zip() call at lines 358-362.

---

Outside diff comments:
In `@cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu`:
- Around line 427-473: The `already_copied` bitmask variable uses a single
`uint64_t` which can only represent 64 ranks, but `target_rank` can be up to 127
(since `kMaxRanks` is 128), causing undefined behavior when shifting by values
>= 64. Replace the single `uint64_t already_copied` with a 2-word bitmask
structure (similar to how `active_rank_mask` is implemented) to support up to
128 ranks. Update all bit operations on `already_copied` - specifically the
check `(already_copied & (1ULL << target_rank))` and the assignment
`already_copied |= 1ULL << target_rank` - to use helper functions or logic that
operates across the 2-word structure based on whether `target_rank < 64` or
`target_rank >= 64`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bed8d775-11ff-4a97-9e7d-1177c057b071

📥 Commits

Reviewing files that changed from the base of the PR and between 09449d4 and 98d59fd.

📒 Files selected for processing (6)
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py
  • tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py

Comment thread cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
Comment thread tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py Outdated
Comment thread tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py
@chienchunhung chienchunhung requested review from dongxuy04 and xxi-nv and removed request for dongxuy04 and xxi-nv June 16, 2026 21:39

Collaborator Author

/bot run --disable-fail-fast

@github-actions


👎 Promotion blocked, new vulnerability found

Vulnerability report

| Component | Vulnerability | Description | Severity |
|---|---|---|---|
| pytorch | CVE-2025-3000 | A vulnerability classified as critical has been found in PyTorch 2.6.0. This affects the function torch.jit.script. The manipulation leads to memory corruption. It is possible to launch the attack on the local host. The exploit has been disclosed to the public and may be used. | MEDIUM |

@chienchunhung chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from 5f7693b to 685ea79 Compare June 16, 2026 22:00

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54680 [ run ] triggered by Bot. Commit: 685ea79 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54680 [ run ] completed with state FAILURE. Commit: 685ea79
/LLM/main/L0_MergeRequest_PR pipeline #43711 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from 685ea79 to eb7cfb3 Compare June 18, 2026 00:38

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54889 [ run ] triggered by Bot. Commit: eb7cfb3 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54889 [ run ] completed with state FAILURE. Commit: eb7cfb3
/LLM/main/L0_MergeRequest_PR pipeline #43894 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from eb7cfb3 to c21304f Compare June 21, 2026 02:53
Eliminates the infinite-spin AlltoAll hang that turns a single GPU failure in a Wide-EP group into a 5-minute HangDetector fire + full restart. The dispatch and combine kernels now take a uint64[2] bitmask of currently-alive EP ranks; dead ranks are skipped on every completion-flag write/wait, peer recv_counter store, EPLB stats write, and per-token routing decision (dead-targeted slots collapse to the same -1 sentinel combine already uses for duplicates).

The mask is optional on both torch ops; omitting it (or passing all-ones) produces bit-identical output to the pre-change kernel. kMaxRanks is bumped 64 -> 128 to cover NVL72 with headroom; kRankMaskWords = 2 names the kernel ABI explicitly.

Tests cover (a) all-ones mask matches no-mask bit-for-bit, and (b) one rank masked dead -> surviving ranks complete dispatch+combine without hang, dead-targeted topk slots dropped, in tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from c21304f to 985d64b Compare June 21, 2026 02:54

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54994 [ run ] triggered by Bot. Commit: 985d64b Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54994 [ run ] completed with state SUCCESS. Commit: 985d64b
/LLM/main/L0_MergeRequest_PR pipeline #43986 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
