[TRTLLM-12200][feat] WideEP FT: add active_rank_mask to NVLink AlltoAll kernels (1a.2) by chienchunhung · Pull Request #13404 · NVIDIA/TensorRT-LLM

chienchunhung · 2026-04-24T03:19:28Z

Summary by CodeRabbit

Release Notes

New Features
- Added active rank mask capability to optimize MoE all-to-all communication by efficiently skipping inactive ranks during dispatch and combine operations.
- Extended maximum supported Expert Parallel (EP) size from 64 to 128 ranks.
- New optional active_rank_mask parameter for MoE dispatch and combine operations, with default behavior unchanged when not specified.

Description

Adds an active_rank_mask parameter (uint64[2] bitmask of currently-alive EP ranks) to the NVLink one-sided MoE AlltoAll dispatch and combine kernels. When the mask is omitted, behavior is bit-identical to before; when a bit is cleared, the kernel skips every interaction with that peer rank — most importantly, the completion-flag spin-wait loops that today hang forever when a peer dies. No new behavior is exposed at the Python wrapper layer in this PR — only the kernel and the torch.ops.trtllm.moe_a2a_* op signatures.

Background

When a single GPU dies in a Wide-EP group, the surviving ranks today spin forever on completion_flags in symmetric memory: the dispatch and combine kernels poll a flag word per peer and have no timeout, no abort, and no fallback. The system limps along until the host-side HangDetector fires after 5 minutes, then a full executor restart adds 2-3 more minutes of warmup — so a single GPU failure costs ~7-8 minutes of full downtime. This PR is the kernel-level half of the fix: the kernels themselves now have a way to know which peers are alive, so the spin loops can skip dead ranks instead of waiting for them. The Python wrapper that actually feeds the mask in (sourced from EPGroupHealth) is a follow-up PR.

What this PR does

The mask is plumbed through every layer that touches the kernels:

Kernel ABI: kRankMaskWords = 2 and uint64_t active_rank_mask[kRankMaskWords] added to MoeA2ADispatchParams, MoeA2ACombineParams, and the kernel-side pointer structs. kMaxRanks is bumped 64 → 128 to cover NVL72 (72 ranks) with headroom.
Dispatch kernel: masked targets are skipped on (a) per-token routing — a token whose top-k expert lives on a dead rank collapses to the same -1 sentinel that combine already uses for duplicates, (b) the recv_counters store loop, (c) the EPLB stats write loop, (d) the completion-flag write loop, and most importantly (e) the completion-flag wait loop where the hang lives.
Combine kernel: masked peers are skipped on the completion-flag write and wait loops. The per-token reduction needs no explicit mask check because dispatch already set topk_send_indices[k] = -1 for dead-targeted slots, and the existing dst_idx < 0 guard handles them.
Torch op: moe_a2a_dispatch and moe_a2a_combine gain an optional Tensor? active_rank_mask=None parameter (CPU uint64[2]). A small resolveActiveRankMask helper validates dtype/device/shape and defaults to all-ones when omitted, so existing call sites (including the unchanged MoeAlltoAll Python wrapper) work bit-identically.
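The validate-and-default step described for the torch op can be illustrated with a small host-side model. This is only a sketch in plain Python, not the C++ `resolveActiveRankMask` helper itself: the function name `resolve_active_rank_mask`, the list-of-ints mask representation, and the exact checks are assumptions made for illustration, while the constants mirror `kMaxRanks` and `kRankMaskWords` from the PR description.

```python
K_MAX_RANKS = 128        # mirrors kMaxRanks after this PR
K_RANK_MASK_WORDS = 2    # mirrors kRankMaskWords
ALL_ONES = (1 << 64) - 1  # one fully-set 64-bit word


def resolve_active_rank_mask(mask, ep_rank, ep_size):
    """Hypothetical resolver: returns the mask as a list of two 64-bit words."""
    if not (0 <= ep_rank < ep_size <= K_MAX_RANKS):
        raise ValueError("ep_rank/ep_size outside the fixed mask capacity")
    if mask is None:
        # Omitted mask: every rank is treated as alive (pre-change behavior).
        resolved = [ALL_ONES] * K_RANK_MASK_WORDS
    else:
        if len(mask) != K_RANK_MASK_WORDS:
            raise ValueError("mask must have exactly two words")
        if any(not (0 <= word <= ALL_ONES) for word in mask):
            raise ValueError("each word must fit in an unsigned 64-bit value")
        resolved = list(mask)
    # The local rank's bit must always be set.
    if not (resolved[ep_rank >> 6] >> (ep_rank & 63)) & 1:
        raise ValueError("local rank is marked dead in active_rank_mask")
    return resolved


default_mask = resolve_active_rank_mask(None, ep_rank=0, ep_size=4)
user_mask = resolve_active_rank_mask([0b1011, 0], ep_rank=3, ep_size=4)
```

Here `default_mask` is two all-ones words and `user_mask` marks rank 2 as dead; passing the same user mask with `ep_rank=2` raises, which models the "local bit must be set" assertion.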

Key design choices:

Single bit-test per peer iteration. is_rank_active(mask, rank) is one word load + one shift + one mask + one branch. The dispatch and combine peer loops run O(ep_size) iterations once per kernel invocation, so the overhead at NVL72 is at most 72 extra branches per launch — well inside the <0.1% steady-state regression gate.
Mask lives in the launch-param struct, not as a separate kernel arg. Keeps the change localised and means launch sites that zero-initialise their params (MoeA2ADispatchParams params{};) get the safe default automatically; resolveActiveRankMask then overwrites with the user mask or all-ones.
Local rank's bit must always be set. Asserted at launch time and at the torch-op boundary; the kernel itself is running on the local rank, so a cleared self-bit indicates an upstream bug rather than a recoverable state.
Dead-target tokens are dropped rather than re-routed. The kernel-layer responsibility is "make AlltoAll survive a rank failure"; choosing where else to send a dead-targeted token is an EPLB-layer concern (separate follow-up PR). Dropping is the same code path as the existing duplicate handling, so combine needs no new logic.
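The single bit-test described above can be modelled in a few lines. This is a Python sketch of the arithmetic only, assuming the two-word layout where rank `r` lives in word `r >> 6` at bit `r & 63`; the helper `clear_rank` is hypothetical and exists just to build example masks.

```python
ALL_ONES = (1 << 64) - 1  # one fully-set 64-bit word


def is_rank_active(mask, rank):
    """One word load, one shift, one mask, one comparison."""
    word = mask[rank >> 6]
    return ((word >> (rank & 63)) & 1) == 1


def clear_rank(mask, rank):
    """Hypothetical helper: return a copy of mask with `rank` marked dead."""
    cleared = list(mask)
    cleared[rank >> 6] &= ~(1 << (rank & 63)) & ALL_ONES
    return cleared


# Rank 71 (the last rank of an NVL72 group) lives in word 1, bit 7.
mask = clear_rank([ALL_ONES, ALL_ONES], 71)
alive = [rank for rank in (7, 63, 64, 70, 71) if is_rank_active(mask, rank)]
```

With rank 71 cleared, `alive` contains 7, 63, 64, and 70, and word 0 of the mask is untouched.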

Test Coverage

tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py — MPI-driven multi-GPU test, follows the same MnnvlMemory.supports_mnnvl() skip pattern as the existing test_moe_a2a.py. Both tests bypass the MoeAlltoAll Python wrapper and call torch.ops.trtllm.moe_a2a_dispatch / moe_a2a_combine directly so they exercise the new C++ op signature without depending on the follow-up wrapper PR.

test_all_active_mask_matches_no_mask — regression guard. Two consecutive dispatch + combine rounds on identical input on ep_size = 4, one with mask = None and one with mask = all-ones. Asserts bit-identical combined output and identical topk_target_ranks workspace state. Parametrized over (local_num_tokens, top_k) ∈ {(16, 2), (32, 4)}.
test_one_rank_masked_completes — parametrized over dead_rank ∈ {0, 2, 3} to cover the lowest-numbered, mid, and highest-numbered ranks (so the bit-mask logic is exercised at both ends of word 0). The "dead" rank participates in workspace init (which has its own MPI barrier), then sits at MPI.COMM_WORLD.barrier(); the surviving ranks call dispatch + combine with the dead rank's bit cleared. Asserts (a) the test reaches its assertions at all (no hang), (b) on every surviving rank, every top-k slot whose expert routed to dead_rank was dropped to -1, (c) all other slots match what the contiguous-partition routing rule predicts, and (d) the combined output has the expected (local_num_tokens, hidden_size) shape.

Both tests need MNNVL hardware (GB200) and ep_size GPUs to actually run; they skip cleanly elsewhere. Run: mpirun -np 4 pytest tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py -v.
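Assertions (b) and (c) of test_one_rank_masked_completes compare the routing table against a prediction. A minimal sketch of how such a prediction might be computed follows. It assumes that "contiguous-partition routing" means expert `e` lives on rank `e // (num_experts // ep_size)` and that a repeated target rank within one token is reported as -1, like the duplicate sentinel mentioned in the description; the function name and argument layout are hypothetical and not taken from the test module.

```python
def expected_target_ranks(topk_experts, num_experts, ep_size, dead_rank=None):
    """Predict per-slot target ranks; -1 marks dropped or duplicate slots."""
    if num_experts % ep_size != 0:
        raise ValueError("sketch assumes experts divide evenly across ranks")
    experts_per_rank = num_experts // ep_size
    expected = []
    for token_experts in topk_experts:
        seen = set()
        row = []
        for expert in token_experts:
            rank = expert // experts_per_rank
            if rank == dead_rank or rank in seen:
                row.append(-1)      # dead target or already-sent duplicate
            else:
                seen.add(rank)
                row.append(rank)
        expected.append(row)
    return expected


# 8 experts over 4 ranks (2 experts per rank), rank 2 masked dead.
table = expected_target_ranks([[0, 5], [2, 3], [6, 7]], 8, 4, dead_rank=2)
```

For this input the first token keeps rank 0 and drops the slot aimed at rank 2, while the other two tokens keep one slot each and collapse the duplicate.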

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Cross-link NVIDIA#13404 (the NVLink AlltoAll kernel-mask implementation) from the 1a.2 row of the implementation plan, mirroring the 1a.1 link added in 92af527.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor

chienchunhung · 2026-04-24T19:41:04Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-24T19:50:30Z

PR_Github #45437 [ run ] triggered by Bot. Commit: f298ca6 Link to invocation

tensorrt-cicd · 2026-04-24T20:44:39Z

PR_Github #45437 [ run ] completed with state FAILURE. Commit: f298ca6
/LLM/main/L0_MergeRequest_PR pipeline #35669 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Fifth batch.

§8 has Phase 1 PR breakdown (1a kernel, 1b EPLB, 1c detection, 1d integration — 13 MVP PRs + 12 v1), Phase 1-DS (disagg), Phase 2 (3 sub-tracks + MNNVL audit as explicit prereq PR 2a.0), and Phase 3 work-track rough plan. Critical-path Gantt shows three MVP gating items: 1a.2 kernel (in flight as PR NVIDIA#13404), 1c.3 MPI FT subcomm (net-new L), 1d.4 fault-injection harness (net-new L). Timeline summary: MVP 6-7 weeks, full program 7-10 months (AI-assisted), with honest caveats about L-sized risks. Added PR 1d.0 (MPI signal handler replacement) to the MVP list — was implicit in the design but needed to be called out as named work.

§9 names two audits as gating risks: MNNVL/NVSHMEM teardown capability (Phase 2 prereq, 1-2 week prototype with concrete scope) and Ray-path WideEP perf characterization (future-migration prereq, gated on Ray-path CI coverage existing at EP≥32 first).

§9.2 has 14 technical risks with Severity × Probability × Residual per row (residual column is the newly added column per earlier reviewer feedback).

§9.3 has 8 open questions including the Q8 framework for when to revisit the Ray pivot (three conditions all must hold).

§9.4 summary matrix with all risks in one place; bolded rows are the active-tracking ones during MVP.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

…move v1 files

Final batch of the v2 rewrite.

§0 executive summary — problem statement, approach, key MPI-vs-Ray decision, two failure modes named, the four TRT-LLM uniqueness properties, headline timeline numbers, and an explicit "what v2 changes vs v1" list for readers coming from the prior version.

README — complete rewrite as navigation for v2. Section table with one-line summaries per section. In-flight PR table (NVIDIA#13302 and NVIDIA#13404 against this design). Consolidated terminology table including the rank/process/slot distinction that a reviewer flagged as missing. Scope & non-goals stated once, not repeated throughout.

Removing v1 files:
- 01-background.md → content folded into 01-user-journey-and-stack.md (§1.3) and 00-executive-summary.md
- 02-current-state.md → folded into 01-user-journey-and-stack.md (§1.2) and 03-failure-modes-and-gaps.md
- 03-competitive-landscape.md → folded into 02-stack-comparison-and-positioning.md
- 04-two-phase-recovery.md → superseded by 04-architecture-overview.md (now three-phase)
- 05-rank-masking.md → folded into 05-phase-1-immediate-survival.md §5.1
- 06-eplb-adaptation.md → folded into 05-phase-1-immediate-survival.md §5.2
- 07-failure-detection.md → folded into 05-phase-1-immediate-survival.md §5.3 and §5.4
- 08-mx-gms-integration.md → split between 05-phase-1 (PR NVIDIA#12718 integration in §5.3) and 06-phase-2 (MX-GMS in §6.3)
- 09-implementation-plan.md → superseded by 08-implementation-plan.md
- 10-risks.md → superseded by 09-risks-and-open-questions.md
- COMBINED.md → superseded; single-file view can be regenerated from the split files if needed

New v2 file set: README + §0-§9 (11 files) + 3 workflow artifacts (redesign-outline, redesign-research-pass, redesign-research-pass-report). Section count held at 10 numbered sections (§0-§9) but with cleaner phase boundaries — one section per phase.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

…iggers

Adds a "Status: paused (2026-05-19)" section to mvp-prototype-findings.md explaining that the prototype's primary mandate is empirically discharged (F1-F5 + OQ2 + OQ4 closed) and the remaining four pending items all hit diminishing returns vs. the production PRs they unblock.

Documents five concrete resumption triggers with the action required for each:
* PR NVIDIA#13404 (1a.2 kernel mask) lands -> seam-stressing + OQ1
* PR 1a.4 (production AlltoAllWatchdog) lands -> reproduce F3/F4/F5 under real MNNVL fabric memory
* PR 1c.3 (MPI FT subcomm) lands -> swap stubs/mpi_ft_subcomm.py and tighten F2 mitigation against world_is_poisoned()
* NVL72 access -> false-positive floor + scale validation (Audit 1b)
* PR 1d.4 starts -> hand off driver + timeline JSONs as regression baseline

Also documents the mechanical steps to resume so future-anyone (including future-me) doesn't have to re-derive how the worktree, branch, and cherry-pick chain hang together.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-06-15T23:28:27Z

/bot run

tensorrt-cicd · 2026-06-15T23:36:02Z

PR_Github #54374 [ run ] triggered by Bot. Commit: 1aaf884 Link to invocation

tensorrt-cicd · 2026-06-15T23:55:59Z

PR_Github #54374 [ run ] completed with state FAILURE. Commit: 1aaf884
/LLM/main/L0_MergeRequest_PR pipeline #43444 completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

chienchunhung · 2026-06-16T00:07:43Z

/bot run

tensorrt-cicd · 2026-06-16T00:14:04Z

PR_Github #54382 [ run ] triggered by Bot. Commit: 72eb5dd Link to invocation

tensorrt-cicd · 2026-06-16T04:03:39Z

PR_Github #54382 [ run ] completed with state FAILURE. Commit: 72eb5dd
/LLM/main/L0_MergeRequest_PR pipeline #43451 completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

chienchunhung · 2026-06-16T05:44:29Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T05:51:00Z

PR_Github #54485 [ run ] triggered by Bot. Commit: 98d59fd Link to invocation

tensorrt-cicd · 2026-06-16T12:54:49Z

PR_Github #54485 [ run ] completed with state SUCCESS. Commit: 98d59fd
/LLM/main/L0_MergeRequest_PR pipeline #43548 completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

chienchunhung · 2026-06-16T16:49:37Z

/bot run --disable-fail-fast --stage-list "DGX_B200-PyTorch-2,DGX_B200-4_GPUs-PyTorch-Ray-1"

chienchunhung · 2026-06-16T16:53:47Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T16:57:14Z

PR_Github #54640 [ run ] triggered by Bot. Commit: 98d59fd Link to invocation

tensorrt-cicd · 2026-06-16T16:59:29Z

PR_Github #54641 [ run ] triggered by Bot. Commit: 98d59fd Link to invocation

tensorrt-cicd · 2026-06-16T16:59:35Z

PR_Github #54640 [ run ] completed with state ABORTED. Commit: 98d59fd

Link to invocation

tensorrt-cicd · 2026-06-16T18:42:10Z

PR_Github #54641 [ run ] completed with state SUCCESS. Commit: 98d59fd
/LLM/main/L0_MergeRequest_PR pipeline #43672 completed with status: 'SUCCESS'

CI Report

Link to invocation

coderabbitai · 2026-06-16T20:53:45Z

📝 Walkthrough

Walkthrough

The PR introduces an active_rank_mask bitmask mechanism to MoE all-to-all dispatch and combine CUDA kernels, enabling inactive ("dead") EP ranks to be skipped during routing, counter writes, flag signaling, and peer spin-wait loops. kMaxRanks is increased from 64 to 128. The mask is threaded through the PyTorch op boundary via new optional parameters in C++ ops and Python fake implementations, with a new test module validating correctness under masked and unmasked conditions across multiple GPUs.

Changes

MoE A2A Active-Rank Mask

| Layer / File(s) | Summary |
|---|---|
| Header constants and struct fields `cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h` | Increases `kMaxRanks` to 128, adds `kRankMaskWords` with a `static_assert`, and adds `active_rank_mask[kRankMaskWords]` to `DispatchKernelPointers`, `CombineKernelPointers`, `MoeA2ADispatchParams`, and `MoeA2ACombineParams`. |
| Dispatch kernel rank-mask enforcement `cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu` (lines 213–587, 635–681) | Adds `is_rank_active` device helper; marks inactive target-rank destinations dead in routing; guards `recv_counters`/EPLB writes, completion-flag release-stores, and peer wait loops to active ranks only; adds host-side ep_rank validation and mask copy into `DispatchKernelPointers`. |
| Combine kernel rank-mask enforcement `cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu` (lines 1193–1363) | Restricts combine-kernel completion-flag release-stores and readiness wait/spin loops to active peers; adds host-side ep_rank validation and mask copy into `CombineKernelPointers`. |
| PyTorch op wiring and Python-side constants `cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp`, `tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py`, `tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py` | Adds `resolveActiveRankMask` C++ helper; updates dispatch/combine op signatures and PyTorch schemas with `Tensor? active_rank_mask=None`; updates fake op signatures; increases `NVLinkOneSided.MAX_RANKS` to 128. |
| Multi-GPU rank-mask tests `tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py` | New test module with mask construction, payload generation, routing-table readback helpers, `_run_dispatch_combine` harness, and two parameterized pytest tests (all-active-mask-matches-no-mask and one-rank-masked-completes) requiring MNNVL hardware. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Python/C++ caller
    participant Host as moeA2ADispatchOp / moeA2ACombineOp
    participant Resolver as resolveActiveRankMask
    participant DispatchKernel as moeA2ADispatchKernel (GPU)
    participant CombineKernel as moeA2ACombineKernel (GPU)

    Caller->>Host: dispatch(inputs, active_rank_mask=T)
    Host->>Resolver: validate dtype/shape/ep_rank bit
    Resolver-->>Host: params.active_rank_mask[]
    Host->>DispatchKernel: launch(DispatchKernelPointers{active_rank_mask})
    DispatchKernel->>DispatchKernel: is_rank_active(target) → skip dead-rank routing
    DispatchKernel->>DispatchKernel: skip recv_counters/EPLB for dead ranks
    DispatchKernel->>DispatchKernel: release-store completion flags → active peers only
    DispatchKernel->>DispatchKernel: wait/spin → active peers only
    DispatchKernel-->>Host: routing table (dead slots = -1)

    Caller->>Host: combine(inputs, active_rank_mask=T)
    Host->>Resolver: validate dtype/shape/ep_rank bit
    Resolver-->>Host: params.active_rank_mask[]
    Host->>CombineKernel: launch(CombineKernelPointers{active_rank_mask})
    CombineKernel->>CombineKernel: release-store readiness flags → active peers only
    CombineKernel->>CombineKernel: wait/spin → active peers only
    CombineKernel-->>Host: combined output (dead-targeted slots dropped)
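The "wait/spin → active peers only" steps in the diagram can be modelled on the host to show why a cleared bit removes the hang. The sketch below is plain Python rather than the CUDA kernel: `wait_for_peers`, the `flags` list, and especially `max_polls` are assumptions for illustration. The real kernels have no poll limit; the bound exists here only so the model terminates.

```python
ALL_ONES = (1 << 64) - 1  # one fully-set 64-bit word


def is_rank_active(mask, rank):
    """Bit test on a two-word mask: word = rank >> 6, bit = rank & 63."""
    return ((mask[rank >> 6] >> (rank & 63)) & 1) == 1


def wait_for_peers(flags, expected, mask, ep_size, max_polls=1000):
    """Poll each active peer's completion flag; skip peers marked dead."""
    waited_on = []
    for peer in range(ep_size):
        if not is_rank_active(mask, peer):
            continue  # dead peer: its flag will never be written
        polls = 0
        while flags[peer] != expected:
            polls += 1
            if polls >= max_polls:
                # Stand-in for the real kernel spinning forever.
                raise TimeoutError(f"peer {peer} never signalled")
        waited_on.append(peer)
    return waited_on


# Rank 2 died before writing its flag; the other three ranks signalled.
flags = [1, 1, 0, 1]
mask_rank2_dead = [ALL_ONES & ~(1 << 2), ALL_ONES]
survivors = wait_for_peers(flags, expected=1, mask=mask_rank2_dead, ep_size=4)
```

With rank 2's bit cleared the loop returns after checking ranks 0, 1, and 3. Calling it with an all-ones mask on the same `flags` raises `TimeoutError`, the model's stand-in for the infinite spin.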

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

leslie-fang25
hyukn
bobboli
liji-nv
chang-l
xxi-nv

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 59.26% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding an active_rank_mask parameter to NVLink AlltoAll kernels as part of the WideEP FT feature set (phase 1a.2). |
| Description check | ✅ Passed | The PR description comprehensively covers the background, implementation details, design choices, test coverage, and includes a completed checklist. All required sections are present and well-documented. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)

427-473: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Undefined behavior when target_rank >= 64.

The already_copied bitmask is a single uint64_t, but kMaxRanks is now 128. When target_rank >= 64 (possible with ep_size > 64, e.g., NVL72 with 72 ranks), the expression 1ULL << target_rank is undefined behavior per the C++ standard (shifting by ≥ bit-width).

This breaks duplicate detection for deployments exceeding 64 EP ranks.

🐛 Proposed fix: use a 2-word bitmask like active_rank_mask

-        uint64_t already_copied = 0;
+        uint64_t already_copied[kRankMaskWords] = {0, 0};
         // ... existing code ...
         for (int k = 0; k < TOP_K; k++)
         {
             // ... existing code ...
-            if ((already_copied & (1ULL << target_rank)) || target_dead)
+            int const word_idx = target_rank >> 6;
+            uint64_t const bit_mask = 1ULL << (target_rank & 63);
+            if ((already_copied[word_idx] & bit_mask) || target_dead)
             {
                 // ... existing skip logic ...
                 continue;
             }
             // ... existing send logic ...
-            already_copied |= 1ULL << target_rank;
+            already_copied[word_idx] |= bit_mask;
         }
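The word/bit split in the proposed fix can be checked in isolation. The following Python model mirrors the `word_idx` / `bit_mask` arithmetic from the diff above; the function `route_token` and its list-based inputs are hypothetical, and Python integers do not reproduce the C++ undefined behavior, so this only demonstrates that ranks 64 and above get their own bits in the second word.

```python
ALL_ONES = (1 << 64) - 1  # one fully-set 64-bit word


def route_token(target_ranks, active_mask):
    """Per-token top-k loop: -1 for duplicate or dead targets, else the rank."""
    already_copied = [0, 0]  # two words, same layout as active_rank_mask
    slots = []
    for target_rank in target_ranks:
        word_idx = target_rank >> 6
        bit_mask = 1 << (target_rank & 63)
        target_dead = (active_mask[word_idx] & bit_mask) == 0
        if (already_copied[word_idx] & bit_mask) or target_dead:
            slots.append(-1)
            continue
        already_copied[word_idx] |= bit_mask
        slots.append(target_rank)
    return slots


all_alive = [ALL_ONES, ALL_ONES]
rank71_dead = [ALL_ONES, ALL_ONES & ~(1 << (71 & 63))]

# Ranks 7 and 71 share bit position 7 but live in different words.
slots_alive = route_token([7, 71, 7, 71], all_alive)
slots_dead = route_token([7, 71, 7, 71], rank71_dead)
```

With every rank alive, the first occurrences of 7 and 71 are both sent and only the repeats collapse to -1; with rank 71 masked, every slot aimed at it is dropped while rank 7 is unaffected.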

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu` around
lines 427 - 473, The `already_copied` bitmask variable uses a single `uint64_t`
which can only represent 64 ranks, but `target_rank` can be up to 127 (since
`kMaxRanks` is 128), causing undefined behavior when shifting by values >= 64.
Replace the single `uint64_t already_copied` with a 2-word bitmask structure
(similar to how `active_rank_mask` is implemented) to support up to 128 ranks.
Update all bit operations on `already_copied` - specifically the check
`(already_copied & (1ULL << target_rank))` and the assignment `already_copied |=
1ULL << target_rank` - to use helper functions or logic that operates across the
2-word structure based on whether `target_rank < 64` or `target_rank >= 64`.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp`:
- Around line 51-77: Add a validation check to ensure that `epRank` is within
the bounds of the fixed-width rank mask and parameter arrays before any
rank-indexed array access occurs. The current validation `epRank < epSize` is
insufficient because if `epSize` exceeds the fixed array capacity (likely
`kMaxRanks`), the function `resolveActiveRankMask()` can index past the bounds
of the `out` array when accessing `out[epRank >> 6]`. Insert a host-side bounds
check that validates `epRank < kMaxRanks` (or the appropriate fixed array
capacity constant) before calling functions that access rank-indexed arrays with
rank values derived from `epRank`.

In `@tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py`:
- Around line 283-287: Replace the broad `except Exception` clause in both
locations with specific exception types that indicate unsupported hardware. In
the try block where MnnvlMemory.initialize() is called and
MnnvlMemory.supports_mnnvl() is checked (at lines 283-287 and also at lines
332-336), catch only the specific exceptions that indicate the system does not
support MNNVL hardware, rather than all exceptions. This allows unexpected
regressions in these method calls to propagate as test failures instead of being
silently skipped.
- Around line 299-304: Add `strict=True` parameter to the `zip()` call in the
mpi_pool_executor.map() function to enforce that the iterables passed to zip are
of equal length, making the invariant explicit as required by Ruff linting.
Apply the same fix to the other zip() call at lines 358-362.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bed8d775-11ff-4a97-9e7d-1177c057b071

📥 Commits

Reviewing files that changed from the base of the PR and between 09449d4 and 98d59fd.

📒 Files selected for processing (6)

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py
tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py

chienchunhung · 2026-06-16T21:54:51Z

/bot run --disable-fail-fast

github-actions · 2026-06-16T21:59:45Z

👎 Promotion blocked, new vulnerability found

Vulnerability report

| Component | Vulnerability | Description | Severity |
|---|---|---|---|
| pytorch | CVE-2025-3000 | A vulnerability classified as critical has been found in PyTorch 2.6.0. This affects the function torch.jit.script. The manipulation leads to memory corruption. It is possible to launch the attack on the local host. The exploit has been disclosed to the public and may be used. | MEDIUM |

chienchunhung · 2026-06-16T22:01:02Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T22:08:50Z

PR_Github #54680 [ run ] triggered by Bot. Commit: 685ea79 Link to invocation

tensorrt-cicd · 2026-06-17T06:47:25Z

PR_Github #54680 [ run ] completed with state FAILURE. Commit: 685ea79
/LLM/main/L0_MergeRequest_PR pipeline #43711 completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

chienchunhung · 2026-06-18T00:38:25Z

/bot run

tensorrt-cicd · 2026-06-18T00:43:49Z

PR_Github #54889 [ run ] triggered by Bot. Commit: eb7cfb3 Link to invocation

tensorrt-cicd · 2026-06-18T03:09:27Z

PR_Github #54889 [ run ] completed with state FAILURE. Commit: eb7cfb3
/LLM/main/L0_MergeRequest_PR pipeline #43894 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Eliminates the infinite-spin AlltoAll hang that turns a single GPU failure in a Wide-EP group into a 5-minute HangDetector fire + full restart.

The dispatch and combine kernels now take a uint64[2] bitmask of currently-alive EP ranks; dead ranks are skipped on every completion-flag write/wait, peer recv_counter store, EPLB stats write, and per-token routing decision (dead-targeted slots collapse to the same -1 sentinel combine already uses for duplicates).

The mask is optional on both torch ops; omitting it (or passing all-ones) produces bit-identical output to the pre-change kernel.

kMaxRanks is bumped 64 -> 128 to cover NVL72 with headroom; kRankMaskWords = 2 names the kernel ABI explicitly.

Tests cover (a) all-ones mask matches no-mask bit-for-bit, and (b) one rank masked dead -> surviving ranks complete dispatch+combine without hang, dead-targeted topk slots dropped, in tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-06-21T02:54:38Z

/bot run

tensorrt-cicd · 2026-06-21T03:00:51Z

PR_Github #54994 [ run ] triggered by Bot. Commit: 985d64b Link to invocation

tensorrt-cicd · 2026-06-21T03:47:20Z

PR_Github #54994 [ run ] completed with state SUCCESS. Commit: 985d64b
/LLM/main/L0_MergeRequest_PR pipeline #43986 completed with status: 'FAILURE'

CI Report

Link to invocation

github-actions Bot assigned chienchunhung Apr 24, 2026

chienchunhung mentioned this pull request May 19, 2026

(DO NOT SUBMIT) WideEP FT MVP prorotype #14198

Draft

1 task

chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from f298ca6 to 1aaf884 Compare June 15, 2026 23:28

chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from 1aaf884 to 72eb5dd Compare June 16, 2026 00:07

chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from 72eb5dd to 98d59fd Compare June 16, 2026 05:44

chienchunhung requested review from dongxuy04, pcastonguay and xxi-nv June 16, 2026 20:37

chienchunhung marked this pull request as ready for review June 16, 2026 20:37

chienchunhung requested review from a team as code owners June 16, 2026 20:37

chienchunhung requested a review from yizhang-nv June 16, 2026 20:37

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp

Comment thread tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py Outdated

Comment thread tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py

chienchunhung requested review from dongxuy04 and xxi-nv and removed request for dongxuy04 and xxi-nv June 16, 2026 21:39

chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from 5f7693b to 685ea79 Compare June 16, 2026 22:00

chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from 685ea79 to eb7cfb3 Compare June 18, 2026 00:38

chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from eb7cfb3 to c21304f Compare June 21, 2026 02:53

chienchunhung added 2 commits June 20, 2026 22:53

Address active rank mask review comments

985d64b

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung force-pushed the WideEP-FT/1a.2-nvlink-kernel-mask branch from c21304f to 985d64b Compare June 21, 2026 02:54
