
Fix bug in EP sync for Mamba models#3785

Draft
santhnm2 wants to merge 2 commits into NVIDIA:main from santhnm2:ep_sync_fix

Conversation

@santhnm2
Contributor

What does this PR do ?

EP dummy ranks for Mamba hybrid models could have real_prefill_count == 0 but padded_prefill_count > 0 after the EP all-reduce MAX synchronization in match_graph_config. This caused cu_seqlens in MambaMetadata.update to be all-zeros (since there were no real prefill tokens to compute cumulative lengths from), triggering an illegal memory access in the Mamba SSM prefill kernel. This PR defers the metadata construction in dummy ranks until after the EP synchronization to avoid this issue.
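A minimal sketch of the failure mode in plain Python (illustrative only; the real computation runs on PyTorch tensors inside MambaMetadata.update):

```python
from itertools import accumulate

def cu_seqlens(prefill_lengths):
    # Cumulative sequence lengths over prefill requests: a simplified
    # stand-in for what MambaMetadata.update derives from real prefill
    # tokens (the actual code operates on torch tensors).
    return [0] + list(accumulate(prefill_lengths))

# On a dummy EP rank after the all-reduce MAX sync, padded_prefill_count > 0
# but there are no real prefill tokens, so every per-request length is 0
# and the resulting cu_seqlens is all zeros -> invalid offsets for the
# Mamba SSM prefill kernel.
padded_lengths = [0, 0, 0]
print(cu_seqlens(padded_lengths))  # [0, 0, 0, 0]
```

Deferring metadata construction until after the EP synchronization means cu_seqlens is only ever built from real prefill lengths.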

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see the typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or tag @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

if remainder > 0:
    self.request_query_lengths[N_total - 1] += remainder

self.request_kv_length_offsets[0:N_total].fill_(0)
Contributor

Can we use = here? We have seen weird issues with fill_ at times, wherein it creates a duplicate tensor.
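The suggested change, sketched with a plain Python list as a stand-in (on a torch tensor the equivalent would be `self.request_kv_length_offsets[0:N_total] = 0`; names are illustrative):

```python
def zero_prefix(offsets, n_total):
    # Zero the first n_total entries via slice assignment, the
    # alternative suggested above to tensor.fill_ (plain-list sketch;
    # a torch tensor accepts a scalar on the right-hand side).
    offsets[0:n_total] = [0] * n_total
    return offsets
```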

prefill_tokens = T - N_decode
tokens_per_prefill = prefill_tokens // N_prefill
remainder = prefill_tokens % N_prefill
self.request_query_lengths[N_decode:N_total].fill_(tokens_per_prefill)
Contributor

same here, would advise using = instead of fill_

token_offset = N_decode
for i in range(N_prefill):
    qlen = tokens_per_prefill + (remainder if i == N_prefill - 1 else 0)
    self.token_to_request_idx[token_offset : token_offset + qlen] = N_decode + i
    token_offset += qlen
Contributor

can we compute all of the values to write on the CPU first, and then do a single write to GPU?
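One way to realize this suggestion, sketched in plain Python (the single host-to-device copy itself is omitted; names mirror the loop above and are illustrative):

```python
def build_token_to_request_idx(n_decode, n_prefill, tokens_per_prefill, remainder):
    # Precompute every token's request index on the host. The finished
    # list can then be written to the device tensor in one copy, instead
    # of issuing one GPU write per prefill request inside the loop.
    idx = []
    for i in range(n_prefill):
        qlen = tokens_per_prefill + (remainder if i == n_prefill - 1 else 0)
        idx.extend([n_decode + i] * qlen)
    return idx

# e.g. 2 decode requests, 2 prefill requests of 3 tokens each plus a
# remainder of 1 token assigned to the last request:
print(build_token_to_request_idx(2, 2, 3, 1))  # [2, 2, 2, 3, 3, 3, 3]
```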

