[https://nvbugs/6114821][fix] Remove torch.compile from spec dec sampling to prevent NCCL deadlock#13552
tensorrt-cicd wants to merge 2 commits into NVIDIA:main
Conversation
…prevent NCCL deadlock

With non-greedy sampling (temperature > 0) in one-model speculative decoding with TP > 1, torch.compile on the sampling function causes different compiled code on different ranks (each rank compiles in a separate process). This produces different sampling results across ranks, which diverges the acceptance counts. Since acceptance counts determine the batch shape of subsequent draft-model forward passes containing NCCL collectives, divergent tokens cause an NCCL deadlock.

Fix: remove torch.compile from sampling_batch_spec_dec_one_model so all ranks execute identical eager-mode code. Also remove the waiver for the affected test.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
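The removed decorator and the sampler it wrapped can be sketched roughly as follows. Only the decorator and the function name come from the PR; the function body and signature here are hypothetical stand-ins for illustration:

```python
import torch

# Sketch of the change: the decorator below was removed so that all TP
# ranks execute the same eager-mode sampling code. While it was present,
# each rank's process compiled (and autotuned) the function independently,
# so ranks could end up running different Triton kernels and producing
# different samples from the same inputs.
#
# @torch.compile(options={"max-autotune": True})   # <-- removed by this PR
def sampling_batch_spec_dec_one_model(logits: torch.Tensor,
                                      temperature: float = 1.0) -> torch.Tensor:
    # Hypothetical body for illustration: plain temperature sampling.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

In eager mode every rank runs the same bytecode, so identical inputs and RNG state yield identical samples on every rank, keeping acceptance counts (and therefore collective shapes) in sync.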
ziyixiong-nv
left a comment
The fix and explanation look reasonable to me. + @mikeiovine in case you have any concerns.
/bot run
Signed-off-by: Ziyi Xiong <219238287+ziyixiong-nv@users.noreply.github.com>
/bot run

PR_Github #45912 [ run ] triggered by Bot. Commit:
Summary
Problem: with TP > 1, each rank runs torch.compile independently on the speculative decoding sampling function, which can select different Triton kernel implementations across ranks. This non-determinism causes divergent sampling outputs, leading to mismatched draft token acceptance counts and therefore mismatched batch shapes for subsequent NCCL collectives, resulting in a deadlock.

Fix: removed the @torch.compile(options={"max-autotune": True}) decorator from sampling_batch_spec_dec_one_model in one_model_sampler.py and added a comment explaining why compilation must be avoided in this code path. The corresponding test waiver for the Eagle3 4-GPU accuracy test was also removed since the fix resolves the underlying failure.

Test plan
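The failure mode described above can be illustrated without GPUs or NCCL: once per-rank sampling streams diverge, the number of accepted draft tokens diverges, and the tensor shapes each rank would contribute to the next collective no longer match, so a real all-gather would hang rather than raise an error. A toy pure-Python simulation (all names hypothetical; distinct seeds stand in for the distinct per-rank compiled kernels):

```python
import random

def simulate_acceptance(rank_seed: int, num_draft_tokens: int = 4) -> int:
    """Toy stand-in for draft-token acceptance: each 'rank' accepts
    tokens until the first rejection. Divergent RNG streams model the
    divergent sampling produced by per-rank torch.compile variants."""
    rng = random.Random(rank_seed)
    accepted = 0
    for _ in range(num_draft_tokens):
        if rng.random() < 0.5:  # token accepted
            accepted += 1
        else:
            break  # first rejection ends acceptance
    return accepted

# Two TP "ranks" whose sampling streams have diverged:
counts = [simulate_acceptance(seed) for seed in (0, 1)]
# If the counts differ, the subsequent draft-model forward pass (which
# contains NCCL collectives) is launched with mismatched batch shapes
# on each rank -- a hang, not an error.
shapes_match = len(set(counts)) == 1
```

With identical eager-mode sampling code (the fix), every rank would draw from the same stream and `counts` would agree, keeping the collectives shape-consistent.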