[Feature] Support CUDA Graph under mixed mode DeepEP communication #7345
lizexu123 wants to merge 2 commits into PaddlePaddle:develop
Conversation
📋 Review Summary

PR overview: fixes a CUDA Graph compatibility issue under mixed-mode DeepEP communication by using a dedicated capture stream to avoid a dependency conflict with the legacy stream.

Scope of change / impact Tag: 📝 PR convention check: the PR description fills in Motivation and Modifications, and the title includes a tag.

Overall assessment: the core fix (a dedicated capture stream plus cleanup of the DeepEP buffer) is well designed and consistent with SGLang's approach. However, the refactor missed a method-signature update and an attribute cleanup, leaving two runtime errors. In addition, the local test script should not be committed to the repository.

Detailed issues

🔴 Bug 1:
Motivation
Modifications
Error log:

DeepEP/csrc/kernels/internode_ll.cu:553 operation would make the legacy stream depend on a capturing blocking stream

Root cause:
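For background on this error: CUDA stream capture records work on a non-legacy stream, and the driver rejects any operation that would make the legacy (default) stream wait on the capturing stream. The sketch below illustrates the constraint with PyTorch's graph-capture API; it is a hypothetical helper for illustration, not code from this PR, and running it requires a CUDA-enabled PyTorch build.

```python
def capture_square(x):
    """Capture y = x * x into a CUDA graph.

    torch.cuda.graph() switches to its own capture stream, so every kernel
    recorded inside the block must run on that stream; if any op fell back
    to the legacy default stream, capture would abort with an error like
    the one quoted above. (Hypothetical helper, illustration only.)
    """
    import torch  # lazy import: assumes a CUDA build of PyTorch is installed

    g = torch.cuda.CUDAGraph()
    y = torch.empty_like(x)
    with torch.cuda.graph(g):
        # Recorded on the capture stream, not the legacy stream.
        torch.mul(x, x, out=y)
    return g, y  # later, g.replay() re-runs the captured multiply into y
```

DeepEP's kernels, by contrast, launch on whatever `at::cuda::getCurrentCUDAStream()` returns, which is where the legacy-stream conflict comes from.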
Changes in this fix:
This could have been implemented very simply, e.g. as in sglang/python/sglang/srt/distributed/parallel_state.py:483-510:
with torch.cuda.stream(stream):
    # PyTorch's current stream → stream ✓
    # c10's TLS → stream ✓
    # DeepEP's call to at::cuda::getCurrentCUDAStream() → stream ✓
However, Paddle's paddle.device.stream_guard() only updates Paddle's own GPUContext; it does not update c10's TLS:
That is why we have to manually call c10::cuda::setCurrentCUDAStream() via ctypes to bridge this gap.
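To picture the gap being bridged: the sketch below (a hypothetical helper with made-up names, not the ctypes code in this PR) hands a raw Paddle stream pointer to PyTorch via torch.cuda.ExternalStream, so that c10's TLS, and therefore at::cuda::getCurrentCUDAStream() inside DeepEP, see the same stream.

```python
def run_on_external_stream(raw_stream_ptr, fn):
    """Run fn with both PyTorch's current stream and c10's TLS pointing at
    the given raw cudaStream_t (e.g. obtained from a Paddle stream).

    Hypothetical helper: the PR instead reaches c10::cuda::setCurrentCUDAStream()
    through ctypes, but the visible effect on getCurrentCUDAStream() is the same.
    """
    import torch  # lazy import: assumes a CUDA build of PyTorch is installed

    # Wrap the externally owned stream so PyTorch can schedule onto it.
    ext = torch.cuda.ExternalStream(raw_stream_ptr)
    with torch.cuda.stream(ext):  # updates c10's TLS, not just torch's view
        fn()
```

Because Paddle's stream_guard never touches c10's TLS, some bridge of this kind is needed whenever a torch-extension library like DeepEP must launch on a Paddle-managed stream.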
Thanks to the intern from the escort program: PaddlePaddle/Paddle#78652 resolved this issue.
Usage or Command
Accuracy Tests
Checklist
- Tag options: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If merging into the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.