[Bugfix] fix quickreduce acc error in cudagraph mode #29508
haoyangli0109 wants to merge 1 commit into
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Code Review
This pull request moves the flag color counter logic from the host to the device using a new d_flag_counters buffer, enabling CUDA-graph replays to correctly advance the flag color inside the kernel. The reviewer suggests extending the leak-prevention logic in destroy() to also free dbuffer and dbuffer_list independently of the initialized flag, as they are also prone to leaking during partial initialization failures.
1. Cause:
Once `flag_color` is fixed by the graph, it remains unchanged in every round. The flag value written in each round therefore repeats and cannot be distinguished from the residual value of the previous round. The waiting party is satisfied prematurely by the old value and is granted access immediately. Because the data for the current round has not yet been fully transmitted at that point, the waiter reads the old data, which produces the accuracy error.
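The failure mode described above can be illustrated with a small host-only simulation. This is a sketch under stated assumptions, not the quickreduce kernel: `PeerSlot`, `wait_passes_early`, and `count_stale_reads_with_frozen_color` are hypothetical names, and the worst-case ordering (the waiter checks the flag before the peer writes) is assumed for every round.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one flag slot shared between a peer and a waiter.
struct PeerSlot {
    uint32_t flag = 0;  // last flag color published by the peer
    int data = 0;       // payload the peer publishes before raising the flag
};

// True if a waiter polling `slot` for `expected_color` is released right now.
bool wait_passes_early(const PeerSlot& slot, uint32_t expected_color) {
    return slot.flag == expected_color;
}

// Replays `rounds` rounds with a color that was frozen at capture time.
// Returns how many rounds release the waiter before the peer has written
// that round's data, i.e. how many rounds would read stale data.
int count_stale_reads_with_frozen_color(int rounds) {
    assert(rounds >= 0);
    PeerSlot slot;
    const uint32_t frozen_color = 1;  // assumed: baked into the graph by value
    int stale = 0;
    for (int r = 0; r < rounds; ++r) {
        // Worst case: the waiter polls before the peer has written anything.
        if (wait_passes_early(slot, frozen_color)) {
            ++stale;  // released by the flag left over from the previous round
        }
        // The peer then publishes this round's data and raises the same flag.
        slot.data = r + 1;
        slot.flag = frozen_color;
    }
    return stale;
}
```

In this model only the first round is safe, because the slot still holds its initial value; every later replay finds the previous round's flag already matching the frozen color.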
2. repro code
3. Solution:
Following `customallreduce`, maintain the `flag_color` for each block behind a pointer, and pass that pointer to the device side so the color is read and advanced there during execution.
4. This change will not affect performance.
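The idea in point 3 can be sketched on the host as follows. This is a simplified model, not the code in this PR: `FlagSlot`, `replay_round`, and `count_stale_reads_with_counter` are hypothetical names, a single counter stands in for the per-block counters, and an ordinary `uint32_t` stands in for device memory. The point it illustrates is that only the pointer is frozen at capture time, so each replay derives a fresh color from the counter it points to.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one flag slot shared between a peer and a waiter.
struct FlagSlot {
    uint32_t flag = 0;  // last color published by the peer
    int data = 0;       // payload published before the flag is raised
};

// Stand-in for one replay of the captured kernel. The pointer argument is
// fixed at capture time, but the value it points to changes between replays.
// Returns true if the waiter would have been released by a leftover flag.
bool replay_round(FlagSlot& slot, uint32_t* flag_counter, int payload) {
    assert(flag_counter != nullptr);
    const uint32_t color = *flag_counter + 1;          // fresh color for this round
    const bool released_early = (slot.flag == color);  // waiter polls before the peer writes
    slot.data = payload;                               // peer publishes the data...
    slot.flag = color;                                 // ...then raises the flag
    *flag_counter = color;                             // advance for the next replay
    return released_early;
}

// Replays `rounds` rounds and counts how many would read stale data.
int count_stale_reads_with_counter(int rounds) {
    assert(rounds >= 0);
    FlagSlot slot;
    uint32_t counter = 0;  // assumed stand-in for a device-resident counter
    int stale = 0;
    for (int r = 0; r < rounds; ++r) {
        if (replay_round(slot, &counter, r + 1)) {
            ++stale;
        }
    }
    return stale;
}
```

Because the leftover flag always carries the previous round's color, it never matches the color expected in the current round, so the waiter in this model is only released after the peer has written.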
CI States
Latest PR Test (Base): ❌ Run #28286525109
Latest PR Test (Extra): ❌ Run #28286525046