[Fix] Defer req_to_token pool-index free in overlap scheduling to prevent cross-stream data race#18803
nvcastet wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request addresses a critical data race condition in MTPv2 overlap scheduling. The race occurs when a finished request's `req_pool_idx` is freed and immediately recycled while the forward stream is still reading its `req_to_token` row.
Code Review
This pull request effectively addresses a data race in overlap scheduling by deferring the freeing of req_to_token pool indices. The implementation is sound, introducing deferred_free and flush_deferred_frees methods to ReqToTokenPool and integrating them into the existing logic with the enable_overlap flag. The synchronization point for flushing deferred frees is well-chosen, ensuring correctness. I have one suggestion to refactor a small piece of code to improve maintainability.
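For concreteness, here is a minimal sketch of what such a deferred-free API can look like. The method names `deferred_free` and `flush_deferred_frees` are the ones this PR adds; the pool internals below (a plain free-slot list) are simplified assumptions, not sglang's actual implementation:

```python
from typing import List


class ReqToTokenPool:
    """Toy request-to-token pool with deferred slot reclamation."""

    def __init__(self, size: int):
        self.free_slots: List[int] = list(range(size))
        self._deferred_free_slots: List[int] = []

    def alloc(self) -> int:
        # Hand out a request pool index; the caller owns its req_to_token row.
        return self.free_slots.pop()

    def free(self, idx: int) -> None:
        # Immediate free: the very next alloc() may recycle this index.
        self.free_slots.append(idx)

    def deferred_free(self, idx: int) -> None:
        # Park the index instead of freeing it, so a forward pass still
        # running on another stream can keep reading the row safely.
        self._deferred_free_slots.append(idx)

    def flush_deferred_frees(self) -> None:
        # Only call this after synchronizing with the forward stream,
        # i.e. once no in-flight kernel can still read the parked rows.
        self.free_slots.extend(self._deferred_free_slots)
        self._deferred_free_slots.clear()
```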
/tag-and-rerun-ci
Force-pushed from 757d14e to c4a0075.
[Fix] Defer req_to_token pool-index free in overlap scheduling to prevent cross-stream data race

In overlap scheduling (MTPv2), `process_batch_result(N-1)` runs on the default stream concurrently with `forward(N)` on the forward stream. When a request finishes, `release_kv_cache` immediately returns its `req_pool_idx` to the free list. A new request can then recycle that pool index, and `prepare_for_decode` overwrites the `req_to_token` row on the default stream while `forward(N)` still reads it, causing an "index out of bounds" assertion in IndexKernel.cu.

Fix: defer the pool-index free by one overlap iteration.

Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Force-pushed from c4a0075 to 9a44855.
/tag-and-rerun-ci
This fix just hides the bug by changing the timing of the race condition.
Motivation
Fixes #18744
With MTPv2 overlap scheduling enabled (`SGLANG_ENABLE_SPEC_V2=1`), we see intermittent device-side assertion failures.

Root cause: In overlap scheduling, `process_batch_result(N-1)` runs on the default stream concurrently with the `forward(N)` draft extend on the forward stream. When a request finishes, `release_kv_cache` immediately returns its `req_pool_idx` to the free list. A new request can then recycle that pool index, and `prepare_for_decode` overwrites the `req_to_token` row on the default stream while the draft extend attention in `forward(N)` still reads it.
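As an illustration of the interleaving, here is a toy PyTorch two-stream sketch of the failure pattern. The tensor and variable names are illustrative stand-ins, not sglang's actual code, and whether the race fires in practice depends on kernel timing:

```python
import torch

# One req_to_token row per request pool slot (shapes are arbitrary here).
req_to_token = torch.zeros(8, 1024, dtype=torch.int64, device="cuda")
forward_stream = torch.cuda.Stream()

idx = 3  # pool index of a request that finished in iteration N-1

# forward(N): the draft extend gathers KV indices from row `idx` on the
# forward stream; the read may still be in flight when this block exits.
with torch.cuda.stream(forward_stream):
    kv_indices = req_to_token[idx] + 0  # force an actual read kernel

# Default stream, without waiting for forward_stream: the finished
# request's index is freed, recycled by a new request, and
# prepare_for_decode overwrites the same row. This is the data race.
req_to_token[idx].fill_(-1)
```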
Modifications

Instead of freeing immediately, append finished requests to a deferred list (`_deferred_kv_release_reqs`). The list is flushed at the start of the next `process_batch_result_decode`, right after `copy_done.synchronize()`, which guarantees the forward stream has finished reading those rows. An additional flush during idle ensures slots aren't held unnecessarily when no more batches are running.
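A minimal sketch of where the flush sits in the decode path, assuming simplified stand-ins for the scheduler and batch objects; apart from `deferred_free` and `flush_deferred_frees`, the names below are hypothetical:

```python
import torch


class Scheduler:
    """Stand-in scheduler showing only the deferred-free plumbing."""

    def __init__(self, req_to_token_pool):
        self.req_to_token_pool = req_to_token_pool
        self.running_batches = []

    def process_batch_result_decode(self, batch, copy_done: torch.cuda.Event):
        # copy_done was recorded on the forward stream after forward(N),
        # so waiting on it guarantees no kernel still reads req_to_token.
        copy_done.synchronize()
        # Safe point: recycle the pool indices parked in iteration N-1.
        self.req_to_token_pool.flush_deferred_frees()
        for req in batch.reqs:
            if req.finished():
                # Defer instead of freeing now: forward(N+1) may already
                # be running on the forward stream and reading this row.
                self.req_to_token_pool.deferred_free(req.req_pool_idx)

    def flush_when_idle(self):
        # With no batches in flight nothing can read the rows, so release
        # parked slots immediately instead of holding them.
        if not self.running_batches:
            self.req_to_token_pool.flush_deferred_frees()
```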