[BugFix] Implement scatter-done event handling for RDMA writes #189

Open
matthewygf wants to merge 3 commits into LMCache:main from matthewygf:main

Conversation

@matthewygf
Collaborator

This pull request adds synchronization mechanisms to prevent race conditions when using RDMA writes and scatter operations on the same memory pool in lmcache_ascend/v1/npu_connector.py. The changes ensure that RDMA operations do not overwrite buffers that are still being read by a scatter operation, improving data integrity and correctness.

Synchronization improvements for RDMA and scatter operations:

  • Introduced per-pool scatter-done events (pool_scatter_events) and flags (pool_scatter_recorded) to ensure that an RDMA write does not race with an ongoing scatter read from the same pool. The code now waits for the previous scatter to complete before allowing RDMA writes to proceed.
  • Implemented recording of scatter-done events on the load_stream after each scatter operation, and updated the logic to track which pool's event was recorded, ensuring proper synchronization for subsequent RDMA writes.
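The per-pool event pattern described above can be illustrated with a minimal, self-contained sketch. This is an analogy only: `threading.Event` stands in for the `torch.npu.Event` objects recorded and waited on via NPU streams in `npu_connector.py`, and the class and method names (`PoolScatterSync`, `before_rdma_write`, `after_scatter`) are hypothetical, not taken from the actual code.

```python
import threading

NUM_POOLS = 2  # illustrative pool count

class PoolScatterSync:
    """Sketch of the per-pool scatter-done synchronization described above."""

    def __init__(self, num_pools=NUM_POOLS):
        # One scatter-done event per pool, plus a flag recording whether the
        # event has been recorded since the last RDMA wait on that pool.
        self.pool_scatter_events = [threading.Event() for _ in range(num_pools)]
        self.pool_scatter_recorded = [False] * num_pools

    def before_rdma_write(self, pool_idx):
        # An RDMA write must not overwrite a buffer that a scatter is still
        # reading: wait for the previously recorded scatter-done event first.
        if self.pool_scatter_recorded[pool_idx]:
            self.pool_scatter_events[pool_idx].wait()
            self.pool_scatter_events[pool_idx].clear()
            self.pool_scatter_recorded[pool_idx] = False

    def after_scatter(self, pool_idx):
        # Record the scatter-done event for this pool so the next RDMA write
        # into the same pool knows what to wait on.
        self.pool_scatter_events[pool_idx].set()
        self.pool_scatter_recorded[pool_idx] = True
```

In the real connector the wait side runs on the transport stream (`transport_stream.wait_event`) and the record side on `self.load_stream`; this sketch only mirrors the control flow.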

Added handling for scatter-done events to prevent RDMA writes from racing with ongoing reads in the same pool. Introduced event recording for synchronization in multi-batch processing.

Signed-off-by: Matthew Yeung <yyygggfff@hotmail.com>
@matthewygf matthewygf marked this pull request as ready for review March 11, 2026 22:36
Copilot AI review requested due to automatic review settings March 11, 2026 22:36
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request enhances the npu_connector by integrating critical synchronization logic to manage concurrent RDMA write and scatter read operations. The changes are designed to eliminate potential race conditions, thereby safeguarding data consistency and improving the reliability of memory access patterns in high-performance computing environments.

Highlights

  • Synchronization for RDMA and Scatter Operations: Introduced robust synchronization mechanisms to prevent race conditions between RDMA writes and scatter operations that access the same memory pool. This ensures data integrity by preventing RDMA from overwriting buffers still being read by a scatter operation.
  • Per-Pool Scatter-Done Events: Implemented pool_scatter_events and pool_scatter_recorded flags to track the completion of scatter operations for each memory pool. RDMA writes now wait for the corresponding scatter event to complete before proceeding.
  • Event Recording on Load Stream: Updated the logic to record scatter-done events on the load_stream after each scatter operation. This ensures proper synchronization for subsequent RDMA writes by marking which pool's event has been recorded.


Changelog
  • lmcache_ascend/v1/npu_connector.py
    • Initialized pool_scatter_events (torch.npu.Event objects) and pool_scatter_recorded (boolean flags) to manage per-pool scatter synchronization.
    • Added a wait mechanism using transport_stream.wait_event to ensure that an RDMA write for a specific pool does not proceed until the previous scatter operation on that pool has completed.
    • Recorded scatter-done events on the self.load_stream for the appropriate pool after each scatter operation.
    • Updated the pool_scatter_recorded flag to True for the pool whose scatter-done event was just recorded.
    • Removed a TODO comment related to investigating scatter-done event recording on the load stream.
Activity
  • No human activity has been recorded on this pull request yet.


Copilot AI left a comment


Pull request overview

This PR implements scatter-done event handling for RDMA writes in the NPU connector's ping-pong pipeline, resolving a TODO about potential race conditions between RDMA writes and scatter reads on the same memory pool.

Changes:

  • Added per-pool scatter-done events and tracking flags to synchronize RDMA writes with ongoing scatter operations
  • Replaced a TODO comment with actual event recording and synchronization logic




@gemini-code-assist bot left a comment


Code Review

This pull request introduces synchronization for RDMA writes and scatter operations to prevent race conditions, which is a solid improvement for data integrity. The use of per-pool events and flags is well-implemented. I have one suggestion to improve maintainability by avoiding a hardcoded value.

Signed-off-by: Matthew Yeung <yyygggfff@hotmail.com>
Ensure compute stream waits for load_stream's KV scatter completion before attention reads.

Signed-off-by: Matthew Yeung <yyygggfff@hotmail.com>
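The second commit's ordering constraint — compute must wait for the load path's KV scatter before attention reads the cache — can be sketched with the same analogy. Here two Python threads stand in for the compute stream and `load_stream`, and `threading.Event` stands in for the NPU event that the compute stream would wait on (`wait_event`); `run_pipeline`, `kv_cache`, and `new_blocks` are illustrative names, not from the repository.

```python
import threading

def run_pipeline(kv_cache, new_blocks):
    """Sketch: compute waits for the load path's scatter-done event
    before reading the KV cache, mirroring the stream dependency above."""
    scatter_done = threading.Event()
    result = {}

    def load_path():
        # Emulates the KV scatter on load_stream, then records the event.
        kv_cache.extend(new_blocks)
        scatter_done.set()

    def compute_path():
        # Emulates the compute stream: wait on the event before the
        # attention read touches the KV cache.
        scatter_done.wait()
        result["seen"] = list(kv_cache)

    t_compute = threading.Thread(target=compute_path)
    t_load = threading.Thread(target=load_path)
    t_compute.start()
    t_load.start()
    t_load.join()
    t_compute.join()
    return result["seen"]
```

Without the `scatter_done.wait()`, the compute path could observe a partially populated cache; with it, the read is guaranteed to see the scattered blocks.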
@matthewygf matthewygf changed the title Implement scatter-done event handling for RDMA writes [BugFix] Implement scatter-done event handling for RDMA writes Mar 12, 2026