
add async api for kv-transfer #6565


Open · wants to merge 2 commits into main

Conversation

@niqi-lyu commented May 24, 2025

Motivation

This PR introduces asynchronous support for the KV manager in the disaggregation module, improving performance and reducing latency in PD (prefill/decode) disaggregation inference scenarios. The key goals are:

  • Implementing asynchronous transfer mechanisms for KV cache management
  • Adding support for layer callback tracking and batch preparation
  • Enhancing the scheduler to handle async operations
  • Optimizing block-alignment calculations for efficient memory transfers (an illustrative helper follows this list)
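To make the block-alignment point concrete, here is a minimal sketch assuming a paged KV cache with fixed-size blocks. The function `align_to_blocks`, its parameters, and the block layout are my own illustration, not code from this PR:

```python
# Hypothetical helper (not from this PR): assuming a paged KV cache with
# fixed-size blocks, expand a token range to whole blocks so each transfer
# moves contiguous, block-aligned memory.
def align_to_blocks(start: int, end: int, block_size: int) -> tuple[int, int]:
    """Return [aligned_start, aligned_end) covering [start, end)."""
    aligned_start = (start // block_size) * block_size
    aligned_end = -(-end // block_size) * block_size  # ceiling division
    return aligned_start, aligned_end

# Example: tokens 5..130 with 64-token blocks cover blocks spanning 0..192.
assert align_to_blocks(5, 130, 64) == (0, 192)
```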

Modifications

  1. StreamAsyncSubmitter:
  • Uses a ring buffer and asynchronous copies to detect when GPU compute has finished; the idea is borrowed from mscclpp (see the sketch after this list).
  2. BaseKVManager enhancements:
  • Added an is_support_async property flag
  • Implemented prepare_batch() for async batch preparation
  • Added mark_layer_ready() for layer-completion tracking
  • Introduced insert_layer_callbacks() for hooking model layers (a hedged sketch of this API surface also follows the list)
  3. New async components:
  • Created the MooncakeAsyncKVManager class in conn_async.py
  • Modified SchedulerDisaggregationPrefillMixin to support the async APIs
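The ring-buffer completion check in item 1 can be sketched as follows. This is a hedged illustration of the mscclpp-style idea only; the class name `StreamAsyncSubmitterSketch`, its methods, and all signatures are assumptions, not the PR's actual implementation:

```python
# Illustrative sketch, assuming PyTorch/CUDA. After each unit of compute is
# enqueued on a stream, a tiny async device-to-host copy writes a sequence
# number into a pinned ring buffer; because CUDA stream ordering guarantees
# the copy lands only after the preceding compute finishes, the host can
# poll the buffer to learn GPU progress without synchronizing.
import torch

class StreamAsyncSubmitterSketch:
    def __init__(self, stream: torch.cuda.Stream, capacity: int = 64):
        self.stream = stream
        self.capacity = capacity
        # Device-side counter incremented after each compute step.
        self.seq_dev = torch.zeros(1, dtype=torch.int64, device="cuda")
        # Host-visible (pinned) ring buffer the GPU copies into.
        self.ring = torch.zeros(capacity, dtype=torch.int64, pin_memory=True)
        self.submitted = 0

    def mark_submitted(self) -> None:
        """Call right after enqueueing one unit of GPU work on `stream`."""
        self.submitted += 1
        slot = self.submitted % self.capacity
        with torch.cuda.stream(self.stream):
            self.seq_dev += 1
            # Async D2H copy into pinned memory: becomes visible on the host
            # only after the preceding compute on this stream has finished.
            self.ring[slot].copy_(self.seq_dev[0], non_blocking=True)

    def completed(self) -> int:
        """Non-blocking poll: highest sequence number the GPU has reached."""
        return int(self.ring.max().item())
```

The appeal of this pattern over host-side `cudaEventSynchronize`-style waits is that the scheduler thread never blocks: it polls `completed()` and launches KV transfers for whatever work has already finished.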
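The BaseKVManager additions in item 2 can be summarized as an interface sketch. The method names come from this PR's description, but the signatures, parameters, and docstrings below are assumptions for illustration:

```python
# Hedged sketch of the async API surface; not the PR's exact code.
from abc import ABC

class BaseKVManager(ABC):
    @property
    def is_support_async(self) -> bool:
        # Subclasses that can transfer KV cache layer-by-layer while the
        # model is still computing override this to return True.
        return False

    def prepare_batch(self, batch) -> None:
        # Stage transfer metadata for the batch before the forward pass
        # starts, so per-layer sends need no extra setup.
        raise NotImplementedError

    def mark_layer_ready(self, layer_id: int) -> None:
        # Called from a per-layer callback once that layer's KV has been
        # written; kicks off the async transfer for the layer.
        raise NotImplementedError

    def insert_layer_callbacks(self, model) -> None:
        # Hook each model layer so mark_layer_ready fires as soon as the
        # layer's compute finishes.
        raise NotImplementedError
```

Under this reading, a scheduler would check `is_support_async`, call `prepare_batch()` before the forward pass, and rely on the callbacks installed by `insert_layer_callbacks()` to trigger `mark_layer_ready()` as each layer completes.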

Checklist

@ShangmingCai (Collaborator)

Thank you for this PR. Since we have PD + chunked prefill, and compute and transfer already fully overlap with each other, can you provide a serving benchmark result showing how the async KV manager brings performance gains and reduces latency?
