feat: support prefix caching and chunked prefill for deepseek v32 on mlu. #660
Code Review
This pull request introduces support for prefix caching and chunked prefill for the deepseek_v32 model on MLU hardware. The changes are well-implemented and include several key improvements. A new reshape_from_cache kernel is added to support gathering KV cache data for chunked prefill. The IndexerImpl is refactored for better clarity and to accommodate the new chunked prefill logic. The data parallelism handling in LLMEngine is made more robust to correctly manage mixed forward types across different ranks. Additionally, safeguards are added to prevent enabling these new features on unsupported model variants, and comprehensive unit tests are included to validate the new functionality. The code quality is high, and the changes appear correct and well-tested.
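The review does not show the signature of the reshape_from_cache kernel, so the following is only a minimal pure-Python sketch of the general idea it describes: gathering a sequence's already-cached KV entries from a block-paged cache into one contiguous buffer before the next prefill chunk is processed. The function name `gather_from_paged_cache`, its parameters, and the nested-list cache layout are all hypothetical and are not taken from the PR.

```python
from typing import Any, List


def gather_from_paged_cache(
    cache: List[List[Any]],   # hypothetical layout: cache[block_id][slot] -> one token's KV entry
    block_table: List[int],   # physical block ids owned by the sequence, in logical order
    context_len: int,         # number of tokens already stored in the cache
    block_size: int,          # slots per block
) -> List[Any]:
    """Return the first `context_len` cached entries in logical token order."""
    if context_len > len(block_table) * block_size:
        raise ValueError("context_len exceeds the capacity of block_table")
    gathered = []
    for pos in range(context_len):
        block_id = block_table[pos // block_size]  # which physical block holds this token
        slot = pos % block_size                    # position of the token inside that block
        gathered.append(cache[block_id][slot])
    return gathered


# Illustrative data only: three physical blocks of two slots each.
demo_cache = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
demo_context = gather_from_paged_cache(demo_cache, block_table=[2, 0], context_len=3, block_size=2)
```

In a chunked-prefill step, a buffer like `demo_context` would be placed ahead of the current chunk's freshly computed keys and values so that attention covers the full prefix. A device kernel would do the same index arithmetic in parallel over tokens rather than in a Python loop.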
d41314f to 38d3e28
38d3e28 to 633671d
No description provided.