[RFC] Hybrid model ExtractHiddenStates: bypass KV cache #162
Draft
rahul-tuli wants to merge 2 commits into main
Instead of routing hidden states through a CacheOnly attention layer and the KV cache pipeline, the proposer stacks target-model intermediate hidden states and passes them directly to the connector. The connector accumulates per-request tensors on CPU and saves them when the request finishes. This avoids adding any KV cache groups or modifying core KV cache code, making it naturally compatible with hybrid models.

Key changes:
- Proposer: remove draft model loading, CUDAGraphs, attention metadata, and slot_mapping buffers. propose() calls connector.save_hidden_states().
- Connector: add save_hidden_states() for direct hidden state accumulation, save to disk in request_finished(). Implement SupportsHMA.
- Config: gate HMA disable with supports_hma() check so hybrid models keep their per-group block allocators when the connector supports it.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
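The stack-and-accumulate flow described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: plain Python lists stand in for tensors, and `stack_layers`, `HiddenStateAccumulator`, and `pop` are hypothetical names. Only `save_hidden_states()` comes from the commit message, and its real signature may differ.

```python
from collections import defaultdict


def stack_layers(per_layer):
    """Regroup [layer][token][hidden] into [token][layer][hidden].

    Stand-in for stacking the target model's intermediate hidden states;
    nested lists are used here instead of tensors.
    """
    return [list(token_rows) for token_rows in zip(*per_layer)]


class HiddenStateAccumulator:
    """Hypothetical connector-side buffer keyed by request id."""

    def __init__(self):
        self._by_request = defaultdict(list)

    def save_hidden_states(self, req_id, per_layer):
        # Append this step's tokens to whatever the request already has.
        self._by_request[req_id].extend(stack_layers(per_layer))

    def pop(self, req_id):
        # Hand back everything accumulated for the request and forget it.
        return self._by_request.pop(req_id, [])


acc = HiddenStateAccumulator()
# Step 1: two layers, two tokens, hidden size 2.
acc.save_hidden_states("r1", [[[1.0, 2.0], [3.0, 4.0]],
                              [[5.0, 6.0], [7.0, 8.0]]])
# Step 2: two layers, one new token.
acc.save_hidden_states("r1", [[[9.0, 10.0]], [[11.0, 12.0]]])
states = acc.pop("r1")
```

Because the buffer lives entirely in the connector, nothing in this flow needs a KV cache group or a block allocator, which is the property the commit relies on for hybrid models.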
request_finished() runs in the scheduler process, which has a separate connector instance with no accumulated data. Move the disk write into save_hidden_states(), which runs on the worker, so the file is ready when the scheduler reports the path.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
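The two-instance problem behind this fix can be shown with a toy model. The sketch below assumes both connector instances derive the same output path from the request id and that the worker rewrites the file on every call. `ToyConnector`, `path_for`, and the JSON format are hypothetical and chosen only to keep the example self-contained; the PR's actual file format and write policy are not stated here.

```python
import json
import os
import tempfile


class ToyConnector:
    """Hypothetical connector; one instance per process in the real system."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self._rows = {}  # per-instance state, NOT shared across processes

    def path_for(self, req_id):
        # Deterministic path so both instances agree without communicating.
        return os.path.join(self.out_dir, f"{req_id}.json")

    def save_hidden_states(self, req_id, rows):
        # Worker side: accumulate, then write immediately.
        acc = self._rows.setdefault(req_id, [])
        acc.extend(rows)
        tmp = self.path_for(req_id) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(acc, f)
        os.replace(tmp, self.path_for(req_id))  # swap in the complete file

    def request_finished(self, req_id):
        # Scheduler side: has no accumulated data, so it only reports the path.
        return self.path_for(req_id)


out_dir = tempfile.mkdtemp()
worker = ToyConnector(out_dir)     # stands in for the worker process
scheduler = ToyConnector(out_dir)  # separate instance, empty buffers

worker.save_hidden_states("req-0", [[0.1, 0.2]])
worker.save_hidden_states("req-0", [[0.3, 0.4]])

path = scheduler.request_finished("req-0")
with open(path) as f:
    saved = json.load(f)
```

Had the write stayed in `request_finished()`, the scheduler instance would have found nothing under `"req-0"` in its own `_rows` and produced an empty or missing file.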
Summary
- Hybrid model support for extract_hidden_states speculative decoding by bypassing the KV cache entirely
- Hidden states are passed to the connector via save_hidden_states(): no draft model loaded, no CacheOnly attention layer, no CUDAGraphs
- No CacheOnlySpec, no groups, no coordinator changes. Only 3 files changed, net -343 lines
- HMA disable is gated by a supports_hma() check so hybrid models keep per-group block allocators

This is 1 of 3 alternative approaches; see the RFC document and sister PRs for comparison.
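As a rough illustration of the supports_hma() gate mentioned in the summary, the sketch below keeps HMA (the per-group block allocators) enabled only when the connector opts in. The names `SupportsHMA` and `supports_hma` appear in the PR text, but their real definitions and signatures are not shown there. The marker-class check, `ToyConfig`, `hma_enabled`, and `apply_connector_gate` are all assumptions made for this example.

```python
from dataclasses import dataclass


class SupportsHMA:
    """Marker base class (assumed shape): connectors inherit it to opt in."""


def supports_hma(connector_cls):
    # Assumed check: the connector declares support by subclassing the marker.
    return issubclass(connector_cls, SupportsHMA)


class LegacyConnector:
    """Hypothetical connector that does not opt in."""


class HiddenStatesConnector(SupportsHMA):
    """Hypothetical connector that opts in, as the PR's connector does."""


@dataclass
class ToyConfig:
    hma_enabled: bool = True  # hypothetical flag name


def apply_connector_gate(config, connector_cls):
    # Disable HMA only when the connector cannot work with it.
    if not supports_hma(connector_cls):
        config.hma_enabled = False
    return config


legacy_cfg = apply_connector_gate(ToyConfig(), LegacyConnector)
hybrid_cfg = apply_connector_gate(ToyConfig(), HiddenStatesConnector)
```

The point of gating rather than disabling unconditionally is that connectors which never touch KV cache blocks have no reason to force hybrid models onto a single allocator.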
Test plan
- pre-commit run --all-files passes on changed files

🤖 Generated with Claude Code