[RFC] Hybrid model ExtractHiddenStates: CacheOnly as supplementary tensors #161
Draft
rahul-tuli wants to merge 1 commit into main
…nsors

CacheOnly layers are modeled as supplementary tensors that share group 0's block table rather than as separate KV cache groups. This avoids polluting group coordination logic while properly managing memory.

Key changes:
- Add CacheOnlySpec(MLAAttentionSpec) to kv_cache_interface.py
- Add supplementary_specs field to KVCacheConfig for non-group tensors
- split_supplementary_specs() separates CacheOnly before group routing
- Memory accounting includes supplementary bytes in budget calculation
- attn_utils: _reshape_one_layer() shared helper, supplementary init/reshape
- gpu_model_runner: supplementary alloc/reshape/slot_mapping support
- Connector uses GPU-authoritative slot_mapping from attn_metadata
- Gate HMA disable with supports_hma() check for hybrid models

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
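The split and memory-accounting steps listed in the commit message can be sketched roughly as follows. This is an illustrative sketch only, not the PR's code: `LayerSpec`, `page_size_bytes`, `grouped_specs`, and `build_config` are hypothetical stand-ins, the real `CacheOnlySpec` subclasses `MLAAttentionSpec`, and the budget formula assumes one tensor per layer with no sharing across groups.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical stand-in for a per-layer KV cache spec."""
    page_size_bytes: int  # bytes one block occupies in this layer's tensor


@dataclass(frozen=True)
class CacheOnlySpec(LayerSpec):
    """Stand-in for the PR's CacheOnlySpec (really an MLAAttentionSpec subclass)."""


@dataclass
class KVCacheConfig:
    """Stand-in config: only grouped_specs would reach group routing."""
    num_blocks: int
    grouped_specs: dict = field(default_factory=dict)
    supplementary_specs: dict = field(default_factory=dict)


def split_supplementary_specs(specs):
    """Separate CacheOnly layers from the layers that go through group routing."""
    regular = {n: s for n, s in specs.items() if not isinstance(s, CacheOnlySpec)}
    supplementary = {n: s for n, s in specs.items() if isinstance(s, CacheOnlySpec)}
    return regular, supplementary


def build_config(specs, available_bytes):
    """Size the block pool so supplementary tensors are counted in the budget."""
    regular, supplementary = split_supplementary_specs(specs)
    # Supplementary tensors share the block table, so every block costs one
    # page in each regular layer plus one page in each supplementary layer.
    bytes_per_block = sum(s.page_size_bytes for s in specs.values())
    num_blocks = available_bytes // bytes_per_block
    return KVCacheConfig(num_blocks, regular, supplementary)


specs = {
    "attn.0": LayerSpec(1024),
    "attn.1": LayerSpec(1024),
    "cache_only.0": CacheOnlySpec(512),
}
cfg = build_config(specs, available_bytes=10_000)
```

In this toy setup the CacheOnly layer reduces the number of blocks that fit in the budget but never appears among the specs handed to group coordination.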
Summary
- Hybrid model support for `extract_hidden_states` speculative decoding, by modeling CacheOnly as supplementary tensors that share group 0's block table
- Add `supplementary_specs` field on `KVCacheConfig`: CacheOnly layers get their own KV cache tensors but are invisible to the KV cache coordinator
- Shared `_reshape_one_layer()` helper in `attn_utils.py` eliminates reshape duplication
- Connector uses GPU-authoritative `attn_metadata.slot_mapping` instead of scheduler-computed slot mappings
- Gate HMA disable with `supports_hma()` check so hybrid models keep per-group block allocators

This is 1 of 3 alternative approaches; see the RFC document and sister PRs for comparison.
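To illustrate why sharing group 0's block table lets one slot mapping serve both group 0 and the supplementary tensors, here is a small sketch. It assumes the usual paged layout where a token's slot is `block_id * block_size + offset`; `compute_slot_mapping`, `write_slots`, and the dict-backed caches are hypothetical illustrations, not functions from this PR.

```python
def compute_slot_mapping(block_table, positions, block_size):
    """Map token positions to flat cache slots through a block table.

    Assumed layout: slot = physical_block_id * block_size + offset_in_block.
    """
    return [
        block_table[pos // block_size] * block_size + pos % block_size
        for pos in positions
    ]


def write_slots(cache, slots, values):
    """Store one value per slot (a dict stands in for a cache tensor)."""
    for slot, value in zip(slots, values):
        cache[slot] = value


# One request whose logical blocks 0 and 1 map to physical blocks 7 and 2.
block_table = [7, 2]
slots = compute_slot_mapping(block_table, positions=range(6), block_size=4)

# Both tensors are addressed by the same slots, so no separate mapping
# (or separate KV cache group) is needed for the CacheOnly data.
group0_cache = {}
supplementary_cache = {}
write_slots(group0_cache, slots, [f"kv{i}" for i in range(6)])
write_slots(supplementary_cache, slots, [f"hidden{i}" for i in range(6)])
```

Because both writes use the same `slots` list, anything that reads the supplementary data only needs the mapping that was actually used on the GPU, which is the motivation the summary gives for preferring `attn_metadata.slot_mapping`.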
Test plan
- `pre-commit run --all-files` passes on changed files
- `tests/v1/core/test_kv_cache_utils.py`

🤖 Generated with Claude Code