[RFC] Hybrid model ExtractHiddenStates: CacheOnly as supplementary tensors by rahul-tuli · Pull Request #161 · neuralmagic/vllm

rahul-tuli · 2026-04-15T17:07:42Z

Summary

Adds hybrid model support (e.g. Qwen3.5) for extract_hidden_states speculative decoding by modeling CacheOnly as supplementary tensors that share group 0's block table
Introduces supplementary_specs field on KVCacheConfig — CacheOnly layers get their own KV cache tensors but are invisible to the KV cache coordinator
Shared _reshape_one_layer() helper in attn_utils.py eliminates reshape duplication
Uses GPU-authoritative attn_metadata.slot_mapping instead of scheduler-computed slot mappings
Gates HMA disable with supports_hma() check so hybrid models keep per-group block allocators
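
To make the split concrete, here is a minimal sketch of separating CacheOnly layers into a supplementary mapping before any group routing happens. It is not the code from this PR: `LayerSpec`, `ToyKVCacheConfig`, `split_supplementary`, the `cache_only` flag, and the layer names are all hypothetical stand-ins, and the real `KVCacheConfig` and `split_supplementary_specs()` may have different fields and signatures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical stand-in for a per-layer KV cache spec."""
    page_size_bytes: int
    cache_only: bool = False  # True for CacheOnly layers (assumed flag)


@dataclass
class ToyKVCacheConfig:
    """Hypothetical, simplified stand-in for KVCacheConfig."""
    # Layers that go through normal KV cache group routing.
    group_specs: dict[str, LayerSpec] = field(default_factory=dict)
    # Layers that get their own tensors but are hidden from the coordinator.
    supplementary_specs: dict[str, LayerSpec] = field(default_factory=dict)


def split_supplementary(specs: dict[str, LayerSpec]) -> ToyKVCacheConfig:
    """Route CacheOnly layers to supplementary_specs, everything else to groups."""
    config = ToyKVCacheConfig()
    for name, spec in specs.items():
        target = config.supplementary_specs if spec.cache_only else config.group_specs
        target[name] = spec
    return config


# Illustrative layer names only; they do not come from the PR.
specs = {
    "layers.0.attn": LayerSpec(page_size_bytes=4096),
    "layers.1.linear_attn": LayerSpec(page_size_bytes=2048),
    "cache_only.hidden": LayerSpec(page_size_bytes=8192, cache_only=True),
}
config = split_supplementary(specs)
print(sorted(config.group_specs))          # group-routed layers
print(sorted(config.supplementary_specs))  # supplementary layers
```

In this toy model, group-routing logic only ever sees `group_specs`, which is the property the summary describes as CacheOnly layers being invisible to the KV cache coordinator.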

This is 1 of 3 alternative approaches — see RFC document and sister PRs for comparison:

Approach A: CacheOnly as filtered KV cache group — [RFC] Hybrid model ExtractHiddenStates: CacheOnly as filtered KV cache group #160
Approach B (this PR): CacheOnly as supplementary tensors (8 files, +433/-87)
Approach C: Bypass KV cache entirely — #TBD

Test plan

Verified on Qwen3.5-9B with TP=4: non-zero hidden states were extracted
Verified on standard (non-hybrid) model — no regression
pre-commit run --all-files passes on changed files
Unit tests added in tests/v1/core/test_kv_cache_utils.py

🤖 Generated with Claude Code

…nsors

CacheOnly layers are modeled as supplementary tensors that share group 0's block table rather than as separate KV cache groups. This avoids polluting group coordination logic while properly managing memory.

Key changes:
- Add CacheOnlySpec(MLAAttentionSpec) to kv_cache_interface.py
- Add supplementary_specs field to KVCacheConfig for non-group tensors
- split_supplementary_specs() separates CacheOnly before group routing
- Memory accounting includes supplementary bytes in budget calculation
- attn_utils: _reshape_one_layer() shared helper, supplementary init/reshape
- gpu_model_runner: supplementary alloc/reshape/slot_mapping support
- Connector uses GPU-authoritative slot_mapping from attn_metadata
- Gate HMA disable with supports_hma() check for hybrid models

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
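
Two of the items in this commit message, memory accounting and the shared block table, can be illustrated with a small sketch. It rests on assumptions rather than on the PR's code: it assumes every block carries one page per group layer plus one page per supplementary layer, and that slots follow the usual paged-cache layout `block_id * block_size + offset`. The helper names `num_blocks_for_budget` and `slot_mapping` are hypothetical.

```python
def num_blocks_for_budget(
    available_bytes: int,
    group_page_bytes: list[int],
    supplementary_page_bytes: list[int],
) -> int:
    """Number of blocks that fit when supplementary pages count toward the budget."""
    # Assumption: each block needs one page from every layer, supplementary included.
    bytes_per_block = sum(group_page_bytes) + sum(supplementary_page_bytes)
    return available_bytes // bytes_per_block


def slot_mapping(block_table: list[int], positions: list[int], block_size: int) -> list[int]:
    """Map token positions to flat cache slots through a single block table."""
    return [
        block_table[pos // block_size] * block_size + pos % block_size
        for pos in positions
    ]


# Ignoring supplementary bytes would overestimate how many blocks fit.
without_supp = num_blocks_for_budget(1_000_000, [4096, 2048], [])
with_supp = num_blocks_for_budget(1_000_000, [4096, 2048], [8192])
print(without_supp, with_supp)  # 162 69

# Group 0's block table is reused, so a supplementary tensor gets the same slots.
group0_block_table = [7, 3]
slots = slot_mapping(group0_block_table, positions=[0, 1, 16, 17], block_size=16)
print(slots)  # [112, 113, 48, 49]
```

Because the supplementary tensor is indexed with the same slots as group 0, it needs no block allocator of its own in this sketch; it only needs to be sized for the same number of blocks.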

rahul-tuli mentioned this pull request Apr 15, 2026

[RFC] Hybrid model ExtractHiddenStates: bypass KV cache #162

rahul-tuli wants to merge 1 commit into main from hma/extract-hidden-states-supplementary