
[RFC] Hybrid model ExtractHiddenStates: CacheOnly as filtered KV cache group#160

Draft
rahul-tuli wants to merge 2 commits into main from hma/extract-hidden-states-filtered-group

Conversation

@rahul-tuli
Member

Summary

  • Adds hybrid model support (e.g. Qwen3.5) for extract_hidden_states speculative decoding by modeling CacheOnly as its own KV cache group, filtered from page-size uniformity checks and group coordination
  • CacheOnlySpec(MLAAttentionSpec) is pre-filtered in get_kv_cache_groups() before type-unification, then appended as a separate group with joint memory budget accounting
  • Gates HMA disable with supports_hma() check so hybrid models keep per-group block allocators when the connector supports it
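The pre-filtering described above can be sketched as follows. This is a minimal, illustrative model of the spec hierarchy and of `get_kv_cache_groups()`, not the PR's actual code: the stand-in classes carry only a `page_size_bytes` field, and the real function also performs type-unification of the remaining groups, which is elided here.

```python
from dataclasses import dataclass

# Minimal stand-ins for the attention-spec hierarchy; the real specs
# carry block_size, head_size, dtype, and more.
@dataclass
class AttentionSpec:
    page_size_bytes: int

class FullAttentionSpec(AttentionSpec): pass
class MLAAttentionSpec(FullAttentionSpec): pass

# Spec introduced by this PR: cache-only layers that store hidden
# states but never attend over them.
class CacheOnlySpec(MLAAttentionSpec): pass

def get_kv_cache_groups(specs: dict[str, AttentionSpec]) -> list[list[str]]:
    """Sketch: pre-filter CacheOnlySpec before type-unification, then
    append the filtered layers as their own trailing group."""
    cache_only = {n: s for n, s in specs.items()
                  if isinstance(s, CacheOnlySpec)}
    regular = {n: s for n, s in specs.items() if n not in cache_only}
    # The real code would unify `regular` by spec type / page size here.
    groups = [list(regular)]
    if cache_only:
        groups.append(list(cache_only))
    return groups
```

The key point is that the filter runs on the concrete `CacheOnlySpec` type before any uniformity logic sees the specs, so mixed hybrid layouts never reach the unification path with a cache-only spec in the mix.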

This is 1 of 3 alternative approaches — see RFC document and sister PRs for comparison:

  • Approach A (this PR): CacheOnly as filtered KV cache group (9 files, +300/-78)
  • Approach B: CacheOnly as supplementary tensors — #TBD
  • Approach C: Bypass KV cache entirely — #TBD

Test plan

  • Verified on Qwen3.5-9B with TP=4, non-zero hidden states extracted
  • Verified on standard (non-hybrid) model — no regression
  • pre-commit run --all-files passes on changed files
  • Unit tests added in tests/v1/core/test_kv_cache_utils.py

🤖 Generated with Claude Code

- Add CacheOnlySpec(MLAAttentionSpec) to kv_cache_interface.py so it
    duck-types through all existing AttentionSpec code paths
- Pre-filter CacheOnlySpec in get_kv_cache_groups() before type-
    unification routing to prevent crashes with mixed spec types
- Joint budget calculation in get_kv_cache_config_from_groups() via
    extra_bytes_per_block parameter on get_num_blocks()
- Gate HMA disable in config with supports_hma() check so hybrid
    models keep their per-group block allocators
- Add SupportsHMA to ExampleHiddenStatesConnector with correct
    cache_group_idx for block_ids
- Resolve CacheOnly slot_mapping from per-layer mappings in the
    proposer instead of using main group's common_attn_metadata

Signed-off-by: Rahul-Tuli <rtuli@redhat.com>
CacheOnlySpec inherits from MLAAttentionSpec, which in turn inherits from FullAttentionSpec, so isinstance(cache_only, FullAttentionSpec) returns True. As a result, build_block_map_addrs includes CacheOnly layers in its page-size uniformity checks and crashes with "Non-uniform page sizes" on hybrid models.

Add explicit CacheOnlySpec filter after the FullAttentionSpec gate.
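A minimal reproduction of the gate and the fix, using stand-in classes rather than the real specs (the actual gate lives inside build_block_map_addrs; `included_in_uniformity_check` is a hypothetical helper for illustration):

```python
# CacheOnlySpec -> MLAAttentionSpec -> FullAttentionSpec, so the
# FullAttentionSpec isinstance gate alone also admits CacheOnly layers.
class FullAttentionSpec: pass
class MLAAttentionSpec(FullAttentionSpec): pass
class CacheOnlySpec(MLAAttentionSpec): pass

def included_in_uniformity_check(spec) -> bool:
    if not isinstance(spec, FullAttentionSpec):
        return False
    # Fix: explicit CacheOnlySpec filter after the FullAttentionSpec
    # gate, so cache-only pages never enter the page-size check.
    if isinstance(spec, CacheOnlySpec):
        return False
    return True
```

Ordering matters only for readability here; the essential change is that the subclass is excluded explicitly instead of riding through on the base-class check.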

Signed-off-by: Rahul Tuli <rtuli@redhat.com>