[RFC] Hybrid model ExtractHiddenStates: bypass KV cache by rahul-tuli · Pull Request #162 · neuralmagic/vllm

rahul-tuli · 2026-04-15T17:07:56Z

Summary

Adds hybrid model support (e.g. Qwen3.5) for extract_hidden_states speculative decoding by bypassing the KV cache entirely
Proposer stacks target-model hidden states and passes them directly to the connector via save_hidden_states() — no draft model loaded, no CacheOnly attention layer, no CUDAGraphs
Connector accumulates per-request hidden states on CPU and writes to disk each step
Zero interaction with KV cache — no CacheOnlySpec, no groups, no coordinator changes. Only 3 files changed, net -343 lines
Gates HMA disable with supports_hma() check so hybrid models keep per-group block allocators
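
The HMA gate in the last bullet can be pictured roughly as follows. This is a minimal sketch, not the PR's code: `SupportsHMA` and `supports_hma()` are names taken from the description, but their definitions here are stand-ins, and `BypassConnector`, `LegacyConnector`, and `resolve_disable_hma` are hypothetical names introduced for illustration.

```python
from typing import Optional


class SupportsHMA:
    """Stand-in marker base class; the real interface is not shown in this PR text."""


class BypassConnector(SupportsHMA):
    """Hypothetical connector that declares HMA support."""


class LegacyConnector:
    """Hypothetical connector that does not declare HMA support."""


def supports_hma(connector: object) -> bool:
    # Assumption: support is detected by an isinstance check on the marker class.
    return isinstance(connector, SupportsHMA)


def resolve_disable_hma(connector: Optional[object]) -> bool:
    """Return True if the hybrid memory allocator (HMA) should be disabled."""
    if connector is None:
        # No connector configured, so nothing forces HMA off.
        return False
    # Only disable HMA when the connector cannot work with it.
    return not supports_hma(connector)
```

Under these assumptions, a hybrid model paired with an HMA-aware connector keeps its per-group block allocators, while an unaware connector still falls back to disabling HMA.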

This is 1 of 3 alternative approaches — see RFC document and sister PRs for comparison:

Approach A: CacheOnly as filtered KV cache group — [RFC] Hybrid model ExtractHiddenStates: CacheOnly as filtered KV cache group #160
Approach B: CacheOnly as supplementary tensors — [RFC] Hybrid model ExtractHiddenStates: CacheOnly as supplementary tensors #161
Approach C (this PR): Bypass KV cache entirely (3 files, +139/-482)

Test plan

Verified on Qwen3.5-9B with TP=4, non-zero hidden states extracted
Verified on standard (non-hybrid) model — no regression
pre-commit run --all-files passes on changed files

🤖 Generated with Claude Code

Instead of routing hidden states through a CacheOnly attention layer and the KV cache pipeline, the proposer stacks target-model intermediate hidden states and passes them directly to the connector. The connector accumulates per-request tensors on CPU and saves them when the request finishes. This avoids adding any KV cache groups or modifying core KV cache code, making it naturally compatible with hybrid models.

Key changes:
- Proposer: remove draft model loading, CUDAGraphs, attention metadata, and slot_mapping buffers. propose() calls connector.save_hidden_states()
- Connector: add save_hidden_states() for direct hidden state accumulation, save to disk in request_finished(). Implement SupportsHMA.
- Config: gate HMA disable with supports_hma() check so hybrid models keep their per-group block allocators when connector supports it

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
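
The data flow this commit describes (stack per-layer hidden states, hand them to the connector, accumulate per request) can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: nested lists stand in for tensors, the argument list of `save_hidden_states()` is guessed, and `stack_layers` and `ToyAccumulatingConnector` are hypothetical names.

```python
from collections import defaultdict


def stack_layers(per_layer):
    """Reorder [layers][tokens][hidden] into [tokens][layers][hidden]."""
    # zip(*per_layer) groups the i-th token row from every layer together.
    return [list(token_rows) for token_rows in zip(*per_layer)]


class ToyAccumulatingConnector:
    """Hypothetical connector that buffers hidden states per request."""

    def __init__(self):
        self._buffers = defaultdict(list)

    def save_hidden_states(self, req_ids, num_tokens, stacked):
        # Assumption: tokens for each request are laid out contiguously in
        # the batch, in the same order as req_ids.
        offset = 0
        for req_id, n in zip(req_ids, num_tokens):
            self._buffers[req_id].extend(stacked[offset:offset + n])
            offset += n


def propose(connector, per_layer, req_ids, num_tokens):
    # No draft model and no KV cache writes: just stack and hand off.
    connector.save_hidden_states(req_ids, num_tokens, stack_layers(per_layer))
```

With two layers, three tokens, and requests `"a"` (two tokens) and `"b"` (one token), each request's buffer ends up holding one `[layers][hidden]` entry per token it owns.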

request_finished() runs on the scheduler process which has a separate connector instance with no accumulated data. Move the disk write into save_hidden_states() which runs on the worker, so the file is ready when the scheduler reports the path.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
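
The scheduler/worker split behind this fix can be illustrated with two instances of one toy class. This is a sketch under assumptions: the JSON file format, the `<req_id>.json` naming, and the `ToyDiskConnector` class are all hypothetical, and the real method signatures are not shown in this PR text.

```python
import json
import tempfile
from pathlib import Path


class ToyDiskConnector:
    """Hypothetical connector; one instance lives in each process."""

    def __init__(self, out_dir):
        self.out_dir = Path(out_dir)
        self._buffers = {}

    def path_for(self, req_id):
        # Both instances derive the same path from the request ID alone.
        return self.out_dir / f"{req_id}.json"

    def save_hidden_states(self, req_id, rows):
        # Worker side: accumulate, then rewrite the file on every step so
        # it is complete whenever the request happens to finish.
        buf = self._buffers.setdefault(req_id, [])
        buf.extend(rows)
        self.path_for(req_id).write_text(json.dumps(buf))

    def request_finished(self, req_id):
        # Scheduler side: this instance has no buffered data, so it only
        # reports where the worker has already written the file.
        return self.path_for(req_id)


with tempfile.TemporaryDirectory() as tmp:
    worker = ToyDiskConnector(tmp)
    scheduler = ToyDiskConnector(tmp)
    worker.save_hidden_states("r1", [[1.0, 2.0]])
    worker.save_hidden_states("r1", [[3.0, 4.0]])
    saved = json.loads(scheduler.request_finished("r1").read_text())
```

Rewriting the whole file each step keeps the sketch simple but costs more I/O as a request grows; an append-only format would be one alternative.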

rahul-tuli added 2 commits April 15, 2026 16:55

rahul-tuli wants to merge 2 commits into main from hma/extract-hidden-states-bypass