[RFC] Hybrid model ExtractHiddenStates: bypass KV cache #162

Draft
rahul-tuli wants to merge 2 commits into main from hma/extract-hidden-states-bypass

@rahul-tuli
Member

Summary

  • Adds hybrid model support (e.g. Qwen3.5) for extract_hidden_states speculative decoding by bypassing the KV cache entirely
  • Proposer stacks target-model hidden states and passes them directly to the connector via save_hidden_states() — no draft model loaded, no CacheOnly attention layer, no CUDAGraphs
  • Connector accumulates per-request hidden states on CPU and writes to disk each step
  • Zero interaction with KV cache — no CacheOnlySpec, no groups, no coordinator changes. Only 3 files changed, net -343 lines
  • Gates HMA disable with supports_hma() check so hybrid models keep per-group block allocators
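
The stacking and per-step write described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ToyHiddenStatesConnector`, `stack_layers`, the JSON file format, and the use of nested Python lists in place of tensors are all assumptions made for the example. Only the method name `save_hidden_states()` comes from the description.

```python
import json
import tempfile
from pathlib import Path


class ToyHiddenStatesConnector:
    """Hypothetical stand-in for the connector; not the real class."""

    def __init__(self, out_dir):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        # req_id -> list of tokens, each token shaped [layer][hidden_dim]
        self._accumulated = {}

    def path_for(self, req_id):
        return self.out_dir / f"{req_id}.json"

    def save_hidden_states(self, req_id, stacked):
        """Append this step's tokens, then rewrite the request's file."""
        rows = self._accumulated.setdefault(req_id, [])
        rows.extend(stacked)
        self.path_for(req_id).write_text(json.dumps(rows))
        return self.path_for(req_id)


def stack_layers(per_layer):
    """Turn [layer][token][dim] into [token][layer][dim]."""
    return [list(token_rows) for token_rows in zip(*per_layer)]


connector = ToyHiddenStatesConnector(tempfile.mkdtemp())

# Step 1: two layers, two prompt tokens, hidden dim 2.
step1 = stack_layers([[[0.1, 0.2], [0.3, 0.4]], [[1.1, 1.2], [1.3, 1.4]]])
path = connector.save_hidden_states("req-0", step1)

# Step 2: one newly decoded token.
step2 = stack_layers([[[0.5, 0.6]], [[1.5, 1.6]]])
connector.save_hidden_states("req-0", step2)

saved = json.loads(path.read_text())
```

Rewriting the whole file on every step keeps the example short; whether the real connector appends or rewrites is not stated in this description.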

This is 1 of 3 alternative approaches; see the RFC document and sister PRs for comparison.

Test plan

  • Verified on Qwen3.5-9B with TP=4: the extracted hidden states are non-zero
  • Verified on standard (non-hybrid) model — no regression
  • pre-commit run --all-files passes on changed files

🤖 Generated with Claude Code

Instead of routing hidden states through a CacheOnly attention layer and
the KV cache pipeline, the proposer stacks target-model intermediate
hidden states and passes them directly to the connector. The connector
accumulates per-request tensors on CPU and saves them when the request
finishes. This avoids adding any KV cache groups or modifying core KV
cache code, making it naturally compatible with hybrid models.

Key changes:
- Proposer: remove draft model loading, CUDAGraphs, attention metadata,
  and slot_mapping buffers. propose() calls connector.save_hidden_states()
- Connector: add save_hidden_states() for direct hidden state accumulation,
  save to disk in request_finished(). Implement SupportsHMA.
- Config: gate HMA disable with supports_hma() check so hybrid models
  keep their per-group block allocators when connector supports it
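
The config gate in the last item might look roughly like the sketch below. Everything here is illustrative: the marker class and `supports_hma()` are redefined locally as stand-ins for the names mentioned above, and `should_disable_hma`, `LegacyConnector`, and `HiddenStatesConnector` are hypothetical names that do not come from the PR.

```python
class SupportsHMA:
    """Hypothetical marker base class standing in for the interface named above."""


def supports_hma(connector_cls):
    """Return True when the connector class has opted in to HMA."""
    return issubclass(connector_cls, SupportsHMA)


class LegacyConnector:
    """Toy connector that has not opted in."""


class HiddenStatesConnector(SupportsHMA):
    """Toy connector that implements the marker."""


def should_disable_hma(connector_cls):
    """Disable HMA only when a connector is configured and lacks support."""
    if connector_cls is None:
        return False
    return not supports_hma(connector_cls)


decisions = {
    "no connector": should_disable_hma(None),
    "legacy connector": should_disable_hma(LegacyConnector),
    "hidden states connector": should_disable_hma(HiddenStatesConnector),
}
```

Under this assumption, only the legacy connector forces HMA off, so a hybrid model paired with a supporting connector keeps its per-group block allocators.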

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

request_finished() runs on the scheduler process which has a separate
connector instance with no accumulated data. Move the disk write into
save_hidden_states() which runs on the worker, so the file is ready
when the scheduler reports the path.
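
A toy model of the two-instance problem is sketched below, under loud assumptions: two objects in one process stand in for the scheduler-side and worker-side connector instances, and `ToyConnector`, its text file format, and the shared directory are invented for the illustration. Only the method names `save_hidden_states()` and `request_finished()` come from the commit message.

```python
import tempfile
from pathlib import Path


class ToyConnector:
    """Hypothetical connector; each instance has its own in-memory state."""

    def __init__(self, out_dir):
        self.out_dir = Path(out_dir)
        self.accumulated = {}

    def _path(self, req_id):
        return self.out_dir / f"{req_id}.txt"

    def save_hidden_states(self, req_id, values):
        """Worker side: accumulate and write to disk in the same call."""
        self.accumulated.setdefault(req_id, []).extend(values)
        text = " ".join(str(v) for v in self.accumulated[req_id])
        self._path(req_id).write_text(text)

    def request_finished(self, req_id):
        """Scheduler side: only report the path, never touch the data."""
        return str(self._path(req_id))


shared_dir = tempfile.mkdtemp()
worker_side = ToyConnector(shared_dir)
scheduler_side = ToyConnector(shared_dir)

worker_side.save_hidden_states("req-0", [0.25, 0.5])
worker_side.save_hidden_states("req-0", [0.75])

# The scheduler-side instance never saw the data, but the file already exists.
reported_path = scheduler_side.request_finished("req-0")
```

Because the write happens where the data lives, the scheduler-side instance needs nothing beyond the file naming convention to report a valid path.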

Signed-off-by: Rahul Tuli <rtuli@redhat.com>