
[Spec Decode] Support hybrid attention models in extract_hidden_states#39949

Open
mgoin wants to merge 1 commit into vllm-project:main from neuralmagic:extract-hidden-states-hybrid

Conversation


@mgoin mgoin commented Apr 15, 2026

Summary

Hidden-state extraction (extract_hidden_states speculative method) currently doesn't work on hybrid-attention models like Qwen3.5 (GatedDeltaNet + full attention). The failure chain:

  1. kv_transfer_config is set → HMA unconditionally force-disabled in VllmConfig.__post_init__
  2. unify_hybrid_kv_cache_specs tries to fold all specs into one type → can't handle MambaSpec alongside attention specs → ValueError

This PR fixes the issue by letting connectors that declare SupportsHMA keep HMA enabled, and teaching the KV cache grouping to handle the cache-only hidden-state layer alongside hybrid attention groups.

Approach

Marker spec class — HiddenStateCacheSpec is a thin subclass of MLAAttentionSpec with no behavioral overrides. It exists purely as a type tag so get_kv_cache_groups can identify cache-only layers. Because it inherits from MLAAttentionSpec (and therefore FullAttentionSpec), it passes through all existing isinstance checks, is_uniform_type, spec_manager_map, and find_longest_cache_hit without any changes to those paths.
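The type-tag pattern described above can be sketched in a few lines. This is a minimal illustration with stand-in dataclasses, not vLLM's actual spec definitions — the real classes carry fields like block_size and page-size accounting:

```python
from dataclasses import dataclass


# Hypothetical stand-ins for vLLM's spec classes, for illustration only.
@dataclass(frozen=True)
class FullAttentionSpec:
    block_size: int
    page_size_bytes: int


@dataclass(frozen=True)
class MLAAttentionSpec(FullAttentionSpec):
    pass


# The marker: no new fields, no overrides — purely a type tag.
@dataclass(frozen=True)
class HiddenStateCacheSpec(MLAAttentionSpec):
    pass


spec = HiddenStateCacheSpec(block_size=16, page_size_bytes=4096)

# Passes all existing isinstance-based dispatch unchanged...
assert isinstance(spec, FullAttentionSpec)
assert isinstance(spec, MLAAttentionSpec)
# ...while remaining identifiable for cache-only filtering.
assert type(spec) is HiddenStateCacheSpec
```

Because the subclass adds nothing, every code path that matches on the parent types keeps working; only code that explicitly asks for the marker type behaves differently.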

Filter-before-group, add-back-after — For hybrid models, the cache-only layer's page size (determined by num_aux_hidden_states × hidden_size) generally won't divide evenly into the Mamba-aligned common page. Rather than modifying the unification or grouping algorithms, get_kv_cache_groups filters HiddenStateCacheSpec layers out before calling unify_kv_cache_spec_page_size and _get_kv_cache_groups_uniform_page_size (both untouched), then adds them back as their own 1-layer groups with block_size shrunk and page_size_padded aligned to the common page.
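The filter-before-group flow can be sketched as follows. All names here (align_up, get_groups, the dict-based specs) are hypothetical simplifications — the real get_kv_cache_groups operates on spec objects and group dataclasses:

```python
# Sketch of filter-before-group, add-back-after: pull cache-only specs
# out, run the untouched grouping logic on the rest, then append each
# cache-only layer as its own 1-layer group with a padded page size.

def align_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple


def get_groups(specs, group_fn, is_cache_only, common_page_bytes):
    regular = {n: s for n, s in specs.items() if not is_cache_only(s)}
    cache_only = {n: s for n, s in specs.items() if is_cache_only(s)}

    groups = group_fn(regular)  # existing grouping algorithm, unchanged

    for name, spec in cache_only.items():
        # Pad the cache-only layer's page up to the common page size so
        # allocation stays aligned with the hybrid groups.
        padded = align_up(spec["page_size_bytes"], common_page_bytes)
        groups.append({"layers": [name], "page_size_padded": padded})
    return groups


specs = {
    "attn.0": {"page_size_bytes": 4096, "cache_only": False},
    "mamba.0": {"page_size_bytes": 4096, "cache_only": False},
    "hidden.0": {"page_size_bytes": 1536, "cache_only": True},
}
groups = get_groups(
    specs,
    group_fn=lambda s: [{"layers": sorted(s), "page_size_padded": 4096}],
    is_cache_only=lambda s: s["cache_only"],
    common_page_bytes=4096,
)
assert groups[-1] == {"layers": ["hidden.0"], "page_size_padded": 4096}
```

The key design point is that the unification and uniform-page grouping functions never see the cache-only spec, so they need no special cases.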

Strided tensor reshape — The page padding means the allocated tensor has gaps between blocks. gpu_model_runner._reshape_kv_cache_tensors gets an as_strided branch (guarded by page_size_padded > real_page_size_bytes) that sets the block-level stride to span the full padded page, matching the pattern MambaSpec already uses for its state tensors.
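The strided view over a padded allocation looks roughly like this. The example uses NumPy's as_strided for portability (the actual code uses torch.Tensor.as_strided), and all sizes are made up:

```python
import numpy as np

# Toy illustration: a flat allocation where each block occupies a
# padded page, but only the first `real_page_floats` of each page hold
# data. The block-level stride spans the full padded page, so the view
# skips the padding gap at the end of each block.
real_page_floats = 6      # floats actually used per block
padded_page_floats = 8    # allocation stride per block (padded)
num_blocks = 4

raw = np.arange(num_blocks * padded_page_floats, dtype=np.float32)

view = np.lib.stride_tricks.as_strided(
    raw,
    shape=(num_blocks, real_page_floats),
    strides=(padded_page_floats * raw.itemsize, raw.itemsize),
)
assert view.shape == (4, 6)
assert view[1, 0] == 8.0  # second block starts one padded page in
assert not view.flags["C_CONTIGUOUS"]  # gaps make the view strided
```

This is the same trick MambaSpec uses for its state tensors: the logical shape stays (num_blocks, page_elems), while the stride accounts for the padding.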

Connector fixes — ExampleHiddenStatesConnector now inherits SupportsHMA and reads slot_mapping directly from attn_metadata instead of recomputing it from scheduler block IDs (which use the wrong block size under HMA). The now-dead ReqMeta.slot_mapping field and its per-request CPU tensor allocation are removed.

Proposer group selection — ExtractHiddenStatesProposer records its kv_cache_gid in validate_same_kv_cache_group so the model runner selects the correct common_attn_metadata for the cache-only group, matching the existing EagleProposer pattern.
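A minimal sketch of that group-id recording, with hypothetical names and a list-of-lists stand-in for the real KV cache group structures:

```python
# Hypothetical sketch: the proposer finds which KV cache group owns its
# layers and records the group id so the runner can later pick the
# matching common attention metadata.
class Proposer:
    def __init__(self):
        self.kv_cache_gid = None

    def validate_same_kv_cache_group(self, kv_cache_groups, my_layers):
        gids = {
            gid
            for gid, group in enumerate(kv_cache_groups)
            if set(group) & set(my_layers)
        }
        assert len(gids) == 1, "proposer layers must share one group"
        self.kv_cache_gid = gids.pop()


p = Proposer()
p.validate_same_kv_cache_group(
    [["attn.0", "attn.1"], ["hidden.0"]],
    my_layers=["hidden.0"],
)
assert p.kv_cache_gid == 1  # the cache-only group
```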

Block/offset cache ops — basic_cache and extract_from_kv_cache use slot_mapping // block_size / slot_mapping % block_size indexing instead of .view() / .flatten(), which works on non-contiguous (strided) tensors.
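The block/offset indexing above can be demonstrated in a few lines. NumPy stands in for torch here; the shapes and values are illustrative:

```python
import numpy as np

# Decompose global slot ids into (block, offset) pairs and index the
# 2-D cache directly. Unlike .flatten()/.view(), fancy indexing works
# in place on non-contiguous (strided) arrays.
block_size = 4
kv_cache = np.zeros((3, block_size), dtype=np.float32)  # (blocks, slots)
slot_mapping = np.array([1, 5, 10])  # global slot ids
values = np.array([7.0, 8.0, 9.0], dtype=np.float32)

blocks = slot_mapping // block_size
offsets = slot_mapping % block_size
kv_cache[blocks, offsets] = values  # in-place scatter, no copy needed

assert kv_cache[0, 1] == 7.0  # slot 1  -> block 0, offset 1
assert kv_cache[1, 1] == 8.0  # slot 5  -> block 1, offset 1
assert kv_cache[2, 2] == 9.0  # slot 10 -> block 2, offset 2
```

Flattening a strided view would either copy (breaking in-place writes) or raise; two-level indexing sidesteps the contiguity requirement entirely.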

Test plan

  • tests/v1/kv_connector/extract_hidden_states_integration/test_extraction.py — Llama end-to-end (GPU)
  • Qwen3.5-4B + extract_hidden_states — hybrid model end-to-end (GPU), hidden states shape [N, 3, 2560] with non-zero values
  • pre-commit run ruff-check / ruff-format / mypy-3.10 — all passing
  • CI

🤖 AI-assisted (Claude)


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the HiddenStateCacheSpec to support hidden-state extraction within the vLLM V1 engine. Key changes include updating the KV cache grouping heuristics to prevent singleton cache-only layers from collapsing group sizes, refactoring the ExampleHiddenStatesConnector to utilize attn_metadata.slot_mapping directly, and implementing dynamic HMA (Hybrid Memory Architecture) support checks for connectors. Feedback is provided regarding the max_memory_usage_bytes implementation in HiddenStateCacheSpec, which currently fails to account for context parallelism, potentially leading to memory over-estimation during initialization.

Hidden-state extraction breaks on hybrid-attention models (e.g.
Qwen3.5) because kv_transfer_config force-disables HMA and
unify_hybrid_kv_cache_specs cannot fold MambaSpec into a uniform type.

Fix by gating HMA-disable on supports_hma(connector_cls), making
ExampleHiddenStatesConnector a SupportsHMA subclass, and handling the
cache-only layer's page alignment for hybrid models. Key changes:

- HiddenStateCacheSpec: thin marker subclass of MLAAttentionSpec
  (inherits all dispatch behavior, no overrides). Defined in
  kv_cache_interface.py, registered in spec_manager_map.
- get_kv_cache_groups: filter HiddenStateCacheSpec out before
  unify/grouping, add back as 1-layer group with page_size_padded
  aligned to the common page. General sub-functions untouched.
- gpu_model_runner: as_strided reshape branch for padded specs
  (page_size_padded > real_page), proposer isinstance for kv_cache_gid.
- Connector: read slot_mapping from attn_metadata (not scheduler
  block_ids), remove dead ReqMeta.slot_mapping field.
- Proposer: kv_cache_gid for correct common_attn_metadata selection.
- basic_cache/extract_from_kv_cache: block/offset indexing instead of
  flatten (works on non-contiguous strided tensors).

Verified: Llama integration test + Qwen3.5-4B end-to-end on GPU.

Signed-off-by: mgoin <mgoin64@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mgoin mgoin force-pushed the extract-hidden-states-hybrid branch from 12019e0 to 530539a Compare April 16, 2026 16:43
@mgoin mgoin requested a review from xuechendi as a code owner April 16, 2026 16:43
