[Spec Decode] Support hybrid attention models in extract_hidden_states#39949
Open
mgoin wants to merge 1 commit into vllm-project:main from
Conversation
Contributor
Code Review
This pull request introduces the HiddenStateCacheSpec to support hidden-state extraction within the vLLM V1 engine. Key changes include updating the KV cache grouping heuristics to prevent singleton cache-only layers from collapsing group sizes, refactoring the ExampleHiddenStatesConnector to utilize attn_metadata.slot_mapping directly, and implementing dynamic HMA (Hybrid Memory Architecture) support checks for connectors. Feedback is provided regarding the max_memory_usage_bytes implementation in HiddenStateCacheSpec, which currently fails to account for context parallelism, potentially leading to memory over-estimation during initialization.
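The block/offset slot addressing that replaces the old flatten-based lookup can be illustrated with a toy sketch (NumPy here for simplicity; this is not vLLM code, just the indexing idea):

```python
import numpy as np

# Toy sketch of block/offset indexing: given a flat slot id from
# slot_mapping, the cache entry lives at kv[slot // block_size,
# slot % block_size].  This addresses slots in place even when `kv`
# is a non-contiguous strided view, where .flatten() would force a
# copy (or fail on torch views).
block_size = 4
kv = np.arange(3 * block_size).reshape(3, block_size)  # [num_blocks, block_size]
slot_mapping = np.array([0, 5, 10, 3])

blocks = slot_mapping // block_size   # which block each slot falls in
offsets = slot_mapping % block_size   # position within that block
gathered = kv[blocks, offsets]        # == [0, 5, 10, 3] since kv[i, j] = i*4 + j
```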
Hidden-state extraction breaks on hybrid-attention models (e.g. Qwen3.5) because `kv_transfer_config` force-disables HMA and `unify_hybrid_kv_cache_specs` cannot fold `MambaSpec` into a uniform type. Fix by gating the HMA disable on `supports_hma(connector_cls)`, making `ExampleHiddenStatesConnector` a `SupportsHMA` subclass, and handling the cache-only layer's page alignment for hybrid models.

Key changes:
- `HiddenStateCacheSpec`: thin marker subclass of `MLAAttentionSpec` (inherits all dispatch behavior, no overrides). Defined in `kv_cache_interface.py`, registered in `spec_manager_map`.
- `get_kv_cache_groups`: filter `HiddenStateCacheSpec` out before unify/grouping, add back as a 1-layer group with `page_size_padded` aligned to the common page. General sub-functions untouched.
- `gpu_model_runner`: `as_strided` reshape branch for padded specs (`page_size_padded` > real page), proposer `isinstance` check for `kv_cache_gid`.
- Connector: read `slot_mapping` from `attn_metadata` (not scheduler block IDs), remove dead `ReqMeta.slot_mapping` field.
- Proposer: `kv_cache_gid` for correct `common_attn_metadata` selection.
- `basic_cache`/`extract_from_kv_cache`: block/offset indexing instead of flatten (works on non-contiguous strided tensors).

Verified: Llama integration test + Qwen3.5-4B end-to-end on GPU.

Signed-off-by: mgoin <mgoin64@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Hidden-state extraction (the `extract_hidden_states` speculative method) currently doesn't work on hybrid-attention models like Qwen3.5 (GatedDeltaNet + full attention). The failure chain:
- `kv_transfer_config` is set → HMA is unconditionally force-disabled in `VllmConfig.__post_init__`
- `unify_hybrid_kv_cache_specs` tries to fold all specs into one type → can't handle `MambaSpec` alongside attention specs → `ValueError`

This PR fixes the issue by letting connectors that declare `SupportsHMA` keep HMA enabled, and teaching the KV cache grouping to handle the cache-only hidden-state layer alongside hybrid attention groups.

Approach
- **Marker spec class** — `HiddenStateCacheSpec` is a thin subclass of `MLAAttentionSpec` with no behavioral overrides. It exists purely as a type tag so `get_kv_cache_groups` can identify cache-only layers. Because it inherits from `MLAAttentionSpec` → `FullAttentionSpec`, it passes through all existing `isinstance` checks, `is_uniform_type`, `spec_manager_map`, and `find_longest_cache_hit` without any changes to those paths.
- **Filter-before-group, add-back-after** — For hybrid models, the cache-only layer's page size (determined by `num_aux_hidden_states × hidden_size`) generally won't divide evenly into the Mamba-aligned common page. Rather than modifying the unification or grouping algorithms, `get_kv_cache_groups` filters `HiddenStateCacheSpec` layers out before calling `unify_kv_cache_spec_page_size` and `_get_kv_cache_groups_uniform_page_size` (both untouched), then adds them back as their own 1-layer groups with `block_size` shrunk and `page_size_padded` aligned to the common page.
- **Strided tensor reshape** — The page padding means the allocated tensor has gaps between blocks. `gpu_model_runner._reshape_kv_cache_tensors` gets an `as_strided` branch (guarded by `page_size_padded > real_page_size_bytes`) that sets the block-level stride to span the full padded page, matching the pattern `MambaSpec` already uses for its state tensors.
- **Connector fixes** — `ExampleHiddenStatesConnector` now inherits `SupportsHMA` and reads `slot_mapping` directly from `attn_metadata` instead of recomputing it from scheduler block IDs (which use the wrong block size under HMA). The now-dead `ReqMeta.slot_mapping` field and its per-request CPU tensor allocation are removed.
- **Proposer group selection** — `ExtractHiddenStatesProposer` records its `kv_cache_gid` in `validate_same_kv_cache_group` so the model runner selects the correct `common_attn_metadata` for the cache-only group, matching the existing `EagleProposer` pattern.
- **Block/offset cache ops** — `basic_cache` and `extract_from_kv_cache` use `slot_mapping // block_size` / `slot_mapping % block_size` indexing instead of `.view()`/`.flatten()`, which works on non-contiguous (strided) tensors.

Test plan
- `tests/v1/kv_connector/extract_hidden_states_integration/test_extraction.py` — Llama end-to-end (GPU)
- `extract_hidden_states` — hybrid model end-to-end (GPU), hidden states shape `[N, 3, 2560]` with non-zero values
- `pre-commit run ruff-check / ruff-format / mypy-3.10` — all passing

🤖 AI-assisted (Claude)
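The filter-before-group / add-back-after flow from the Approach section can be sketched roughly as below. `HiddenStateCacheSpec` follows the PR's naming, but `Spec`, `Group`, and the inline grouping step are simplified stand-ins for the real vLLM structures and helpers:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Spec:
    page_size_bytes: int

@dataclass
class HiddenStateCacheSpec(Spec):
    # Marker subclass: no overrides, exists purely as a type tag so the
    # grouping code can recognize cache-only hidden-state layers.
    pass

@dataclass
class Group:
    layers: list
    page_size_padded: Optional[int] = None

def get_kv_cache_groups(specs: dict, common_page: int) -> list:
    # 1) Filter cache-only layers out before the existing unify/grouping
    #    helpers run on the attention/Mamba layers (shown here as one
    #    trivial group; the real helpers are untouched by the PR).
    hidden = {n: s for n, s in specs.items() if isinstance(s, HiddenStateCacheSpec)}
    rest = {n: s for n, s in specs.items() if n not in hidden}
    groups = [Group(layers=sorted(rest))]
    # 2) Add each cache-only layer back as its own 1-layer group, with
    #    its page rounded up to a multiple of the common page.
    for name, spec in hidden.items():
        padded = -(-spec.page_size_bytes // common_page) * common_page  # ceil-div
        groups.append(Group(layers=[name], page_size_padded=padded))
    return groups
```

For example, a hidden-state page of 15360 bytes against a 4096-byte common page gets padded to 16384 bytes, leaving a gap at the end of each block that the strided reshape later skips.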
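The `as_strided` branch can be modeled in miniature with NumPy (torch's `as_strided` behaves analogously; the sizes here are arbitrary toy values): the allocator hands back a flat buffer whose blocks sit a padded page apart, and a view whose block stride spans the padded page exposes only the real pages, without copying.

```python
import numpy as np

# Toy model of the padded-page reshape: only the first `real_page`
# elements of each `padded_page`-sized block hold data; the view's
# block-level stride spans the full padded page so the gaps are skipped.
num_blocks, real_page, padded_page = 4, 6, 8
buf = np.arange(num_blocks * padded_page, dtype=np.float32)

view = np.lib.stride_tricks.as_strided(
    buf,
    shape=(num_blocks, real_page),
    strides=(padded_page * buf.itemsize, buf.itemsize),
)
# view[b, i] aliases buf[b * padded_page + i]; elements 6 and 7 of each
# padded page are never visible through the view.
```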