
[RFC] Hybrid model ExtractHiddenStates: CacheOnly as filtered KV cache group#160

Draft
rahul-tuli wants to merge 2 commits into main from hma/extract-hidden-states-filtered-group

Conversation

@rahul-tuli
Member

Summary

  • Adds hybrid model support (e.g. Qwen3.5) for extract_hidden_states speculative decoding by modeling CacheOnly as its own KV cache group, filtered from page-size uniformity checks and group coordination
  • CacheOnlySpec(MLAAttentionSpec) is pre-filtered in get_kv_cache_groups() before type-unification, then appended as a separate group with joint memory budget accounting
  • Gates HMA disable with supports_hma() check so hybrid models keep per-group block allocators when the connector supports it
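The pre-filtering described above can be sketched as follows. This is a minimal, illustrative model of the spec hierarchy and of `get_kv_cache_groups()`, not the PR's actual code: the stand-in classes carry only a `page_size_bytes` field, and the real function also performs type-unification of the remaining groups, which is elided here.

```python
from dataclasses import dataclass

# Minimal stand-ins for the attention-spec hierarchy; the real specs
# carry block_size, head_size, dtype, and more.
@dataclass
class AttentionSpec:
    page_size_bytes: int

class FullAttentionSpec(AttentionSpec): pass
class MLAAttentionSpec(FullAttentionSpec): pass

# Spec introduced by this PR: cache-only layers that store hidden
# states but never attend over them.
class CacheOnlySpec(MLAAttentionSpec): pass

def get_kv_cache_groups(specs: dict[str, AttentionSpec]) -> list[list[str]]:
    """Sketch: pre-filter CacheOnlySpec before type-unification, then
    append the filtered layers as their own trailing group."""
    cache_only = {n: s for n, s in specs.items()
                  if isinstance(s, CacheOnlySpec)}
    regular = {n: s for n, s in specs.items() if n not in cache_only}
    # The real code would unify `regular` by spec type / page size here.
    groups = [list(regular)]
    if cache_only:
        groups.append(list(cache_only))
    return groups
```

The key point is that the filter runs on the concrete `CacheOnlySpec` type before any uniformity logic sees the specs, so mixed hybrid layouts never reach the unification path with a cache-only spec in the mix.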

This is 1 of 3 alternative approaches — see RFC document and sister PRs for comparison:

  • Approach A (this PR): CacheOnly as filtered KV cache group (9 files, +300/-78)
  • Approach B: CacheOnly as supplementary tensors — #TBD
  • Approach C: Bypass KV cache entirely — #TBD

Test plan

  • Verified on Qwen3.5-9B with TP=4, non-zero hidden states extracted
  • Verified on standard (non-hybrid) model — no regression
  • pre-commit run --all-files passes on changed files
  • Unit tests added in tests/v1/core/test_kv_cache_utils.py

🤖 Generated with Claude Code

- Add CacheOnlySpec(MLAAttentionSpec) to kv_cache_interface.py so it
    duck-types through all existing AttentionSpec code paths
- Pre-filter CacheOnlySpec in get_kv_cache_groups() before type-
    unification routing to prevent crashes with mixed spec types
- Joint budget calculation in get_kv_cache_config_from_groups() via
    extra_bytes_per_block parameter on get_num_blocks()
- Gate HMA disable in config with supports_hma() check so hybrid
    models keep their per-group block allocators
- Add SupportsHMA to ExampleHiddenStatesConnector with correct
    cache_group_idx for block_ids
- Resolve CacheOnly slot_mapping from per-layer mappings in the
    proposer instead of using main group's common_attn_metadata

Signed-off-by: Rahul-Tuli <rtuli@redhat.com>
CacheOnlySpec inherits from MLAAttentionSpec, which in turn inherits from FullAttentionSpec, so isinstance(cache_only, FullAttentionSpec) returns True. As a result, build_block_map_addrs includes CacheOnly layers in its page-size uniformity checks and crashes with "Non-uniform page sizes" on hybrid models.

Add explicit CacheOnlySpec filter after the FullAttentionSpec gate.
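A minimal reproduction of the gate and the fix, using stand-in classes rather than the real specs (the actual gate lives inside build_block_map_addrs; `included_in_uniformity_check` is a hypothetical helper for illustration):

```python
# CacheOnlySpec -> MLAAttentionSpec -> FullAttentionSpec, so the
# FullAttentionSpec isinstance gate alone also admits CacheOnly layers.
class FullAttentionSpec: pass
class MLAAttentionSpec(FullAttentionSpec): pass
class CacheOnlySpec(MLAAttentionSpec): pass

def included_in_uniformity_check(spec) -> bool:
    if not isinstance(spec, FullAttentionSpec):
        return False
    # Fix: explicit CacheOnlySpec filter after the FullAttentionSpec
    # gate, so cache-only pages never enter the page-size check.
    if isinstance(spec, CacheOnlySpec):
        return False
    return True
```

Ordering matters only for readability here; the essential change is that the subclass is excluded explicitly instead of riding through on the base-class check.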

Signed-off-by: Rahul Tuli <rtuli@redhat.com>