Changes from all commits (42 commits):
7a297da  Add ARCHITECTURE.md (RGBmarya, Nov 27, 2025)
a1b81db  Merge branch 'main' of https://github.com/RGBmarya/vllm (RGBmarya, Nov 27, 2025)
cfab2ca  feat: part 1 (RGBmarya, Nov 27, 2025)
4133f5b  refactor: use ungated model (RGBmarya, Nov 27, 2025)
1f3774e  fix: init issues (RGBmarya, Nov 27, 2025)
35cbb96  add benchmarks (RGBmarya, Nov 27, 2025)
8e5a2a2  fix: test misalignment (RGBmarya, Nov 27, 2025)
7aba95a  feat: integrate real Mamba kernel into HybridAttention (RGBmarya, Nov 28, 2025)
cf23e82  test fixes (RGBmarya, Nov 28, 2025)
7d5d502  fix: add dist_init (RGBmarya, Nov 28, 2025)
9ea5fe4  fix: add missing triton_metadata attr (RGBmarya, Nov 28, 2025)
d24d7e8  feat: add tests (RGBmarya, Nov 28, 2025)
437a10c  fix: __dict__ (RGBmarya, Nov 28, 2025)
2098bf7  chore: add CPU offloading (RGBmarya, Nov 28, 2025)
8873038  test: add synthetic video test (RGBmarya, Nov 28, 2025)
38cca57  refactor: change QA video test to synthetic (RGBmarya, Nov 28, 2025)
bd41fe0  works?? (RGBmarya, Nov 28, 2025)
d62e285  add new test (RGBmarya, Nov 28, 2025)
c90043d  fix import issue (RGBmarya, Nov 28, 2025)
3c8231d  add paper (RGBmarya, Nov 28, 2025)
a757376  feat: llama-3.2 with hybrid attention (RGBmarya, Dec 1, 2025)
69d854b  fix: llama (RGBmarya, Dec 1, 2025)
475dc3e  potential fixes (RGBmarya, Dec 1, 2025)
4b0e277  fix: visualization (RGBmarya, Dec 1, 2025)
c44777f  chore: results (Dec 1, 2025)
044d143  fix: hybrid benchmarks (RGBmarya, Dec 1, 2025)
afa9d1a  feat: add video benchmark (RGBmarya, Dec 1, 2025)
44dc187  fix: CUDA usage (RGBmarya, Dec 1, 2025)
bdaa3fa  fix: premature CUDA init (RGBmarya, Dec 1, 2025)
cc1f313  results 2 (Dec 1, 2025)
9b30a98  feat: add visualization (RGBmarya, Dec 1, 2025)
785bf28  test: 48 frames (Dec 1, 2025)
dcf8eb1  feat: add streaming benchmark (RGBmarya, Dec 1, 2025)
4e60821  50 it results (Dec 9, 2025)
070bf07  Merge remote-tracking branch 'refs/remotes/origin/main' (Dec 9, 2025)
5236b54  results (Dec 9, 2025)
2956991  results (Dec 10, 2025)
b3454ef  evaluation procedure (RGBmarya, Dec 10, 2025)
ec5a39c  various results (RGBmarya, Dec 10, 2025)
183cfd1  fix: python path (RGBmarya, Dec 10, 2025)
345620f  eval results (Dec 10, 2025)
f09a802  refactor: frame batch size 1 (RGBmarya, Dec 10, 2025)

156 changes: 156 additions & 0 deletions .cursor/plans/hy-666d8271.plan.md
@@ -0,0 +1,156 @@
<!-- 666d8271-59ea-461e-b7bd-440e94fd7c3a 0fde745b-265b-4516-9b96-a199c5d286b0 -->
# Hybrid SSM + Sliding-Window KV: Implementation Plan

### 1. Solidify high-level architecture

- **Goal**: Combine precise sliding-window attention over recent tokens with a compressed SSM state that summarizes the distant past, while reusing existing vLLM components.
- **Key decisions**:
- **SSM state storage**: Reuse the existing Mamba-style KV pool (`MambaSpec` / `MambaManager`) as a separate state pool, rather than packing SSM state into normal KV blocks (phase 2 option).
- **Attention compute**: Keep standard paged/sliding-window attention kernels (Triton unified attention) untouched and fuse SSM only at the Python backend level.
- **Minimal surface changes**: Add a new SSM adapter module, a hybrid attention backend/impl, and a hybrid attention layer; avoid changing `KVCacheManager`, `BlockPool`, or CUDA kernels.

### 2. Design and implement an SSM adapter (history branch)

- **2.1. API and placement**
- Add a new module `vllm/model_executor/layers/hybrid_ssm_adapter.py` (or adjacent to `mamba_mixer.py`).
- Expose a small class, e.g. `HybridSSMAdapter`, with methods (a skeleton is sketched at the end of this section):
- `get_kv_cache_spec(vllm_config: VllmConfig) -> KVCacheSpec | None` (returns a `MambaSpec` or a thin wrapper) so it can obtain its own KV pool if needed.
- `get_state_shape()` / `get_state_dtype()` if using `MambaBase` inheritance.
- `forward_history_branch_prefill(hidden_states, attn_metadata) -> torch.Tensor`.
- `forward_history_branch_decode(hidden_states, attn_metadata) -> torch.Tensor`.
- **2.2. Reuse Mamba SSM flows**
- Use `MambaMixer.forward_cuda` (`vllm/model_executor/layers/mamba/mamba_mixer.py`) as a reference for:
- How to split prefill vs decode tokens using `Mamba1AttentionMetadata`.
- How to wire `causal_conv1d_fn`, `selective_scan_fn`, `selective_state_update` and state indices.
- For **prefill**:
- Implement `forward_history_branch_prefill` that:
- Takes a contiguous prompt segment (from `hidden_states` and `attn_metadata.query_start_loc`) and runs the SSM scan path (`selective_scan_fn` and associated Triton kernels in `ssd_chunk_state.py` / `ssd_state_passing.py`).
- Writes the resulting SSM state into `self.kv_cache` (same pattern as `MambaMixer.kv_cache`).
- Optionally returns a per-token SSM output if you want SSM to influence prompt outputs.
- For **decode**:
- Implement `forward_history_branch_decode` that:
- Uses `selective_state_update` and `ssd_state_passing` to apply one or a few recurrent state updates per decode token, based on `Mamba1AttentionMetadata` indices.
- Produces `ssm_out` with shape `[num_tokens, num_heads, head_dim]` aligned with the Triton attention output.
- Updates `self.kv_cache` state in-place.
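
A minimal skeleton of the adapter surface described in 2.1 is sketched below. It assumes the method names above; the real class would likely inherit `MambaBase` and reuse `MambaMixer`'s kernel wiring instead of the placeholder bodies shown here.

```python
# Sketch only: proposed vllm/model_executor/layers/hybrid_ssm_adapter.py.
# VllmConfig / KVCacheSpec / MambaSpec come from vLLM; the method bodies are
# placeholders for the kernel wiring described in 2.2.
import torch
import torch.nn as nn


class HybridSSMAdapter(nn.Module):
    """History branch: compresses distant context into a recurrent SSM state."""

    def __init__(self, num_heads: int, head_dim: int, state_size: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.state_size = state_size
        # Bound by the KV cache manager, mirroring MambaMixer.kv_cache.
        self.kv_cache: tuple[torch.Tensor, ...] = ()

    def get_kv_cache_spec(self, vllm_config) -> "KVCacheSpec | None":
        # Return a MambaSpec (or thin wrapper) so this module gets its own
        # state pool managed by MambaManager; None disables the extra pool.
        raise NotImplementedError

    def get_state_shape(self) -> tuple[int, ...]:
        return (self.num_heads, self.head_dim, self.state_size)

    def get_state_dtype(self) -> torch.dtype:
        return torch.float32

    def forward_history_branch_prefill(self, hidden_states, attn_metadata):
        # Chunked scan over the prompt (selective_scan_fn path); writes the
        # final state into self.kv_cache, optionally returns per-token output.
        raise NotImplementedError

    def forward_history_branch_decode(self, hidden_states, attn_metadata):
        # One recurrent update per decode token (selective_state_update path);
        # returns ssm_out of shape [num_tokens, num_heads, head_dim].
        raise NotImplementedError
```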

### 3. Implement `HybridAttentionImpl` on top of Triton attention

- **3.1. Backend scaffolding**
- Create `vllm/v1/attention/backends/hybrid_attn.py`.
- Define `HybridAttentionMetadata` as a thin alias or reuse `TritonAttentionMetadata` from `triton_attn.py`.
- Implement `HybridAttentionBackend(AttentionBackend)` similar to `TritonAttentionBackend`:
- `get_builder_cls() -> TritonAttentionMetadataBuilder` (reuse unchanged).
- `get_impl_cls() -> HybridAttentionImpl`.
- `get_name()` and feature flags (supported dtypes, kv_cache_dtypes, cascade support=false, etc.).

- **3.2. HybridAttentionImpl.forward**
- Model it after `TritonAttentionImpl.forward` (`vllm/v1/attention/backends/triton_attn.py`):
- Hold an internal `TritonAttentionImpl` instance constructed with the same constructor args.
- Implement `forward` as follows (a sketch follows at the end of this section):

1. **Sliding-window path**:

- Call `self.triton_impl.forward(layer, query, key, value, kv_cache, attn_metadata, output, ...)` to:
- Write K/V into the standard paged KV cache via `triton_reshape_and_cache_flash`.
- Call `unified_attention(...)` (`vllm/attention/ops/triton_unified_attention.py`), including the `window_size` (sliding window) and `block_table`.
- At this stage `output[:num_actual_tokens]` contains the sliding-window attention result.

2. **SSM history path**:

- Call the adapter, e.g. `ssm_out = self.ssm_adapter.forward_history_branch_decode(query_or_hidden_states, attn_metadata)` for decode, and similar for prefill if desired.
- Ensure `ssm_out` is indexed over the same flattened token set as `output` (use `attn_metadata.num_actual_tokens`, `query_start_loc`, etc.).

3. **Fusion**:

- Add the SSM contribution into the output:
- `output[:num_actual_tokens] += ssm_out[:num_actual_tokens]`.
- Return `output`.

- **3.3. Constructor wiring**
- Modify the `__init__` of `HybridAttentionImpl` to:
- Accept either a `HybridSSMAdapter` or construct it from the layer (e.g. through a `layer.ssm_adapter` reference).
- Mirror key fields: `num_heads`, `head_size`, `num_kv_heads`, `scale`, `sliding_window`, `kv_cache_dtype`, etc.
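
A sketch of the forward flow in 3.2, wrapping an internal `TritonAttentionImpl`. Argument names follow this plan rather than the exact vLLM signatures, and the reshape at the end only accounts for `ssm_out` being per-head while `output` is flattened.

```python
# Sketch only: core of HybridAttentionImpl.forward (3.2). Argument names follow
# this plan, not necessarily the exact TritonAttentionImpl signature.
import torch


class HybridAttentionImpl:
    def __init__(self, triton_impl, ssm_adapter):
        self.triton_impl = triton_impl    # TritonAttentionImpl, same ctor args
        self.ssm_adapter = ssm_adapter    # HybridSSMAdapter from section 2

    def forward(self, layer, query, key, value, kv_cache, attn_metadata,
                output: torch.Tensor) -> torch.Tensor:
        num_actual_tokens = attn_metadata.num_actual_tokens

        # 1. Sliding-window path: paged KV write + unified_attention; leaves
        #    the windowed result in output[:num_actual_tokens].
        self.triton_impl.forward(layer, query, key, value, kv_cache,
                                 attn_metadata, output=output)

        # 2. SSM history path: recurrent contribution over the same tokens.
        ssm_out = self.ssm_adapter.forward_history_branch_decode(
            query, attn_metadata)

        # 3. Fusion: additive combination, flattening heads if needed.
        output[:num_actual_tokens] += ssm_out[:num_actual_tokens].reshape(
            num_actual_tokens, -1)
        return output
```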

### 4. Define `HybridAttentionLayer` that exposes KV spec and backend

- **4.1. Layer class**
- Add `vllm/model_executor/layers/hybrid_attn_layer.py` that implements `AttentionLayerBase`:
- Inherit from `torch.nn.Module` and `AttentionLayerBase`.
- Contain standard Q/K/V projection modules and any extra weights needed.
- Own a `HybridSSMAdapter` instance.
- **4.2. KV cache spec for sliding-window KV**
- Implement `get_kv_cache_spec(self, vllm_config: VllmConfig) -> KVCacheSpec | None` using `SlidingWindowSpec` from `vllm/v1/kv_cache_interface.py`:
- Use `vllm_config.cache_config.block_size`.
- Use `model_config.get_num_kv_heads`, `model_config.get_head_size`, `model_config.dtype`.
- Set `sliding_window=self.sliding_window`.
- This keeps all sliding-window KV behavior in place and uses `SlidingWindowManager` in `single_type_kv_cache_manager.py`.
- **4.3. Backend selection**
- Implement `get_attn_backend(self) -> type[AttentionBackend]` to return `HybridAttentionBackend` (see the sketch after this section).
- Ensure that the model’s layer registration (in the model implementation) uses `HybridAttentionLayer` instead of a plain attention layer for the desired blocks.
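
A sketch of 4.2/4.3; the `SlidingWindowSpec` keyword names below follow this plan and should be checked against `vllm/v1/kv_cache_interface.py` before use.

```python
# Sketch only: KV-spec and backend hooks for HybridAttentionLayer (4.2/4.3).
# Keyword names are assumptions to be verified against SlidingWindowSpec.
import torch.nn as nn


class HybridAttentionLayer(nn.Module):
    def __init__(self, model_config, cache_config, sliding_window: int):
        super().__init__()
        self.model_config = model_config
        self.sliding_window = sliding_window
        # Q/K/V projections and the HybridSSMAdapter are owned here (4.1).

    def get_kv_cache_spec(self, vllm_config):
        from vllm.v1.kv_cache_interface import SlidingWindowSpec
        return SlidingWindowSpec(
            block_size=vllm_config.cache_config.block_size,
            num_kv_heads=self.model_config.get_num_kv_heads(
                vllm_config.parallel_config),
            head_size=self.model_config.get_head_size(),
            dtype=self.model_config.dtype,
            sliding_window=self.sliding_window,
        )

    def get_attn_backend(self):
        # Proposed backend from section 3.
        from vllm.v1.attention.backends.hybrid_attn import HybridAttentionBackend
        return HybridAttentionBackend
```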

### 5. Wire into ModelRunner and KV cache manager

- **5.1. KV cache spec collection**
- Confirm that `gpu_model_runner.get_kv_cache_spec` already discovers your new layer:
```python
# vllm/v1/worker/gpu_model_runner.py
if spec := attn_module.get_kv_cache_spec(self.vllm_config):
kv_cache_spec[layer_name] = spec
```

- For layers that should share SSM state across blocks or use a separate SSM pool, decide whether the adapter is:
- Embedded inside `HybridAttentionLayer` (per-layer SSM state), or
- A shared Mamba-style module referenced via KV-sharing if needed.

- **5.2. KV grouping and managers**
- Let `kv_cache_utils.get_kv_cache_groups` and `get_kv_cache_configs` build groups normally:
- One group for sliding-window attention (using `SlidingWindowSpec`).
- One group for SSM state if you expose it as a `MambaSpec` group from `HybridSSMAdapter` / `MambaBase`.
- `SingleTypeKVCacheManager` will then create:
- `SlidingWindowManager` for attention KV.
- `MambaManager` for SSM state.
- No modifications required initially to `KVCacheManager` or `single_type_kv_cache_manager.py`.

### 6. Integration into a specific model

- **6.1. Choose where to introduce hybrid layers**
- Decide whether to:
- Replace all attention blocks with `HybridAttentionLayer`, or
- Use it only in a subset (e.g., every N-th layer or only later layers) for experimentation.
- **6.2. Modify the model definition**
- In the relevant model file under `vllm/model_executor/models/`, swap the attention class:
- Replace standard attention layers with `HybridAttentionLayer` where desired.
- Pass in `sliding_window` (and any SSM hyperparameters: state size, ranks, etc.) from config.
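
A sketch of the layer-selection choice in 6.1/6.2, using a hypothetical `hybrid_every_n` knob; `StandardAttentionLayer` is a stand-in for the model's existing attention class.

```python
# Sketch only: per-layer selection of hybrid vs standard attention (section 6).
# hybrid_every_n is a hypothetical experiment knob; StandardAttentionLayer is a
# stand-in for the model's existing attention class.
def build_attention_layer(layer_idx: int, config, hybrid_every_n: int = 4):
    use_hybrid = hybrid_every_n > 0 and layer_idx % hybrid_every_n == 0
    if use_hybrid:
        return HybridAttentionLayer(
            model_config=config.model_config,
            cache_config=config.cache_config,
            sliding_window=config.sliding_window,
            # plus SSM hyperparameters (state size, ranks, ...) from config
        )
    return StandardAttentionLayer(config)
```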

### 7. Testing & validation plan

- **7.1. Unit tests for SSM adapter**
- Add tests under `tests/model_executor/mamba/` or a new `tests/model_executor/hybrid_attn/` to verify:
- State shape/dtype match between adapter and `MambaStateShapeCalculator`.
- `forward_history_branch_prefill` produces identical results to MambaMixer for a toy sequence.
- `forward_history_branch_decode` updates state correctly and is consistent with sequential scan.
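
A self-contained reference (no vLLM imports) for the third check in 7.1: for a toy diagonal linear SSM, a chunked scan that carries state across chunks must match the token-by-token recurrence. In the real test, the chunked side is replaced by adapter prefill (first chunk) plus decode (remaining tokens).

```python
# Self-contained reference for the scan-vs-recurrence consistency check (7.1),
# using a toy diagonal linear SSM: s_t = a * s_{t-1} + B x_t, y_t = C s_t.
import torch


def sequential_scan(x, a, B, C):
    # x: [T, d_in]; a: [d_state] decay; B: [d_state, d_in]; C: [d_out, d_state]
    state = torch.zeros(a.shape[0], dtype=x.dtype)
    ys = []
    for t in range(x.shape[0]):
        state = a * state + B @ x[t]
        ys.append(C @ state)
    return torch.stack(ys), state


def chunked_scan(x, a, B, C, chunk: int = 4):
    # Processes the sequence chunk by chunk, carrying the state across chunks.
    # In the real test this is adapter prefill (first chunk) + decode (rest).
    state = torch.zeros(a.shape[0], dtype=x.dtype)
    ys = []
    for start in range(0, x.shape[0], chunk):
        for t in range(start, min(start + chunk, x.shape[0])):
            state = a * state + B @ x[t]
            ys.append(C @ state)
    return torch.stack(ys), state


def test_chunked_matches_sequential():
    torch.manual_seed(0)
    T, d_in, d_state, d_out = 13, 8, 16, 8
    x = torch.randn(T, d_in)
    a = torch.rand(d_state) * 0.9          # stable decays in (0, 0.9)
    B = torch.randn(d_state, d_in) * 0.1
    C = torch.randn(d_out, d_state) * 0.1
    y_seq, s_seq = sequential_scan(x, a, B, C)
    y_chk, s_chk = chunked_scan(x, a, B, C, chunk=5)
    torch.testing.assert_close(y_chk, y_seq)
    torch.testing.assert_close(s_chk, s_seq)
```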

- **7.2. End-to-end correctness (small models)**
- For a small model (e.g., LLaMA‑like or synthetic hybrid model):
- Compare outputs between:
- Standard full attention.
- Sliding-window only.
- Hybrid SSM + sliding-window (with SSM disabled / weights zeroed to sanity-check fusion).
- Confirm that enabling SSM branch changes outputs but disabling it recovers sliding-window behavior.

- **7.3. Performance and GPU memory**
- Benchmark end-to-end throughput using existing scripts, e.g. `benchmarks/benchmark_serving.py` or `benchmarks/benchmark_throughput.py`, on:
- Long-context prompts vs short prompts.
- Different window sizes and SSM state sizes.
- Verify that:
- Paged KV usage (`kv_cache_manager.usage`) is consistent with pure sliding-window.
- Additional SSM state pool fits within the memory budget defined by `MambaSpec.max_memory_usage_bytes`.

### To-dos

- [ ] Design and implement HybridSSMAdapter reusing Mamba SSM kernels for prefill and decode history branches.
- [ ] Create HybridAttentionBackend and HybridAttentionImpl that wrap TritonAttentionImpl and fuse SSM outputs into sliding-window attention outputs.
- [ ] Implement HybridAttentionLayer that exposes a SlidingWindowSpec KV cache spec and selects HybridAttentionBackend as its attention backend.
- [ ] Ensure gpu_model_runner and KVCacheManager correctly include hybrid layers and SSM state groups without changes to core managers.
- [ ] Swap selected model attention blocks to use HybridAttentionLayer with configured sliding window and SSM hyperparameters.
- [ ] Add unit and end-to-end tests plus benchmarks to validate correctness, stability, and performance of the hybrid attention path.
- [ ] Optionally design HybridSSMSpec and HybridSSMManager that store compressed history directly in KV blocks and integrate a state-update kernel into SlidingWindowManager.
114 changes: 114 additions & 0 deletions .cursor/plans/kv-1886f5f8.plan.md
@@ -0,0 +1,114 @@
<!-- 1886f5f8-c538-4179-9dac-0f0940c56205 37533ec7-3a84-451b-b31e-26d0f6b7c235 -->
# KV-Embedded Hybrid SSM – Phase 2 Plan

## Overview

- Implement a KV-embedded representation of SSM state that is updated on sliding-window eviction and read directly by attention kernels.
- Introduce a composite sliding-window+SSM manager, a state-update kernel, and unified-attention wiring while preserving the existing Phase 1 (separate SSM pool) path as a guarded fallback.

## 1. New KV cache spec for hybrid SSM state

- Define `HybridSSMSpec` in `vllm/v1/kv_cache_interface.py`:
- Subclass `KVCacheSpec` and add fields: `block_size`, `ssm_state_size`, `page_size_bytes`.
- Implement `max_memory_usage_bytes(...)` assuming one logical SSM state block per active sequence (or per KV group), consistent with how sliding-window groups account for memory.
- Document the relationship between `block_size` (tokens per block) and `ssm_state_size` (state dimension/compressed rank) and how `page_size_bytes` is derived.
- Integrate `HybridSSMSpec` into KV grouping in `vllm/v1/core/kv_cache_utils.py`:
- Extend `get_kv_cache_groups` / `get_kv_cache_configs` to accept a hybrid-SSM configuration and construct a corresponding SSM KV group.
- Decide how the SSM group aligns with existing sliding-window groups (e.g., one SSM group per sliding-window group) and ensure `max_memory_usage_bytes` includes both.
- Update `get_kv_cache_config_from_groups` to treat `HybridSSMSpec` as a first-class group alongside existing KV specs.
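
A sketch of the spec along the lines of this section; field and method names mirror the plan, and the exact base-class contract (e.g., whether `page_size_bytes` is a property and what `max_memory_usage_bytes` receives) should be taken from `vllm/v1/kv_cache_interface.py`.

```python
# Sketch only: HybridSSMSpec (section 1). The base-class contract should be
# taken from vllm/v1/kv_cache_interface.py; max_num_seqs is a simplification of
# whatever budget input the real max_memory_usage_bytes receives.
from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class HybridSSMSpec:  # would subclass KVCacheSpec
    block_size: int          # tokens represented per absorbed KV block
    num_heads: int
    head_dim: int
    ssm_state_size: int      # compressed state rank per head
    dtype: torch.dtype

    @property
    def page_size_bytes(self) -> int:
        # One SSM "page" = the per-head state matrices for one sequence.
        return (self.num_heads * self.head_dim * self.ssm_state_size
                * self.dtype.itemsize)

    def max_memory_usage_bytes(self, max_num_seqs: int) -> int:
        # One logical SSM state block per active sequence (per KV group).
        return max_num_seqs * self.page_size_bytes
```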

## 2. Composite Hybrid Sliding-Window + SSM manager

- Implement `HybridSSMManager` in `vllm/v1/core/single_type_kv_cache_manager.py`:
- Subclass `SingleTypeKVCacheManager` and parameterize it with `HybridSSMSpec`.
- Use a `BlockPool` to allocate a fixed number of SSM blocks per request (e.g., one per request per group), and track `request_id -> KVCacheBlock` mappings.
- Expose helpers:
- `get_blocks(request_id)` → current SSM block(s) for the request.
- `get_state_ptrs_or_offsets(request_id, ...)` → device pointers/offsets for SSM state regions required by kernels.
- Implement `HybridSlidingWindowManager` (composite manager) in the same module:
- Subclass `SlidingWindowManager` and hold a `HybridSSMManager` instance for the paired SSM group.
- Override/extend initialization to wire in both the sliding-window KV group and its associated SSM group, including any group IDs or indices used elsewhere.
- Provide convenience methods to resolve, for a `(kv_cache_group_id, request_id)`, both the standard KV blocks (for attention) and the SSM block (for state).
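
A structural sketch of the two managers in this section, showing only ownership and the proposed helpers; the real subclasses would follow the constructors in `single_type_kv_cache_manager.py`.

```python
# Sketch only: ownership structure of the composite manager (section 2). Real
# subclasses would inherit SingleTypeKVCacheManager / SlidingWindowManager.
class HybridSSMManager:
    """Allocates one SSM state block per request and resolves its offset."""

    def __init__(self, spec, block_pool):
        self.spec = spec              # a HybridSSMSpec
        self.block_pool = block_pool  # shared vLLM BlockPool
        self.req_to_ssm_block = {}    # request_id -> KVCacheBlock

    def get_blocks(self, request_id):
        return self.req_to_ssm_block.get(request_id)

    def get_state_ptrs_or_offsets(self, request_id) -> int:
        # Byte offset of this request's state region in the SSM KV tensor.
        block = self.req_to_ssm_block[request_id]
        return block.block_id * self.spec.page_size_bytes


class HybridSlidingWindowManager:
    """Pairs a sliding-window KV group with its HybridSSMManager."""

    def __init__(self, ssm_manager: HybridSSMManager):
        self.ssm_manager = ssm_manager
        # Real subclass: super().__init__(...) with SlidingWindowManager args
        # plus bookkeeping of the paired KV cache group IDs.

    def resolve(self, request_id, kv_blocks):
        # Convenience: attention KV blocks + the request's SSM state block.
        return kv_blocks, self.ssm_manager.get_blocks(request_id)
```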

## 3. State-update kernel that absorbs evicted blocks

- Add a Triton kernel for state updates (e.g., `hybrid_ssm_state_update.py`) under `vllm/model_executor/layers/mamba/ops/` (or a nearby attention-ops directory):
- Pattern its structure after `_state_passing_fwd_kernel` / `_state_passing_fwd` from `mamba/ops/ssd_state_passing.py`.
- Inputs per request/head:
- Pointer(s) to the current SSM state in the SSM KV block.
- Views onto K/V (or projected summaries) for the evicted blocks.
- Optional decay parameters `(A, Δt)` or precomputed coefficients for the SSM.
- Implement update rule:
- For each head, compute `state_new = f(state_old, KV_evicted)` (e.g., `exp(A * Δt) * state_old + encode(K, V)` for linear SSMs) and write back to the SSM block; a Python reference of this rule is sketched after this section.
- Ensure the kernel supports batched heads and multiple evicted blocks at once for good utilization.
- Implement a host-side wrapper `update_hybrid_ssm_state_from_evicted_blocks(...)` in an appropriate Python module (e.g., `vllm/model_executor/layers/mamba/hybrid_ssm_utils.py`):
- Accept `(request_id, evicted_block_ids, ssm_block, model_params)`.
- Use the `KVCacheTensor` layout to construct device views/slices for K and V corresponding to `evicted_block_ids`.
- Marshal SSM state pointers from `HybridSSMManager` and pass them, along with hyperparameters and launch configuration, to the Triton kernel.
- Return any metadata needed for debugging or testing (e.g., number of tokens/blocks processed).
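
A Python reference of the update rule above (`state_new = exp(A * Δt) * state_old + encode(K, V)`); the outer-product `encode` is one concrete option, not fixed by this plan, and the function doubles as the ground truth for the kernel tests in section 7.

```python
# Python reference for the eviction-time state update (section 3). Shapes are
# per request: state [H, D, N]; K/V of the evicted tokens [T, H, D] / [T, H, N]
# (V already projected to the state rank). encode() is a K/V outer product
# here; the real kernel may use a different, learned encoding.
import torch


def reference_state_update(state_old: torch.Tensor,   # [H, D, N]
                           k_evicted: torch.Tensor,   # [T, H, D]
                           v_evicted: torch.Tensor,   # [T, H, N]
                           A: torch.Tensor,           # [H] negative decay rates
                           dt: torch.Tensor) -> torch.Tensor:
    decay = torch.exp(A * dt)                                  # [H]
    # Absorb all evicted tokens at once: sum_t k_t v_t^T per head.
    encoded = torch.einsum("thd,thn->hdn", k_evicted, v_evicted)
    return decay[:, None, None] * state_old + encoded
```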

## 4. Hook KV eviction into SSM state update

- Modify `HybridSlidingWindowManager.remove_skipped_blocks` in `vllm/v1/core/single_type_kv_cache_manager.py`:
- Determine which concrete block IDs for the sliding-window group will be freed for a given `request_id` and `num_computed_tokens`.
- Before freeing them:
- Collect the list of evicted `KVCacheBlock` IDs and the associated group/layer context.
- Call into `HybridSSMManager.update_state_from_evicted_blocks(request_id, evicted_block_ids, ...)`, which internally invokes `update_hybrid_ssm_state_from_evicted_blocks`.
- Only after the state-update path completes, replace entries in `req_to_blocks` with `null_block` and free them via `block_pool.free_blocks(removed_blocks)`.
- Make the eviction/state-update logic group-aware:
- Track a mapping from sliding-window KV cache group IDs to their corresponding `HybridSSMManager` instances.
- Ensure multi-group scenarios (e.g., different attention types or partitions) route the correct evicted blocks to the matching SSM manager.
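
A sketch of the ordering constraint above, written as the body of `HybridSlidingWindowManager.remove_skipped_blocks`; `req_to_blocks`, `null_block`, and `block_pool.free_blocks` are the names this plan already references, while `_evictable_indices` is a hypothetical helper.

```python
# Sketch only: eviction ordering for HybridSlidingWindowManager (section 4).
# _evictable_indices is a hypothetical helper answering "which slots fall
# outside the sliding window at num_computed_tokens".
def remove_skipped_blocks(self, request_id: str, num_computed_tokens: int):
    blocks = self.req_to_blocks[request_id]
    indices = self._evictable_indices(blocks, num_computed_tokens)
    removed = [blocks[i] for i in indices]
    if not removed:
        return
    # 1. Absorb the evicted blocks into the SSM state *before* freeing them.
    self.ssm_manager.update_state_from_evicted_blocks(
        request_id, [b.block_id for b in removed])
    # 2. Only then null out the slots and return the blocks to the pool.
    for i in indices:
        blocks[i] = self.block_pool.null_block
    self.block_pool.free_blocks(removed)
```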

## 5. Expose SSM state to the unified attention kernel

- Extend the unified attention Triton kernel (e.g., `kernel_unified_attention_2d` in `vllm/attention/ops/triton_unified_attention.py`) to integrate SSM directly:
- Reserve a convention in the block tables, such as using block index `0` as the SSM state block for each sequence and starting real KV history blocks at index `1`.
- At kernel entry, for each sequence/head:
- Load the corresponding SSM state from the SSM block via the block table and/or explicit offsets.
- Incorporate the SSM contribution either as extra sink-like contributions in the attention computation or as an additive term in the value/output accumulation, aligning with Phase 1 behavior but using KV-embedded state.
- Add any required kernel arguments (e.g., SSM projection weights, decay coefficients, offsets) and ensure launch-side code passes them correctly.
- Update the hybrid attention implementation (e.g., `HybridAttentionImpl` or a new `KVEmbeddedHybridAttentionImpl`) in the model executor:
- Remove dependence on a separate SSM state pool; instead, rely on the block table and `HybridSSMManager`-managed SSM block to provide state.
- Keep existing K/V reshape and cache-write logic (e.g., `triton_reshape_and_cache_flash`) but update it as needed to respect the reserved SSM block index convention.
- Ensure compatibility with both Phase 1 and Phase 2 paths so the same model configuration can select behavior via flags.
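
A small host-side sketch of the block-table convention proposed above (slot 0 reserved for the SSM state block, real history blocks starting at slot 1); the kernel-side loads are not shown.

```python
# Sketch only: per-sequence block-table row under the section 5 convention,
# where slot 0 holds the SSM state block and KV history starts at slot 1.
import torch


def build_hybrid_block_table_row(ssm_block_id: int,
                                 kv_block_ids: list[int],
                                 max_blocks_per_seq: int) -> torch.Tensor:
    row = torch.zeros(max_blocks_per_seq, dtype=torch.int32)
    row[0] = ssm_block_id                      # reserved SSM state slot
    n = min(len(kv_block_ids), max_blocks_per_seq - 1)
    row[1:1 + n] = torch.tensor(kv_block_ids[:n], dtype=torch.int32)
    return row
```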

## 6. Configuration, flags, and migration

- Add a configuration flag (e.g., `enable_hybrid_kv_embedded_ssm: bool = False`) in the appropriate config class (`VllmConfig` or `CacheConfig`):
- Use this flag to decide whether to instantiate `HybridSlidingWindowManager` + `HybridSSMManager` and to enable SSM-aware kernel arguments.
- Keep the existing Phase 1 path (separate SSM pool, separate adapter/kernel) as the default when the flag is `False`.
- Wire the flag through model construction and engine initialization:
- Ensure KV cache spec creation, manager selection, and attention implementation respect the flag.
- Add documentation comments to the config explaining Phase 1 vs Phase 2 behavior and any compatibility constraints.
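
A sketch of the flag and the manager-selection point it gates; placing the field on `CacheConfig` is an assumption, and the stand-in dataclass below only illustrates the added field.

```python
# Sketch only: Phase 1 vs Phase 2 gating (section 6). CacheConfigPatch is an
# illustrative stand-in for the field added to CacheConfig; class names in the
# branch are the ones proposed earlier in this plan.
from dataclasses import dataclass


@dataclass
class CacheConfigPatch:
    enable_hybrid_kv_embedded_ssm: bool = False  # Phase 2 off by default


def select_kv_manager_name(cache_config) -> str:
    if getattr(cache_config, "enable_hybrid_kv_embedded_ssm", False):
        return "HybridSlidingWindowManager"  # Phase 2: KV-embedded SSM state
    return "SlidingWindowManager"            # Phase 1: separate SSM pool
```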

## 7. Testing and validation for Phase 2

- Low-level correctness tests:
- Add unit tests for the state-update kernel and its Python wrapper:
- Construct small toy SSM states and synthetic K/V blocks; compare kernel output against a Python reference SSM update.
- Add tests for `HybridSlidingWindowManager.remove_skipped_blocks`:
- Simulate sliding-window progression, verify that:
- The state-update path is invoked with the correct evicted block IDs and group context.
- The SSM block content changes as expected (within numerical tolerances).
- Freed blocks are correctly returned to `block_pool` and `req_to_blocks` is updated.
- End-to-end behavior tests:
- Add or extend tests comparing Phase 1 (separate SSM pool) vs Phase 2 (KV-embedded SSM) on the same hybrid models and prompts:
- Check outputs, log-probs, and long-range behavior equivalence within acceptable numerical tolerances.
- Stress tests:
- Very long contexts with aggressive sliding windows to ensure history compression behaves correctly.
- High concurrency workloads to confirm no deadlocks and acceptable latency impact from eviction-driven updates.
- Performance and memory evaluation:
- Add benchmarks or profiling scripts to measure:
- Overhead of eviction-triggered state updates vs Phase 1.
- Overall KV usage and any memory savings due to tighter integration.
- Iterate on kernel launch parameters, block sizes, and SSM block layout (e.g., alignment, vectorization) to keep the overhead low while maintaining numerical fidelity.
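
A minimal pytest-style check for the state-update path, written against the Python reference sketched in section 3; in the real suite, one side of the comparison would be the output of `update_hybrid_ssm_state_from_evicted_blocks`.

```python
# Minimal check for the state-update path (section 7), using the Python
# reference_state_update from the section 3 sketch. The real test compares the
# Triton wrapper's output against this reference within tolerances.
import torch


def test_state_update_decay_and_encode():
    torch.manual_seed(0)
    H, D, N, T = 4, 8, 16, 6
    state_old = torch.randn(H, D, N)
    k = torch.randn(T, H, D)
    v = torch.randn(T, H, N)
    A = -torch.rand(H)                       # negative decay rates
    encoded = torch.einsum("thd,thn->hdn", k, v)

    # With dt = 0 the old state is preserved exactly, plus the encoded term.
    no_decay = reference_state_update(state_old, k, v, A, torch.tensor(0.0))
    torch.testing.assert_close(no_decay, state_old + encoded)

    # With dt > 0 the old state is scaled down before the encoded term is added.
    updated = reference_state_update(state_old, k, v, A, torch.tensor(0.5))
    assert updated.shape == state_old.shape
```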

### To-dos

- [ ] Define `HybridSSMSpec` and integrate it into KV cache grouping and config utilities.
- [ ] Implement `HybridSSMManager` and composite `HybridSlidingWindowManager` with proper group and request mappings.
- [ ] Add the Triton state-update kernel and Python wrapper to update SSM state from evicted KV blocks.
- [ ] Modify composite sliding-window eviction logic to invoke the SSM state-update path before freeing blocks.
- [ ] Extend unified attention kernels and hybrid attention implementation to read SSM state from KV-embedded blocks.
- [ ] Add configuration flag to toggle KV-embedded SSM and wire it through initialization paths.
- [ ] Implement unit, integration, and performance tests comparing Phase 1 vs Phase 2 behavior and costs.