Changes from all commits (42 commits):
7a297da  Add ARCHITECTURE.md (RGBmarya, Nov 27, 2025)
a1b81db  Merge branch 'main' of https://github.com/RGBmarya/vllm (RGBmarya, Nov 27, 2025)
cfab2ca  feat: part 1 (RGBmarya, Nov 27, 2025)
4133f5b  refactor: use ungated model (RGBmarya, Nov 27, 2025)
1f3774e  fix: init issues (RGBmarya, Nov 27, 2025)
35cbb96  add benchmarks (RGBmarya, Nov 27, 2025)
8e5a2a2  fix: test misalignment (RGBmarya, Nov 27, 2025)
7aba95a  feat: integrate real Mamba kernel into HybridAttention (RGBmarya, Nov 28, 2025)
cf23e82  test fixes (RGBmarya, Nov 28, 2025)
7d5d502  fix: add dist_init (RGBmarya, Nov 28, 2025)
9ea5fe4  fix: add missing triton_metadata attr (RGBmarya, Nov 28, 2025)
d24d7e8  feat: add tests (RGBmarya, Nov 28, 2025)
437a10c  fix: __dict__ (RGBmarya, Nov 28, 2025)
2098bf7  chore: add CPU offloading (RGBmarya, Nov 28, 2025)
8873038  test: add synthetic video test (RGBmarya, Nov 28, 2025)
38cca57  refactor: change QA video test to synthetic (RGBmarya, Nov 28, 2025)
bd41fe0  works?? (RGBmarya, Nov 28, 2025)
d62e285  add new test (RGBmarya, Nov 28, 2025)
c90043d  fix import issue (RGBmarya, Nov 28, 2025)
3c8231d  add paper (RGBmarya, Nov 28, 2025)
a757376  feat: llama-3.2 with hybrid attention (RGBmarya, Dec 1, 2025)
69d854b  fix: llama (RGBmarya, Dec 1, 2025)
475dc3e  potential fixes (RGBmarya, Dec 1, 2025)
4b0e277  fix: visualization (RGBmarya, Dec 1, 2025)
c44777f  chore: results (Dec 1, 2025)
044d143  fix: hybrid benchmarks (RGBmarya, Dec 1, 2025)
afa9d1a  feat: add video benchmark (RGBmarya, Dec 1, 2025)
44dc187  fix: CUDA usage (RGBmarya, Dec 1, 2025)
bdaa3fa  fix: premature CUDA init (RGBmarya, Dec 1, 2025)
cc1f313  results 2 (Dec 1, 2025)
9b30a98  feat: add visualization (RGBmarya, Dec 1, 2025)
785bf28  test: 48 frames (Dec 1, 2025)
dcf8eb1  feat: add streaming benchmark (RGBmarya, Dec 1, 2025)
4e60821  50 it results (Dec 9, 2025)
070bf07  Merge remote-tracking branch 'refs/remotes/origin/main' (Dec 9, 2025)
5236b54  results (Dec 9, 2025)
2956991  results (Dec 10, 2025)
b3454ef  evaluation procedure (RGBmarya, Dec 10, 2025)
ec5a39c  various results (RGBmarya, Dec 10, 2025)
183cfd1  fix: python path (RGBmarya, Dec 10, 2025)
345620f  eval results (Dec 10, 2025)
f09a802  refactor: frame batch size 1 (RGBmarya, Dec 10, 2025)

156 changes: 156 additions & 0 deletions .cursor/plans/hy-666d8271.plan.md
@@ -0,0 +1,156 @@
<!-- 666d8271-59ea-461e-b7bd-440e94fd7c3a 0fde745b-265b-4516-9b96-a199c5d286b0 -->
# Hybrid SSM + Sliding-Window KV: Implementation Plan

### 1. Solidify high-level architecture

- **Goal**: Combine precise sliding-window attention over recent tokens with a compressed SSM state that summarizes the distant past, while reusing existing vLLM components.
- **Key decisions**:
- **SSM state storage**: Reuse the existing Mamba-style KV pool (`MambaSpec` / `MambaManager`) as a separate state pool, rather than packing SSM state into normal KV blocks (phase 2 option).
- **Attention compute**: Keep standard paged/sliding-window attention kernels (Triton unified attention) untouched and fuse SSM only at the Python backend level.
- **Minimal surface changes**: Add a new SSM adapter module, a hybrid attention backend/impl, and a hybrid attention layer; avoid changing `KVCacheManager`, `BlockPool`, or CUDA kernels.

### 2. Design and implement an SSM adapter (history branch)

- **2.1. API and placement**
- Add a new module `vllm/model_executor/layers/hybrid_ssm_adapter.py` (or adjacent to `mamba_mixer.py`).
- Expose a small class, e.g. `HybridSSMAdapter`, with methods (a skeleton is sketched at the end of this section):
- `get_kv_cache_spec(vllm_config: VllmConfig) -> KVCacheSpec | None` (returns a `MambaSpec` or a thin wrapper) so it can obtain its own KV pool if needed.
- `get_state_shape()` / `get_state_dtype()` if using `MambaBase` inheritance.
- `forward_history_branch_prefill(hidden_states, attn_metadata) -> torch.Tensor`.
- `forward_history_branch_decode(hidden_states, attn_metadata) -> torch.Tensor`.
- **2.2. Reuse Mamba SSM flows**
- Use `MambaMixer.forward_cuda` (`vllm/model_executor/layers/mamba/mamba_mixer.py`) as a reference for:
- How to split prefill vs decode tokens using `Mamba1AttentionMetadata`.
- How to wire `causal_conv1d_fn`, `selective_scan_fn`, `selective_state_update` and state indices.
- For **prefill**:
- Implement `forward_history_branch_prefill` that:
- Takes a contiguous prompt segment (from `hidden_states` and `attn_metadata.query_start_loc`) and runs the SSM scan path (`selective_scan_fn` and associated Triton kernels in `ssd_chunk_state.py` / `ssd_state_passing.py`).
- Writes the resulting SSM state into `self.kv_cache` (same pattern as `MambaMixer.kv_cache`).
- Optionally returns a per-token SSM output if you want SSM to influence prompt outputs.
- For **decode**:
- Implement `forward_history_branch_decode` that:
- Uses `selective_state_update` and `ssd_state_passing` to apply one or a few recurrent state updates per decode token, based on `Mamba1AttentionMetadata` indices.
- Produces `ssm_out` with shape `[num_tokens, num_heads, head_dim]` aligned with the Triton attention output.
- Updates `self.kv_cache` state in-place.
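
A minimal skeleton of the adapter surface described in 2.1 is sketched below. It assumes the method names above; the real class would likely inherit `MambaBase` and reuse `MambaMixer`'s kernel wiring instead of the placeholder bodies shown here.

```python
# Sketch only: proposed vllm/model_executor/layers/hybrid_ssm_adapter.py.
# VllmConfig / KVCacheSpec / MambaSpec come from vLLM; the method bodies are
# placeholders for the kernel wiring described in 2.2.
import torch
import torch.nn as nn


class HybridSSMAdapter(nn.Module):
    """History branch: compresses distant context into a recurrent SSM state."""

    def __init__(self, num_heads: int, head_dim: int, state_size: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.state_size = state_size
        # Bound by the KV cache manager, mirroring MambaMixer.kv_cache.
        self.kv_cache: tuple[torch.Tensor, ...] = ()

    def get_kv_cache_spec(self, vllm_config) -> "KVCacheSpec | None":
        # Return a MambaSpec (or thin wrapper) so this module gets its own
        # state pool managed by MambaManager; None disables the extra pool.
        raise NotImplementedError

    def get_state_shape(self) -> tuple[int, ...]:
        return (self.num_heads, self.head_dim, self.state_size)

    def get_state_dtype(self) -> torch.dtype:
        return torch.float32

    def forward_history_branch_prefill(self, hidden_states, attn_metadata):
        # Chunked scan over the prompt (selective_scan_fn path); writes the
        # final state into self.kv_cache, optionally returns per-token output.
        raise NotImplementedError

    def forward_history_branch_decode(self, hidden_states, attn_metadata):
        # One recurrent update per decode token (selective_state_update path);
        # returns ssm_out of shape [num_tokens, num_heads, head_dim].
        raise NotImplementedError
```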

### 3. Implement `HybridAttentionImpl` on top of Triton attention

- **3.1. Backend scaffolding**
- Create `vllm/v1/attention/backends/hybrid_attn.py`.
- Define `HybridAttentionMetadata` as a thin alias or reuse `TritonAttentionMetadata` from `triton_attn.py`.
- Implement `HybridAttentionBackend(AttentionBackend)` similar to `TritonAttentionBackend`:
- `get_builder_cls() -> TritonAttentionMetadataBuilder` (reuse unchanged).
- `get_impl_cls() -> HybridAttentionImpl`.
- `get_name()` and feature flags (supported dtypes, kv_cache_dtypes, cascade support=false, etc.).

- **3.2. HybridAttentionImpl.forward**
- Model it after `TritonAttentionImpl.forward` (`vllm/v1/attention/backends/triton_attn.py`):
- Hold an internal `TritonAttentionImpl` instance constructed with the same constructor args.
- Implement `forward` as follows (a sketch follows at the end of this section):

1. **Sliding-window path**:

- Call `self.triton_impl.forward(layer, query, key, value, kv_cache, attn_metadata, output, ...)` to:
- Write K/V into the standard paged KV cache via `triton_reshape_and_cache_flash`.
- Call `unified_attention(...)` (`vllm/attention/ops/triton_unified_attention.py`), including the `window_size` (sliding window) and `block_table`.
- At this stage `output[:num_actual_tokens]` contains the sliding-window attention result.

2. **SSM history path**:

- Call the adapter, e.g. `ssm_out = self.ssm_adapter.forward_history_branch_decode(query_or_hidden_states, attn_metadata)` for decode, and similar for prefill if desired.
- Ensure `ssm_out` is indexed over the same flattened token set as `output` (use `attn_metadata.num_actual_tokens`, `query_start_loc`, etc.).

3. **Fusion**:

- Add the SSM contribution into the output:
- `output[:num_actual_tokens] += ssm_out[:num_actual_tokens]`.
- Return `output`.

- **3.3. Constructor wiring**
- Modify the `__init__` of `HybridAttentionImpl` to:
- Accept either a `HybridSSMAdapter` or construct it from the layer (e.g. through a `layer.ssm_adapter` reference).
- Mirror key fields: `num_heads`, `head_size`, `num_kv_heads`, `scale`, `sliding_window`, `kv_cache_dtype`, etc.
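
A sketch of the forward flow in 3.2, wrapping an internal `TritonAttentionImpl`. Argument names follow this plan rather than the exact vLLM signatures, and the reshape at the end only accounts for `ssm_out` being per-head while `output` is flattened.

```python
# Sketch only: core of HybridAttentionImpl.forward (3.2). Argument names follow
# this plan, not necessarily the exact TritonAttentionImpl signature.
import torch


class HybridAttentionImpl:
    def __init__(self, triton_impl, ssm_adapter):
        self.triton_impl = triton_impl    # TritonAttentionImpl, same ctor args
        self.ssm_adapter = ssm_adapter    # HybridSSMAdapter from section 2

    def forward(self, layer, query, key, value, kv_cache, attn_metadata,
                output: torch.Tensor) -> torch.Tensor:
        num_actual_tokens = attn_metadata.num_actual_tokens

        # 1. Sliding-window path: paged KV write + unified_attention; leaves
        #    the windowed result in output[:num_actual_tokens].
        self.triton_impl.forward(layer, query, key, value, kv_cache,
                                 attn_metadata, output=output)

        # 2. SSM history path: recurrent contribution over the same tokens.
        ssm_out = self.ssm_adapter.forward_history_branch_decode(
            query, attn_metadata)

        # 3. Fusion: additive combination, flattening heads if needed.
        output[:num_actual_tokens] += ssm_out[:num_actual_tokens].reshape(
            num_actual_tokens, -1)
        return output
```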

### 4. Define `HybridAttentionLayer` that exposes KV spec and backend

- **4.1. Layer class**
- Add `vllm/model_executor/layers/hybrid_attn_layer.py` that implements `AttentionLayerBase`:
- Inherit from `torch.nn.Module` and `AttentionLayerBase`.
- Contain standard Q/K/V projection modules and any extra weights needed.
- Own a `HybridSSMAdapter` instance.
- **4.2. KV cache spec for sliding-window KV**
- Implement `get_kv_cache_spec(self, vllm_config: VllmConfig) -> KVCacheSpec | None` using `SlidingWindowSpec` from `vllm/v1/kv_cache_interface.py`:
- Use `vllm_config.cache_config.block_size`.
- Use `model_config.get_num_kv_heads`, `model_config.get_head_size`, `model_config.dtype`.
- Set `sliding_window=self.sliding_window`.
- This keeps all sliding-window KV behavior in place and uses `SlidingWindowManager` in `single_type_kv_cache_manager.py`.
- **4.3. Backend selection**
- Implement `get_attn_backend(self) -> type[AttentionBackend]` to return `HybridAttentionBackend` (see the sketch after this section).
- Ensure that the model’s layer registration (in the model implementation) uses `HybridAttentionLayer` instead of a plain attention layer for the desired blocks.
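
A sketch of 4.2/4.3; the `SlidingWindowSpec` keyword names below follow this plan and should be checked against `vllm/v1/kv_cache_interface.py` before use.

```python
# Sketch only: KV-spec and backend hooks for HybridAttentionLayer (4.2/4.3).
# Keyword names are assumptions to be verified against SlidingWindowSpec.
import torch.nn as nn


class HybridAttentionLayer(nn.Module):
    def __init__(self, model_config, cache_config, sliding_window: int):
        super().__init__()
        self.model_config = model_config
        self.sliding_window = sliding_window
        # Q/K/V projections and the HybridSSMAdapter are owned here (4.1).

    def get_kv_cache_spec(self, vllm_config):
        from vllm.v1.kv_cache_interface import SlidingWindowSpec
        return SlidingWindowSpec(
            block_size=vllm_config.cache_config.block_size,
            num_kv_heads=self.model_config.get_num_kv_heads(
                vllm_config.parallel_config),
            head_size=self.model_config.get_head_size(),
            dtype=self.model_config.dtype,
            sliding_window=self.sliding_window,
        )

    def get_attn_backend(self):
        # Proposed backend from section 3.
        from vllm.v1.attention.backends.hybrid_attn import HybridAttentionBackend
        return HybridAttentionBackend
```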

### 5. Wire into ModelRunner and KV cache manager

- **5.1. KV cache spec collection**
- Confirm that `gpu_model_runner.get_kv_cache_spec` already discovers your new layer:
```python
# vllm/v1/worker/gpu_model_runner.py
if spec := attn_module.get_kv_cache_spec(self.vllm_config):
kv_cache_spec[layer_name] = spec
```

- For layers that should share SSM state across blocks or use a separate SSM pool, decide whether the adapter is:
- Embedded inside `HybridAttentionLayer` (per-layer SSM state), or
- A shared Mamba-style module referenced via KV-sharing if needed.

- **5.2. KV grouping and managers**
- Let `kv_cache_utils.get_kv_cache_groups` and `get_kv_cache_configs` build groups normally:
- One group for sliding-window attention (using `SlidingWindowSpec`).
- One group for SSM state if you expose it as a `MambaSpec` group from `HybridSSMAdapter` / `MambaBase`.
- `SingleTypeKVCacheManager` will then create:
- `SlidingWindowManager` for attention KV.
- `MambaManager` for SSM state.
- No modifications required initially to `KVCacheManager` or `single_type_kv_cache_manager.py`.

### 6. Integration into a specific model

- **6.1. Choose where to introduce hybrid layers**
- Decide whether to:
- Replace all attention blocks with `HybridAttentionLayer`, or
- Use it only in a subset (e.g., every N-th layer or only later layers) for experimentation.
- **6.2. Modify the model definition**
- In the relevant model file under `vllm/model_executor/models/`, swap the attention class:
- Replace standard attention layers with `HybridAttentionLayer` where desired.
- Pass in `sliding_window` (and any SSM hyperparameters: state size, ranks, etc.) from config.
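
A sketch of the layer-selection choice in 6.1/6.2, using a hypothetical `hybrid_every_n` knob; `StandardAttentionLayer` is a stand-in for the model's existing attention class.

```python
# Sketch only: per-layer selection of hybrid vs standard attention (section 6).
# hybrid_every_n is a hypothetical experiment knob; StandardAttentionLayer is a
# stand-in for the model's existing attention class.
def build_attention_layer(layer_idx: int, config, hybrid_every_n: int = 4):
    use_hybrid = hybrid_every_n > 0 and layer_idx % hybrid_every_n == 0
    if use_hybrid:
        return HybridAttentionLayer(
            model_config=config.model_config,
            cache_config=config.cache_config,
            sliding_window=config.sliding_window,
            # plus SSM hyperparameters (state size, ranks, ...) from config
        )
    return StandardAttentionLayer(config)
```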

### 7. Testing & validation plan

- **7.1. Unit tests for SSM adapter**
- Add tests under `tests/model_executor/mamba/` or a new `tests/model_executor/hybrid_attn/` to verify:
- State shape/dtype match between adapter and `MambaStateShapeCalculator`.
- `forward_history_branch_prefill` produces identical results to MambaMixer for a toy sequence.
- `forward_history_branch_decode` updates state correctly and is consistent with sequential scan.
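
A self-contained reference (no vLLM imports) for the third check in 7.1: for a toy diagonal linear SSM, a chunked scan that carries state across chunks must match the token-by-token recurrence. In the real test, the chunked side is replaced by adapter prefill (first chunk) plus decode (remaining tokens).

```python
# Self-contained reference for the scan-vs-recurrence consistency check (7.1),
# using a toy diagonal linear SSM: s_t = a * s_{t-1} + B x_t, y_t = C s_t.
import torch


def sequential_scan(x, a, B, C):
    # x: [T, d_in]; a: [d_state] decay; B: [d_state, d_in]; C: [d_out, d_state]
    state = torch.zeros(a.shape[0], dtype=x.dtype)
    ys = []
    for t in range(x.shape[0]):
        state = a * state + B @ x[t]
        ys.append(C @ state)
    return torch.stack(ys), state


def chunked_scan(x, a, B, C, chunk: int = 4):
    # Processes the sequence chunk by chunk, carrying the state across chunks.
    # In the real test this is adapter prefill (first chunk) + decode (rest).
    state = torch.zeros(a.shape[0], dtype=x.dtype)
    ys = []
    for start in range(0, x.shape[0], chunk):
        for t in range(start, min(start + chunk, x.shape[0])):
            state = a * state + B @ x[t]
            ys.append(C @ state)
    return torch.stack(ys), state


def test_chunked_matches_sequential():
    torch.manual_seed(0)
    T, d_in, d_state, d_out = 13, 8, 16, 8
    x = torch.randn(T, d_in)
    a = torch.rand(d_state) * 0.9          # stable decays in (0, 0.9)
    B = torch.randn(d_state, d_in) * 0.1
    C = torch.randn(d_out, d_state) * 0.1
    y_seq, s_seq = sequential_scan(x, a, B, C)
    y_chk, s_chk = chunked_scan(x, a, B, C, chunk=5)
    torch.testing.assert_close(y_chk, y_seq)
    torch.testing.assert_close(s_chk, s_seq)
```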

- **7.2. End-to-end correctness (small models)**
- For a small model (e.g., LLaMA‑like or synthetic hybrid model):
- Compare outputs between:
- Standard full attention.
- Sliding-window only.
- Hybrid SSM + sliding-window (with SSM disabled / weights zeroed to sanity-check fusion).
- Confirm that enabling SSM branch changes outputs but disabling it recovers sliding-window behavior.

- **7.3. Performance and GPU memory**
- Benchmark end-to-end throughput using existing scripts, e.g. `benchmarks/benchmark_serving.py` or `benchmarks/benchmark_throughput.py`, on:
- Long-context prompts vs short prompts.
- Different window sizes and SSM state sizes.
- Verify that:
- Paged KV usage (`kv_cache_manager.usage`) is consistent with pure sliding-window.
- Additional SSM state pool fits within the memory budget defined by `MambaSpec.max_memory_usage_bytes`.

### To-dos

- [ ] Design and implement HybridSSMAdapter reusing Mamba SSM kernels for prefill and decode history branches.
- [ ] Create HybridAttentionBackend and HybridAttentionImpl that wrap TritonAttentionImpl and fuse SSM outputs into sliding-window attention outputs.
- [ ] Implement HybridAttentionLayer that exposes a SlidingWindowSpec KV cache spec and selects HybridAttentionBackend as its attention backend.
- [ ] Ensure gpu_model_runner and KVCacheManager correctly include hybrid layers and SSM state groups without changes to core managers.
- [ ] Swap selected model attention blocks to use HybridAttentionLayer with configured sliding window and SSM hyperparameters.
- [ ] Add unit and end-to-end tests plus benchmarks to validate correctness, stability, and performance of the hybrid attention path.
- [ ] Optionally design HybridSSMSpec and HybridSSMManager that store compressed history directly in KV blocks and integrate a state-update kernel into SlidingWindowManager.
114 changes: 114 additions & 0 deletions .cursor/plans/kv-1886f5f8.plan.md
@@ -0,0 +1,114 @@
<!-- 1886f5f8-c538-4179-9dac-0f0940c56205 37533ec7-3a84-451b-b31e-26d0f6b7c235 -->
# KV-Embedded Hybrid SSM – Phase 2 Plan

## Overview

- Implement a KV-embedded representation of SSM state that is updated on sliding-window eviction and read directly by attention kernels.
- Introduce a composite sliding-window+SSM manager, a state-update kernel, and unified-attention wiring while preserving the existing Phase 1 (separate SSM pool) path as a guarded fallback.

## 1. New KV cache spec for hybrid SSM state

- Define `HybridSSMSpec` in `vllm/v1/kv_cache_interface.py`:
- Subclass `KVCacheSpec` and add fields: `block_size`, `ssm_state_size`, `page_size_bytes`.
- Implement `max_memory_usage_bytes(...)` assuming one logical SSM state block per active sequence (or per KV group), consistent with how sliding-window groups account for memory.
- Document the relationship between `block_size` (tokens per block) and `ssm_state_size` (state dimension/compressed rank) and how `page_size_bytes` is derived.
- Integrate `HybridSSMSpec` into KV grouping in `vllm/v1/core/kv_cache_utils.py`:
- Extend `get_kv_cache_groups` / `get_kv_cache_configs` to accept a hybrid-SSM configuration and construct a corresponding SSM KV group.
- Decide how the SSM group aligns with existing sliding-window groups (e.g., one SSM group per sliding-window group) and ensure `max_memory_usage_bytes` includes both.
- Update `get_kv_cache_config_from_groups` to treat `HybridSSMSpec` as a first-class group alongside existing KV specs.
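
A sketch of the spec along the lines of this section; field and method names mirror the plan, and the exact base-class contract (e.g., whether `page_size_bytes` is a property and what `max_memory_usage_bytes` receives) should be taken from `vllm/v1/kv_cache_interface.py`.

```python
# Sketch only: HybridSSMSpec (section 1). The base-class contract should be
# taken from vllm/v1/kv_cache_interface.py; max_num_seqs is a simplification of
# whatever budget input the real max_memory_usage_bytes receives.
from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class HybridSSMSpec:  # would subclass KVCacheSpec
    block_size: int          # tokens represented per absorbed KV block
    num_heads: int
    head_dim: int
    ssm_state_size: int      # compressed state rank per head
    dtype: torch.dtype

    @property
    def page_size_bytes(self) -> int:
        # One SSM "page" = the per-head state matrices for one sequence.
        return (self.num_heads * self.head_dim * self.ssm_state_size
                * self.dtype.itemsize)

    def max_memory_usage_bytes(self, max_num_seqs: int) -> int:
        # One logical SSM state block per active sequence (per KV group).
        return max_num_seqs * self.page_size_bytes
```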

## 2. Composite Hybrid Sliding-Window + SSM manager

- Implement `HybridSSMManager` in `vllm/v1/core/single_type_kv_cache_manager.py`:
- Subclass `SingleTypeKVCacheManager` and parameterize it with `HybridSSMSpec`.
- Use a `BlockPool` to allocate a fixed number of SSM blocks per request (e.g., one per request per group), and track `request_id -> KVCacheBlock` mappings.
- Expose helpers:
- `get_blocks(request_id)` → current SSM block(s) for the request.
- `get_state_ptrs_or_offsets(request_id, ...)` → device pointers/offsets for SSM state regions required by kernels.
- Implement `HybridSlidingWindowManager` (composite manager) in the same module:
- Subclass `SlidingWindowManager` and hold a `HybridSSMManager` instance for the paired SSM group.
- Override/extend initialization to wire in both the sliding-window KV group and its associated SSM group, including any group IDs or indices used elsewhere.
- Provide convenience methods to resolve, for a `(kv_cache_group_id, request_id)`, both the standard KV blocks (for attention) and the SSM block (for state).
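
A structural sketch of the two managers in this section, showing only ownership and the proposed helpers; the real subclasses would follow the constructors in `single_type_kv_cache_manager.py`.

```python
# Sketch only: ownership structure of the composite manager (section 2). Real
# subclasses would inherit SingleTypeKVCacheManager / SlidingWindowManager.
class HybridSSMManager:
    """Allocates one SSM state block per request and resolves its offset."""

    def __init__(self, spec, block_pool):
        self.spec = spec              # a HybridSSMSpec
        self.block_pool = block_pool  # shared vLLM BlockPool
        self.req_to_ssm_block = {}    # request_id -> KVCacheBlock

    def get_blocks(self, request_id):
        return self.req_to_ssm_block.get(request_id)

    def get_state_ptrs_or_offsets(self, request_id) -> int:
        # Byte offset of this request's state region in the SSM KV tensor.
        block = self.req_to_ssm_block[request_id]
        return block.block_id * self.spec.page_size_bytes


class HybridSlidingWindowManager:
    """Pairs a sliding-window KV group with its HybridSSMManager."""

    def __init__(self, ssm_manager: HybridSSMManager):
        self.ssm_manager = ssm_manager
        # Real subclass: super().__init__(...) with SlidingWindowManager args
        # plus bookkeeping of the paired KV cache group IDs.

    def resolve(self, request_id, kv_blocks):
        # Convenience: attention KV blocks + the request's SSM state block.
        return kv_blocks, self.ssm_manager.get_blocks(request_id)
```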

## 3. State-update kernel that absorbs evicted blocks

- Add a Triton kernel for state updates (e.g., `hybrid_ssm_state_update.py`) under `vllm/model_executor/layers/mamba/ops/` (or a nearby attention-ops directory):
- Pattern its structure after `_state_passing_fwd_kernel` / `_state_passing_fwd` from `mamba/ops/ssd_state_passing.py`.
- Inputs per request/head:
- Pointer(s) to the current SSM state in the SSM KV block.
- Views onto K/V (or projected summaries) for the evicted blocks.
- Optional decay parameters `(A, Δt)` or precomputed coefficients for the SSM.
- Implement update rule:
- For each head, compute `state_new = f(state_old, KV_evicted)` (e.g., `exp(A * Δt) * state_old + encode(K, V)` for linear SSMs) and write back to the SSM block; a Python reference of this rule is sketched after this section.
- Ensure the kernel supports batched heads and multiple evicted blocks at once for good utilization.
- Implement a host-side wrapper `update_hybrid_ssm_state_from_evicted_blocks(...)` in an appropriate Python module (e.g., `vllm/model_executor/layers/mamba/hybrid_ssm_utils.py`):
- Accept `(request_id, evicted_block_ids, ssm_block, model_params)`.
- Use the `KVCacheTensor` layout to construct device views/slices for K and V corresponding to `evicted_block_ids`.
- Marshal SSM state pointers from `HybridSSMManager` and pass them, along with hyperparameters and launch configuration, to the Triton kernel.
- Return any metadata needed for debugging or testing (e.g., number of tokens/blocks processed).
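
A Python reference of the update rule above (`state_new = exp(A * Δt) * state_old + encode(K, V)`); the outer-product `encode` is one concrete option, not fixed by this plan, and the function doubles as the ground truth for the kernel tests in section 7.

```python
# Python reference for the eviction-time state update (section 3). Shapes are
# per request: state [H, D, N]; K/V of the evicted tokens [T, H, D] / [T, H, N]
# (V already projected to the state rank). encode() is a K/V outer product
# here; the real kernel may use a different, learned encoding.
import torch


def reference_state_update(state_old: torch.Tensor,   # [H, D, N]
                           k_evicted: torch.Tensor,   # [T, H, D]
                           v_evicted: torch.Tensor,   # [T, H, N]
                           A: torch.Tensor,           # [H] negative decay rates
                           dt: torch.Tensor) -> torch.Tensor:
    decay = torch.exp(A * dt)                                  # [H]
    # Absorb all evicted tokens at once: sum_t k_t v_t^T per head.
    encoded = torch.einsum("thd,thn->hdn", k_evicted, v_evicted)
    return decay[:, None, None] * state_old + encoded
```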

## 4. Hook KV eviction into SSM state update

- Modify `HybridSlidingWindowManager.remove_skipped_blocks` in `vllm/v1/core/single_type_kv_cache_manager.py`:
- Determine which concrete block IDs for the sliding-window group will be freed for a given `request_id` and `num_computed_tokens`.
- Before freeing them:
- Collect the list of evicted `KVCacheBlock` IDs and the associated group/layer context.
- Call into `HybridSSMManager.update_state_from_evicted_blocks(request_id, evicted_block_ids, ...)`, which internally invokes `update_hybrid_ssm_state_from_evicted_blocks`.
- Only after the state-update path completes, replace entries in `req_to_blocks` with `null_block` and free them via `block_pool.free_blocks(removed_blocks)`.
- Make the eviction/state-update logic group-aware:
- Track a mapping from sliding-window KV cache group IDs to their corresponding `HybridSSMManager` instances.
- Ensure multi-group scenarios (e.g., different attention types or partitions) route the correct evicted blocks to the matching SSM manager.
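
A sketch of the ordering constraint above, written as the body of `HybridSlidingWindowManager.remove_skipped_blocks`; `req_to_blocks`, `null_block`, and `block_pool.free_blocks` are the names this plan already references, while `_evictable_indices` is a hypothetical helper.

```python
# Sketch only: eviction ordering for HybridSlidingWindowManager (section 4).
# _evictable_indices is a hypothetical helper answering "which slots fall
# outside the sliding window at num_computed_tokens".
def remove_skipped_blocks(self, request_id: str, num_computed_tokens: int):
    blocks = self.req_to_blocks[request_id]
    indices = self._evictable_indices(blocks, num_computed_tokens)
    removed = [blocks[i] for i in indices]
    if not removed:
        return
    # 1. Absorb the evicted blocks into the SSM state *before* freeing them.
    self.ssm_manager.update_state_from_evicted_blocks(
        request_id, [b.block_id for b in removed])
    # 2. Only then null out the slots and return the blocks to the pool.
    for i in indices:
        blocks[i] = self.block_pool.null_block
    self.block_pool.free_blocks(removed)
```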

## 5. Expose SSM state to the unified attention kernel

- Extend the unified attention Triton kernel (e.g., `kernel_unified_attention_2d` in `vllm/attention/ops/triton_unified_attention.py`) to integrate SSM directly:
- Reserve a convention in the block tables, such as using block index `0` as the SSM state block for each sequence and starting real KV history blocks at index `1`.
- At kernel entry, for each sequence/head:
- Load the corresponding SSM state from the SSM block via the block table and/or explicit offsets.
- Incorporate the SSM contribution either as extra sink-like contributions in the attention computation or as an additive term in the value/output accumulation, aligning with Phase 1 behavior but using KV-embedded state.
- Add any required kernel arguments (e.g., SSM projection weights, decay coefficients, offsets) and ensure launch-side code passes them correctly.
- Update the hybrid attention implementation (e.g., `HybridAttentionImpl` or a new `KVEmbeddedHybridAttentionImpl`) in the model executor:
- Remove dependence on a separate SSM state pool; instead, rely on the block table and `HybridSSMManager`-managed SSM block to provide state.
- Keep existing K/V reshape and cache-write logic (e.g., `triton_reshape_and_cache_flash`) but update it as needed to respect the reserved SSM block index convention.
- Ensure compatibility with both Phase 1 and Phase 2 paths so the same model configuration can select behavior via flags.
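
A small host-side sketch of the block-table convention proposed above (slot 0 reserved for the SSM state block, real history blocks starting at slot 1); the kernel-side loads are not shown.

```python
# Sketch only: per-sequence block-table row under the section 5 convention,
# where slot 0 holds the SSM state block and KV history starts at slot 1.
import torch


def build_hybrid_block_table_row(ssm_block_id: int,
                                 kv_block_ids: list[int],
                                 max_blocks_per_seq: int) -> torch.Tensor:
    row = torch.zeros(max_blocks_per_seq, dtype=torch.int32)
    row[0] = ssm_block_id                      # reserved SSM state slot
    n = min(len(kv_block_ids), max_blocks_per_seq - 1)
    row[1:1 + n] = torch.tensor(kv_block_ids[:n], dtype=torch.int32)
    return row
```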

## 6. Configuration, flags, and migration

- Add a configuration flag (e.g., `enable_hybrid_kv_embedded_ssm: bool = False`) in the appropriate config class (`VllmConfig` or `CacheConfig`):
- Use this flag to decide whether to instantiate `HybridSlidingWindowManager` + `HybridSSMManager` and to enable SSM-aware kernel arguments.
- Keep the existing Phase 1 path (separate SSM pool, separate adapter/kernel) as the default when the flag is `False`.
- Wire the flag through model construction and engine initialization:
- Ensure KV cache spec creation, manager selection, and attention implementation respect the flag.
- Add documentation comments to the config explaining Phase 1 vs Phase 2 behavior and any compatibility constraints.
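
A sketch of the flag and the manager-selection point it gates; placing the field on `CacheConfig` is an assumption, and the stand-in dataclass below only illustrates the added field.

```python
# Sketch only: Phase 1 vs Phase 2 gating (section 6). CacheConfigPatch is an
# illustrative stand-in for the field added to CacheConfig; class names in the
# branch are the ones proposed earlier in this plan.
from dataclasses import dataclass


@dataclass
class CacheConfigPatch:
    enable_hybrid_kv_embedded_ssm: bool = False  # Phase 2 off by default


def select_kv_manager_name(cache_config) -> str:
    if getattr(cache_config, "enable_hybrid_kv_embedded_ssm", False):
        return "HybridSlidingWindowManager"  # Phase 2: KV-embedded SSM state
    return "SlidingWindowManager"            # Phase 1: separate SSM pool
```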

## 7. Testing and validation for Phase 2

- Low-level correctness tests:
- Add unit tests for the state-update kernel and its Python wrapper:
- Construct small toy SSM states and synthetic K/V blocks; compare kernel output against a Python reference SSM update.
- Add tests for `HybridSlidingWindowManager.remove_skipped_blocks`:
- Simulate sliding-window progression, verify that:
- The state-update path is invoked with the correct evicted block IDs and group context.
- The SSM block content changes as expected (within numerical tolerances).
- Freed blocks are correctly returned to `block_pool` and `req_to_blocks` is updated.
- End-to-end behavior tests:
- Add or extend tests comparing Phase 1 (separate SSM pool) vs Phase 2 (KV-embedded SSM) on the same hybrid models and prompts:
- Check outputs, log-probs, and long-range behavior equivalence within acceptable numerical tolerances.
- Stress tests:
- Very long contexts with aggressive sliding windows to ensure history compression behaves correctly.
- High concurrency workloads to confirm no deadlocks and acceptable latency impact from eviction-driven updates.
- Performance and memory evaluation:
- Add benchmarks or profiling scripts to measure:
- Overhead of eviction-triggered state updates vs Phase 1.
- Overall KV usage and any memory savings due to tighter integration.
- Iterate on kernel launch parameters, block sizes, and SSM block layout (e.g., alignment, vectorization) to keep the overhead low while maintaining numerical fidelity.
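
A minimal pytest-style check for the state-update path, written against the Python reference sketched in section 3; in the real suite, one side of the comparison would be the output of `update_hybrid_ssm_state_from_evicted_blocks`.

```python
# Minimal check for the state-update path (section 7), using the Python
# reference_state_update from the section 3 sketch. The real test compares the
# Triton wrapper's output against this reference within tolerances.
import torch


def test_state_update_decay_and_encode():
    torch.manual_seed(0)
    H, D, N, T = 4, 8, 16, 6
    state_old = torch.randn(H, D, N)
    k = torch.randn(T, H, D)
    v = torch.randn(T, H, N)
    A = -torch.rand(H)                       # negative decay rates
    encoded = torch.einsum("thd,thn->hdn", k, v)

    # With dt = 0 the old state is preserved exactly, plus the encoded term.
    no_decay = reference_state_update(state_old, k, v, A, torch.tensor(0.0))
    torch.testing.assert_close(no_decay, state_old + encoded)

    # With dt > 0 the old state is scaled down before the encoded term is added.
    updated = reference_state_update(state_old, k, v, A, torch.tensor(0.5))
    assert updated.shape == state_old.shape
```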

### To-dos

- [ ] Define `HybridSSMSpec` and integrate it into KV cache grouping and config utilities.
- [ ] Implement `HybridSSMManager` and composite `HybridSlidingWindowManager` with proper group and request mappings.
- [ ] Add the Triton state-update kernel and Python wrapper to update SSM state from evicted KV blocks.
- [ ] Modify composite sliding-window eviction logic to invoke the SSM state-update path before freeing blocks.
- [ ] Extend unified attention kernels and hybrid attention implementation to read SSM state from KV-embedded blocks.
- [ ] Add configuration flag to toggle KV-embedded SSM and wire it through initialization paths.
- [ ] Implement unit, integration, and performance tests comparing Phase 1 vs Phase 2 behavior and costs.