
Commit 2824472

cjx0709 and jimoosciuc authored
feat(swa): support SWA mempool with paged allocation, eviction, and dual-pool scheduling (#921)
* feat: enhance SWA mempool with paged allocation, eviction, and dual-pool scheduling
  - Add SWA dual-pool architecture with full/SWA sub-allocators
  - Support paged allocation (alloc_extend/alloc_decode) for SWA hybrid KV cache
  - Add SWA eviction logic (maybe_evict_swa, _evict_swa) with overlap safety
  - Fix scheduler cache init order for hybrid models (#202)
  - Restore SWA-aware status reporting in available_and_evictable_str (#225)
  - Fix swa_protected_size_ leak via count_swa_mapped + adjust_swa_protected_size (#226)
  - Fix decode eviction interval: use min(sliding_window_size, page_size) (#227)
  - Fix mapping OOB: mapping_size = size + page_size (#231)
  - Fix evict_interval==1 edge case: x%1 is always 0, handle separately
  - Add adjust_layer_num for hybrid model memory estimation
  - Add comprehensive test coverage: paged allocator, eviction, overlap safety, scheduler cache init regression tests

Co-authored-by: jimoosciuc <33337387+jimoosciuc@users.noreply.github.com>
1 parent da29e8b commit 2824472

13 files changed

Lines changed: 1136 additions & 85 deletions


Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
# SWA Eviction Strategy

## 1. Overview

Hybrid models (e.g., MiMo-V2-Flash with 9 full-attention + 39 SWA layers) mix full-attention and sliding-window-attention (SWA) layers. Full-attention layers retain all historical KV data, while SWA layers only need the most recent `W` tokens. This document describes the dual-pool KV cache architecture and eviction strategy that exploits this difference to reduce memory usage.

### Support Status

| Cache Mode | SWA Support | Description |
|-----------|-------------|-------------|
| **ChunkCache** (`--disable-radix-cache`) | **Supported** | Per-request proactive eviction. Each request owns its KV slots; SWA slots outside the window are freed during extend/decode. |
| **RadixCache** (default) | **Not supported** | RadixCache with SWA-aware eviction (tombstone strategies, dual LRU lists) is not implemented. Hybrid models must use `--disable-radix-cache`. |
## 2. Dual-Pool Architecture

Two separate KV cache pools are maintained:

| Pool | Serves | Lifecycle | Eviction |
|------|--------|-----------|----------|
| **Full Pool** | Full-attention layers | Retains all historical KV data for the request lifetime | Freed on request completion |
| **SWA Pool** | SWA layers | Only the most recent `W` tokens are needed | Proactively freed as tokens fall outside the sliding window |

These pools are linked via `full_to_swa_index_mapping`, a numpy array that maps a full-pool index to its corresponding SWA-pool index. A mapping value of 0 means the SWA slot has been freed.

### Key Term

| Term | Definition |
|------|-----------|
| `swa_evicted_seqlen` | Per-request watermark. SWA slots in `[0, swa_evicted_seqlen)` have already been freed. Monotonically increasing. |
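The mapping-based bookkeeping can be sketched with plain numpy. This is a minimal illustration, not the real allocator: the pool sizes are toy values, and `count_swa_mapped` / `free_swa` here only mimic the documented semantics (nonzero mapping = active SWA slot, freeing zeroes the mapping).

```python
import numpy as np

# Hypothetical sizes for illustration; real pools are much larger.
FULL_POOL_SIZE = 16
full_to_swa_index_mapping = np.zeros(FULL_POOL_SIZE, dtype=np.int64)

# Allocate 4 tokens: full indices 1..4 map to SWA indices 1..4.
# (Slot 0 is reserved so that a mapping value of 0 means "freed".)
full_indices = np.array([1, 2, 3, 4])
swa_indices = np.array([1, 2, 3, 4])
full_to_swa_index_mapping[full_indices] = swa_indices

def count_swa_mapped(indices):
    """Count indices that still hold an active SWA mapping (read-only)."""
    return int(np.count_nonzero(full_to_swa_index_mapping[indices]))

def free_swa(indices):
    """Free SWA slots for the given full-pool indices and zero the mapping."""
    mapped = full_to_swa_index_mapping[indices]
    to_free = mapped[mapped != 0]  # entries already freed are 0 and are skipped
    full_to_swa_index_mapping[indices] = 0
    return to_free  # in the real allocator these return to the SWA free list

assert count_swa_mapped(full_indices) == 4
freed = free_swa(np.array([1, 2]))
assert list(freed) == [1, 2]
assert count_swa_mapped(full_indices) == 2  # only 3 and 4 remain mapped
```

Note that `free_swa` is idempotent over already-freed indices, which is why the eviction path can safely pass any slice of `req_to_token`.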
### Memory Layout Example

For MiMo-V2-Flash on TPU v6e-16 (TP=16, 1 KV head, head_dim=192+128=320, bf16):

```
Full pool:            104,704 tokens x  9 FA layers  x 640 bytes/token = ~603 MB
SWA pool:              83,840 tokens x 39 SWA layers x 640 bytes/token = ~2.1 GB
SWA held per request:    ~256 tokens (sliding_window=128 + page_size=128 alignment)
```
## 3. Allocator: `SWATokenToKVPoolAllocator`

The allocator maintains two independent sub-allocators (one for each pool) and the mapping array.

### Allocation

```
alloc(need_size):
  1. Check both pools have capacity >= need_size
  2. Allocate full_indices from full pool
  3. Allocate swa_indices from SWA pool
  4. Update mapping: full_to_swa_index_mapping[full_indices] = swa_indices
  5. Return full_indices (SWA indices are transparent to callers)
```

For paged mode (`page_size > 1`), `alloc_extend` and `alloc_decode` follow the same pattern but use page-level allocation with **atomic rollback**: if SWA allocation fails after full allocation succeeds, the full pages are rolled back to prevent partial allocation.
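The rollback behavior can be demonstrated with a toy free-list allocator. `PoolAllocator` and `dual_pool_alloc` are illustrative stand-ins, not the real classes; the point is only the all-or-nothing allocation across two pools.

```python
class PoolAllocator:
    """Minimal free-list sub-allocator (sketch)."""

    def __init__(self, size):
        self.free_slots = list(range(1, size + 1))  # slot 0 reserved as "freed"

    def alloc(self, n):
        if len(self.free_slots) < n:
            return None
        out, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return out

    def free(self, slots):
        self.free_slots.extend(slots)

def dual_pool_alloc(full_pool, swa_pool, need_size):
    """Allocate from both pools; roll back the full slots if SWA runs out."""
    full_indices = full_pool.alloc(need_size)
    if full_indices is None:
        return None
    swa_indices = swa_pool.alloc(need_size)
    if swa_indices is None:
        full_pool.free(full_indices)  # atomic rollback: no partial allocation
        return None
    return full_indices, swa_indices

full_pool, swa_pool = PoolAllocator(8), PoolAllocator(4)
assert dual_pool_alloc(full_pool, swa_pool, 4) is not None
# SWA pool is now empty: the next alloc fails and the full slots are rolled back.
assert dual_pool_alloc(full_pool, swa_pool, 2) is None
assert len(full_pool.free_slots) == 4
```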
### Freeing

| Method | What it frees | When called |
|--------|--------------|-------------|
| `free(indices)` | Both full and SWA pools | Request completion |
| `free_swa(indices)` | SWA pool only (looks up mapping, frees non-zero entries, zeroes mapping) | Per-request SWA eviction |
| `count_swa_mapped(indices)` | Nothing (read-only); counts indices with an active SWA mapping | Bookkeeping before `free_swa` |
## 4. Per-Request SWA Eviction (`_evict_swa`)

This function frees SWA slots that fall outside the sliding window from a request's `req_to_token` buffer.

### Algorithm

```
_evict_swa(req, pre_len, sliding_window_size, page_size):
  1. new_evicted = max(req.swa_evicted_seqlen, pre_len - sliding_window_size)
  2. If page_size > 1: align new_evicted down to page boundary
  3. If new_evicted <= req.swa_evicted_seqlen: return (nothing to evict)
  4. Read full-pool indices from req_to_token[swa_evicted_seqlen : new_evicted]
  5. Count actual SWA slots to free (count_swa_mapped)
  6. free_swa(those indices)
  7. Update req.swa_evicted_seqlen = new_evicted
```

### Example

With `sliding_window=128`, `page_size=256`, and `seqlen=2049`:

```
new_evicted  = max(0, 2049 - 128) = 1921
page-aligned = (1921 // 256) * 256 = 1792
Free SWA slots in [0, 1792); retain [1792, 2049) within the window.
```
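The watermark arithmetic above (steps 1-3) is self-contained and easy to check in isolation. A minimal sketch, with `compute_new_evicted` as a hypothetical helper name:

```python
def compute_new_evicted(swa_evicted_seqlen, pre_len, sliding_window_size, page_size):
    """Advance the per-request eviction watermark (the _evict_swa math, steps 1-3)."""
    new_evicted = max(swa_evicted_seqlen, pre_len - sliding_window_size)
    if page_size > 1:
        new_evicted = (new_evicted // page_size) * page_size  # align down to page
    if new_evicted <= swa_evicted_seqlen:
        return None  # nothing to evict
    return new_evicted

# The worked example above: sliding_window=128, page_size=256, seqlen=2049.
assert compute_new_evicted(0, 2049, 128, 256) == 1792
# A short sequence still inside the window: no eviction.
assert compute_new_evicted(0, 100, 128, 256) is None
# The watermark is monotonic: re-running with the same pre_len frees nothing more.
assert compute_new_evicted(1792, 2049, 128, 256) is None
```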
## 5. Extend Phase Behavior

With overlap scheduling enabled, `maybe_evict_swa` is gated by `extend_batch_idx` to prevent freeing SWA pages that a previous extend batch may still be reading on device:

| Condition | Action | Reason |
|-----------|--------|--------|
| `extend_batch_idx < 2` | Eviction **skipped** | Previous extend batch may still be executing |
| `extend_batch_idx >= 2` | Eviction proceeds with `pre_len -= chunked_prefill_size` | Safe: previous batch has completed |

This creates a one-chunk safety delay: chunk N+1 evicts chunk N-1's outdated SWA cache.

**Example** (8K tokens, chunk_size=2048, sliding_window=128, page_size=256, overlap enabled):

| Chunk | `extend_batch_idx` | Action | SWA slots freed |
|-------|-------------------|--------|-----------------|
| 1 | 0 | Skipped | 0 |
| 2 | 1 | Skipped | 0 |
| 3 | 2 | `pre_len=4096-2048=2048`, evicts `[0, 1792)` | 1792 |
| 4 | 3 | `pre_len=6144-2048=4096`, evicts `[1792, 3840)` | 2048 |

Without overlap scheduling, there is no `extend_batch_idx` gate and no `pre_len` adjustment:

| Chunk | Action | SWA slots freed |
|-------|--------|-----------------|
| 1 | `pre_len=0`, nothing to evict | 0 |
| 2 | `pre_len=2048`, evicts `[0, 1792)` | 1792 |
| 3 | `pre_len=4096`, evicts `[1792, 3840)` | 2048 |
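The no-overlap schedule in the second table can be replayed with a short simulation. `extend_eviction_schedule` is a hypothetical helper that applies the watermark math once per chunk; with overlap scheduling the same frees simply shift two chunks later.

```python
def extend_eviction_schedule(total_tokens, chunk_size, sliding_window, page_size):
    """Replay per-chunk extend eviction without overlap scheduling (sketch)."""
    evicted = 0  # the swa_evicted_seqlen watermark
    freed_per_chunk = []
    for pre_len in range(0, total_tokens, chunk_size):
        new_evicted = max(evicted, pre_len - sliding_window)
        new_evicted = (new_evicted // page_size) * page_size  # align down to page
        if new_evicted > evicted:
            freed_per_chunk.append(new_evicted - evicted)
            evicted = new_evicted
        else:
            freed_per_chunk.append(0)
    return freed_per_chunk

# Matches the table above: chunks at pre_len 0 / 2048 / 4096 free 0 / 1792 / 2048 slots.
assert extend_eviction_schedule(8192, 2048, 128, 256)[:3] == [0, 1792, 2048]
```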
## 6. Decode Phase Behavior

### Eviction Interval

```python
evict_interval = max(min(sliding_window_size, page_size), 1)
```

| Scenario | Interval | Rationale |
|----------|----------|-----------|
| `page_size >= sliding_window_size` | Every step | Window advances past a full page each step |
| `page_size < sliding_window_size` | Every `page_size` steps | Avoid partial-page eviction |
| `evict_interval == 1` | Every step | Handled by a separate `evict_interval <= 1` branch, since `x % 1 == 1` is always false and would otherwise disable eviction |
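The interval and the decode-step gate can be shown together. `should_evict` is an illustrative name for the per-request condition used in `maybe_evict_swa`'s decode branch:

```python
def evict_interval(sliding_window_size, page_size):
    # max(..., 1) keeps the interval positive; the ==1 case is then handled by a
    # separate `evict_interval <= 1` branch, because x % 1 == 1 is never true.
    return max(min(sliding_window_size, page_size), 1)

def should_evict(decode_batch_idx, interval):
    """Decode-step gate (sketch): always skip batch 0 for overlap safety."""
    return decode_batch_idx > 0 and (interval <= 1 or decode_batch_idx % interval == 1)

assert evict_interval(128, 256) == 128   # window smaller than a page
assert evict_interval(128, 64) == 64     # evict every page_size steps
assert evict_interval(1, 1) == 1
# interval == 1: evict every step after batch 0, despite x % 1 never equaling 1.
assert [should_evict(i, 1) for i in range(4)] == [False, True, True, True]
# interval == 64: triggers at decode_batch_idx 1, 65, 129, ...
assert should_evict(1, 64) and should_evict(65, 64) and not should_evict(2, 64)
```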
### Overlap Safety

| Condition | Action | Reason |
|-----------|--------|--------|
| `decode_batch_idx == 0` | Eviction **skipped** | Previous decode batch may still be reading SWA pages on device |
| `decode_batch_idx > 0` | Eviction triggers on `decode_batch_idx % evict_interval == 1` (or every step when `evict_interval <= 1`) | Safe: previous batch has completed |

## 7. Summary

| Component | Description |
|-----------|-------------|
| **Dual-pool architecture** | Full pool (full-attention layers, all history) + SWA pool (SWA layers, window only) |
| **Index mapping** | `full_to_swa_index_mapping` translates full-pool indices to SWA-pool indices |
| **Allocation** | Atomic dual-pool alloc with rollback on SWA exhaustion |
| **Extend eviction** | Proactive per-chunk; skips first 2 chunks for overlap safety |
| **Decode eviction** | Periodic during decode; skips batch 0 for overlap safety |
| **Eviction algorithm** | `_evict_swa`: advance watermark, page-align, free SWA slots in `[old, new)` |
| **RadixCache SWA** | Not supported; hybrid models must use `--disable-radix-cache` |

python/sgl_jax/srt/layers/attention/flashattention_backend.py

Lines changed: 52 additions & 12 deletions
```diff
@@ -41,6 +41,7 @@ class FlashAttentionMetadata:
     seq_lens: jax.Array = None
     distribution: jax.Array = None
     custom_mask: jax.Array = None
+    swa_page_indices: jax.Array = None

     def tree_flatten(self):
         children = (
@@ -51,6 +52,7 @@ def tree_flatten(self):
             self.seq_lens,
             self.distribution,
             self.custom_mask,
+            self.swa_page_indices,
         )

         aux_data = {}
@@ -67,6 +69,7 @@ def tree_unflatten(cls, aux_data, children):
         obj.seq_lens = children[4]
         obj.distribution = children[5]
         obj.custom_mask = children[6]
+        obj.swa_page_indices = children[7]

         return obj
@@ -96,6 +99,7 @@ def __init__(
         self.kv_partition_axis = kv_partition_axis
         self.forward_metadata = nnx.data(FlashAttentionMetadata())
         self.mesh = mesh
+        self.swa_index_mapping = None

     def get_forward_metadata(
         self,
@@ -151,17 +155,47 @@ def get_forward_metadata(
         else:
             raise ValueError(f"Invalid forward mode: {batch.forward_mode}")

-        (
-            metadata.num_seqs,
-            metadata.cu_q_lens,
-            metadata.cu_kv_lens,
-            metadata.page_indices,
-            metadata.seq_lens,
-            metadata.distribution,
-        ) = device_array(
-            (num_seqs, cu_q_lens, cu_kv_lens, page_indices, seq_lens, distribution),
-            sharding=(NamedSharding(self.mesh, P()) if jax.process_count() == 1 else None),
-        )
+        # Compute swa_page_indices if SWA index mapping is available
+        swa_page_indices = None
+        if self.swa_index_mapping is not None:
+            swa_cache_loc = self.swa_index_mapping[batch.cache_loc]
+            swa_indices = np.arange(0, len(swa_cache_loc), self.page_size)
+            swa_selected = swa_cache_loc[swa_indices]
+            swa_page_indices = (swa_selected // self.page_size).astype(np.int32)
+
+        if swa_page_indices is not None:
+            (
+                metadata.num_seqs,
+                metadata.cu_q_lens,
+                metadata.cu_kv_lens,
+                metadata.page_indices,
+                metadata.seq_lens,
+                metadata.distribution,
+                metadata.swa_page_indices,
+            ) = device_array(
+                (
+                    num_seqs,
+                    cu_q_lens,
+                    cu_kv_lens,
+                    page_indices,
+                    seq_lens,
+                    distribution,
+                    swa_page_indices,
+                ),
+                sharding=(NamedSharding(self.mesh, P()) if jax.process_count() == 1 else None),
+            )
+        else:
+            (
+                metadata.num_seqs,
+                metadata.cu_q_lens,
+                metadata.cu_kv_lens,
+                metadata.page_indices,
+                metadata.seq_lens,
+                metadata.distribution,
+            ) = device_array(
+                (num_seqs, cu_q_lens, cu_kv_lens, page_indices, seq_lens, distribution),
+                sharding=(NamedSharding(self.mesh, P()) if jax.process_count() == 1 else None),
+            )
         return metadata

     def get_eagle_forward_metadata(self, batch: ModelWorkerBatch):
@@ -454,7 +488,13 @@ def __call__(
             causal = 0
         # Select page indices and remap to SWA pool if KV cache supports it
         page_indices_arg = self.forward_metadata.page_indices
-        if hasattr(token_to_kv_pool, "remap_cache_loc") and self.page_size == 1:
+        if self.forward_metadata.swa_page_indices is not None and hasattr(
+            token_to_kv_pool, "layers_mapping"
+        ):
+            _, is_swa = token_to_kv_pool.layers_mapping[layer.layer_id]
+            if is_swa:
+                page_indices_arg = self.forward_metadata.swa_page_indices
+        elif hasattr(token_to_kv_pool, "remap_cache_loc") and self.page_size == 1:
             page_indices_arg = token_to_kv_pool.remap_cache_loc(page_indices_arg, layer.layer_id)

         in_specs = (
```
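The `swa_page_indices` computation in `get_forward_metadata` above is three numpy steps: remap full-pool slots through the mapping, take one slot per page, and divide by the page size. A standalone sketch with toy values (the identity mapping and the concrete `cache_loc` are illustrative, not real runtime data):

```python
import numpy as np

# Illustrative inputs; at runtime the mapping and cache_loc come from the allocator.
page_size = 4
swa_index_mapping = np.arange(64)  # identity mapping, just for the sketch
cache_loc = np.array([8, 9, 10, 11, 20, 21, 22, 23])  # full-pool token slots

# Same three steps as the diff above:
swa_cache_loc = swa_index_mapping[cache_loc]                # full -> SWA token slots
swa_indices = np.arange(0, len(swa_cache_loc), page_size)   # first slot of each page
swa_selected = swa_cache_loc[swa_indices]
swa_page_indices = (swa_selected // page_size).astype(np.int32)

assert swa_page_indices.tolist() == [2, 5]  # SWA pages 2 and 5
```

Because only the first token of each page is sampled, this assumes page-aligned `cache_loc` runs, which is what the paged allocator produces.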

python/sgl_jax/srt/managers/schedule_batch.py

Lines changed: 76 additions & 0 deletions
```diff
@@ -65,6 +65,7 @@
     "speculative_accept_threshold_single",
     "speculative_accept_threshold_acc",
     "enable_deterministic_sampling",
+    "chunked_prefill_size",
 ]

 PADDING_BUCKETS = [1 << i for i in range(6, 21)]
@@ -306,6 +307,11 @@ def __init__(
         ) = None
         self.hidden_states: list[list[float]] = []

+        # SWA eviction tracking
+        self.swa_evicted_seqlen: int = 0
+        self.extend_batch_idx: int = 0
+        self.decode_batch_idx: int = 0
+
         # The number of cached tokens that were already cached in the KV cache
         self.cached_tokens = 0
         self.already_computed = 0
@@ -741,6 +747,63 @@ def mix_with_running(self, running_batch: ScheduleBatch):
         self.extend_num_tokens += running_bs
         self.extend_logprob_start_lens.extend([0] * running_bs)

+    def _evict_swa(self, req: Req, pre_len: int, sliding_window_size: int, page_size: int):
+        """Evict SWA pool tokens outside the sliding window for a single request."""
+        new_evicted = max(req.swa_evicted_seqlen, pre_len - sliding_window_size)
+        if page_size > 1:
+            new_evicted = (new_evicted // page_size) * page_size
+        if new_evicted <= req.swa_evicted_seqlen:
+            return
+        free_slots = self.req_to_token_pool.req_to_token[
+            req.req_pool_idx, req.swa_evicted_seqlen : new_evicted
+        ]
+        # Count actual SWA slots that will be freed (those with active mapping)
+        num_swa_freed = self.token_to_kv_pool_allocator.count_swa_mapped(free_slots)
+        self.token_to_kv_pool_allocator.free_swa(free_slots)
+        # Notify cache layer: these slots were protected (node is locked),
+        # so adjust swa_protected_size_ to prevent bookkeeping leak.
+        if num_swa_freed > 0 and isinstance(self.tree_cache, SWARadixCache):
+            self.tree_cache.adjust_swa_protected_size(-num_swa_freed)
+        req.swa_evicted_seqlen = new_evicted
+
+    def maybe_evict_swa(self, sliding_window_size=None):
+        """Evict SWA pool tokens for all requests if hybrid model."""
+        if not self.is_hybrid:
+            return
+        if sliding_window_size is None:
+            sliding_window_size = getattr(self.model_config, "sliding_window", None)
+        if sliding_window_size is None or sliding_window_size <= 0:
+            return
+        page_size = getattr(
+            self.token_to_kv_pool_allocator,
+            "_page_size",
+            getattr(self.token_to_kv_pool_allocator, "page_size", 1),
+        )
+
+        if self.forward_mode is not None and self.forward_mode.is_decode():
+            # Evict at the smaller of sliding_window_size and page_size to avoid
+            # stale SWA slot accumulation.
+            evict_interval = max(min(sliding_window_size, page_size), 1)
+            for req in self.reqs:
+                if req.decode_batch_idx > 0 and (
+                    evict_interval <= 1 or req.decode_batch_idx % evict_interval == 1
+                ):
+                    self._evict_swa(req, req.seqlen - 1, sliding_window_size, page_size)
+            return
+
+        if self.forward_mode is None or not self.forward_mode.is_extend():
+            return
+
+        for i, req in enumerate(self.reqs):
+            pre_len = self.prefix_lens[i] if self.prefix_lens is not None else 0
+            if self.enable_overlap and req.is_chunked > 0:
+                if req.extend_batch_idx < 2:
+                    continue
+                chunked_prefill_size = global_server_args_dict.get("chunked_prefill_size")
+                if chunked_prefill_size is not None and chunked_prefill_size > 0:
+                    pre_len -= chunked_prefill_size
+            self._evict_swa(req, pre_len, sliding_window_size, page_size)
+
     def prepare_for_extend(self):
         self.forward_mode = ForwardMode.EXTEND

@@ -776,6 +839,7 @@ def prepare_for_extend(self):
             req.cached_tokens += pre_len - req.already_computed
             req.already_computed = seq_len
             req.is_retracted = False
+            req.extend_batch_idx += 1

             # Compute the relative logprob_start_len in an extend batch
             if req.logprob_start_len >= pre_len:
@@ -878,6 +942,10 @@ def prepare_for_extend(self):
             )
             pt += extend_lens[i]

+        # Evict SWA tokens outside sliding window
+        if self.is_hybrid:
+            self.maybe_evict_swa()
+
         # Build sampling info
         self.sampling_info = SamplingBatchInfo.from_schedule_batch(
             self,
@@ -1019,6 +1087,11 @@ def prepare_for_idle(self):
     def prepare_for_decode(self):
         self.forward_mode = ForwardMode.DECODE
         bs = len(self.reqs)
+
+        # Evict SWA tokens outside sliding window
+        if self.is_hybrid:
+            self.maybe_evict_swa()
+
         if self.spec_algorithm is not None and self.spec_algorithm.is_eagle():
             # if spec decoding is used, the decode batch is prepared inside
             # `forward_batch_speculative_generation` after running draft models.
@@ -1068,6 +1141,9 @@ def prepare_for_decode(self):
             (self.req_pool_indices, locs), self.out_cache_loc.astype(np.int32)
         )

+        for req in self.reqs:
+            req.decode_batch_idx += 1
+
     def filter_batch(
         self,
         chunked_req_to_exclude: Req | list[Req] | None = None,
```
