
Hybrid kv cache for LLaMA4 #6563


Merged
merged 45 commits into sgl-project:main on Jun 28, 2025

Conversation

Contributor
@tarinkk commented May 24, 2025

Motivation

LLaMA 4 uses local (chunked) attention in 3/4 of its layers. To exploit this, we split the KV cache into two parts: a global cache for the full-attention layers and a local cache for the local-attention layers. Determining the optimal ratio between their sizes is nontrivial, so we introduce a tunable parameter p, where 0 ≤ p ≤ 1.

  • When p = 1, the ratio of the global to the local cache size equals context_length / attention_chunk_size (attention_chunk_size is 8192 for LLaMA 4 Scout).
  • When p = 0, the two caches are the same size.
  • The ratio interpolates linearly as p varies from 0 to 1 (see the sketch below). By default, we set p = 0.5.
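
As a rough, hedged illustration of the interpolation above (this is not the exact formula used in model_runner.py; the function name global_to_local_ratio and its arguments are made up for exposition):

def global_to_local_ratio(p: float, context_length: int, attention_chunk_size: int) -> float:
    # Illustrative only: linearly interpolate the global-to-local cache size ratio.
    #   p = 0 -> the two caches are the same size (ratio 1)
    #   p = 1 -> ratio = context_length / attention_chunk_size
    full_ratio = context_length / attention_chunk_size
    return (1.0 - p) * 1.0 + p * full_ratio

# Example: context_length = 100_000, attention_chunk_size = 8192, p = 0.5
# -> ratio ≈ 0.5 * 1 + 0.5 * 12.2 ≈ 6.6, i.e. the global cache gets roughly
#    6.6x as many token slots as the local cache.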

Currently, we disable the radix tree, so prefix matching is not a concern.

During local attention, certain KV cache entries can be safely evicted (see the sketch after this list):

  • In chunked prefill: entries at positions below attention_chunk_size * (prelen // attention_chunk_size) are no longer needed and can be evicted.
  • In decoding: entries at positions below attention_chunk_size * ((seqlen - 1) // attention_chunk_size) are similarly unused and can be discarded.
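
A minimal sketch of the eviction boundary described above (illustrative only; local_evict_boundary, num_tokens, prelen, and seqlen are names chosen here for exposition, not the PR's code):

def local_evict_boundary(num_tokens: int, attention_chunk_size: int, decoding: bool) -> int:
    # Return the position below which local-attention KV entries fall outside
    # the current attention chunk and can therefore be evicted.
    # num_tokens is prelen during chunked prefill and seqlen during decoding.
    n = num_tokens - 1 if decoding else num_tokens
    return attention_chunk_size * (n // attention_chunk_size)

# With attention_chunk_size = 8192:
#   chunked prefill, prelen = 20000 -> boundary 16384: positions [0, 16384) evictable
#   decoding,        seqlen = 8193  -> boundary 8192:  positions [0, 8192)  evictable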

Modifications

  1. Add a server argument hybrid_kvcache_ratio with default value 0.5. This enables the hybrid KV cache mode and controls the global-to-local cache size ratio.

  2. In model_config.py: add is_hybrid_model() to determine whether the current model configuration satisfies the conditions for enabling hybrid KV caching.

  3. In model_runner.py:

    • Implement get_num_token_hybrid() to compute the sizes of the global and local KV caches
    • Initialize token_to_kv_pool_allocator_local to allocate local cache indices
  4. In memory_pool.py:

    • In ReqToTokenPool, add a new attribute req_to_token_local to store each request's local KV cache indices
    • Modify MHATokenToKVPool._create_buffer to create both the global and local cache buffers.
  5. In schedule_batch.py:

    • In prepare_for_extend() and prepare_for_decode(), allocate out_cache_loc_local for local-attention KV indices and store them in req_to_token_pool.req_to_token_local.
    • Apply the new eviction rule via self.tree_cache.evict_hybrid() right before allocating new indices.
  6. In chunk_cache.py:

    • evict_hybrid() is defined to apply the new eviction rule in chunked prefill and decoding.
    • Modify cache_finished_req() to free local indices once requests finish.
  7. In flashattention_backend.py:

    • When the hybrid cache is enabled, set cache_loc = forward_batch.out_cache_loc_local in both normal decode and extend forward.
    • The page_tables in the metadata are updated correspondingly.
  8. Essential modifications to memory computations (see the helper sketch after this list):

    Everywhere we use token_to_kv_pool_allocator.available_size(), replace it with

    min(token_to_kv_pool_allocator.available_size(), token_to_kv_pool_allocator_local.available_size())

  9. Some essential changes to support CUDA graphs.
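
For item 8, a minimal sketch of the intended guard (the helper name hybrid_available_size is hypothetical; the PR applies the min() inline rather than through a helper):

def hybrid_available_size(full_allocator, local_allocator=None) -> int:
    # With the hybrid cache, the schedulable token budget is bounded by
    # whichever pool (global or local) has fewer free slots.
    if local_allocator is None:
        return full_allocator.available_size()
    return min(full_allocator.available_size(), local_allocator.available_size())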

Experiments

LooGLE evaluation on H100:

Enabling the hybrid KV cache increases throughput by ~10% relative to the baseline.

  • With hybrid KV cache (total time ~694 s, throughput ~222 tokens/s):
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30002 --tp 8 --mem-fraction-static 0.8 --context-length 100000 --attention-backend fa3 --disable-radix-cache --hybrid-kvcache-ratio 0.95
  • Baseline (total time ~746 s, throughput ~204 tokens/s):
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30002 --tp 8 --mem-fraction-static 0.8 --context-length 100000 --attention-backend fa3 --disable-radix-cache

Context Length Improvements with Hybrid KV Cache

On H100:
Enabling the hybrid KV cache significantly increases the maximum context length, from 1.3M to 5M tokens:

  • With hybrid KV cache (5M context length):
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30002 --tp 8 --context-length 5000000 --attention-backend fa3 --disable-radix-cache --hybrid-kvcache-ratio 1 --cuda-graph-max-bs 16 --max-running-requests 16
  • Baseline (1.3M context length):
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30002 --tp 8 --context-length 1300000 --attention-backend fa3 --cuda-graph-max-bs 16 --max-running-requests 16

On H200:
With the hybrid KV cache enabled, the maximum context length for LLaMA 4 reaches 10M tokens, compared to the 3.5M-token baseline:

  • With hybrid KV cache (10M context length):
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30002 --tp 8 --context-length 10000000 --attention-backend fa3 --disable-radix-cache --hybrid-kvcache-ratio 1 --cuda-graph-max-bs 32 --max-running-requests 32
  • Baseline (3.5M context length):
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30002 --tp 8 --context-length 3500000 --attention-backend fa3 --cuda-graph-max-bs 32 --max-running-requests 32

TODO

  1. Enable support when page_size > 1
  2. Apply the eviction rule when the radix tree is enabled
  3. ...

@tarinkk changed the title [WIP]hybrid kv cache for LlaMa4 [WIP]hybrid kv cache for LlaMA4 on May 24, 2025
@tarinkk changed the title [WIP]hybrid kv cache for LlaMA4 [WIP]hybrid kv cache for LLaMA4 on May 24, 2025
@tarinkk force-pushed the llama4hybridCache branch from 5e66e89 to d053027 on May 25, 2025 00:29
@tarinkk marked this pull request as ready for review on May 25, 2025 00:29
@tarinkk force-pushed the llama4hybridCache branch from c9fa7ea to d1203cb on May 25, 2025 01:50
@@ -624,6 +626,9 @@ def forward_extend(
q_rope: Optional[torch.Tensor] = None,
k_rope: Optional[torch.Tensor] = None,
):
use_hybrid_loc = self.is_hybrid is not None and (
Collaborator

didn't find the usage of use_hybrid_loc

@@ -887,6 +892,9 @@ def forward_decode(
q_rope: Optional[torch.Tensor] = None,
k_rope: Optional[torch.Tensor] = None,
) -> torch.Tensor:
use_hybrid_loc = self.is_hybrid is not None and (
Collaborator

similar here

@@ -523,6 +526,8 @@ def __init__(
# Prefix info
# The indices to kv cache for the shared prefix.
self.prefix_indices: torch.Tensor = []
# The indices to local kv cache for the shared prefix.
self.prefix_indices_local: torch.Tensor = []
Collaborator

didn't find usage of prefix_indices_local

@@ -55,6 +57,11 @@ def __init__(
def debug_print(self) -> str:
return ""

def log_usage(self, evictable_size: int = 0):
num_used = self.size - (self.available_size() + evictable_size)
msg = f"#token: {num_used}, token usage: {num_used / self.size:.2f}, "
Collaborator
@hanming-lu Jun 27, 2025

should we show both swa and full token usage?

Contributor Author

I define log_usage for SWA case around line 216 in allocator.py.

Collaborator
@hanming-lu left a comment

Overall looks great! Left some comments, all small changes, thanks!

available_token_size = self.token_to_kv_pool_allocator.full_available_size()
else:
available_token_size = self.token_to_kv_pool_allocator.available_size()
available_size = available_token_size + self.tree_cache.evictable_size()
Collaborator
@hanming-lu Jun 27, 2025

We should use self.full_max_total_num_tokens and self.swa_max_total_num_tokens here; I think you already have them. Each determines the max total per full-attention / SWA layer, respectively. Then compare full_available_size + 0 == max_total_full_num_tokens and swa_available_size + 0 == max_total_swa_num_tokens.

@@ -113,7 +120,7 @@ def __init__(self, size: int, dtype: torch.dtype, device: str, kvcache: KVCache)
def clear(self):
# The padded slot 0 is used for writing dummy outputs from padded tokens.
self.free_pages = torch.arange(
1, self.size + 1, dtype=torch.int64, device=self.device
1, self.size + 1, dtype=torch.int32, device=self.device
Collaborator

what's the reason behind this change?

Contributor Author

I am not quite sure about this part. I will change it back to int64.

Contributor Author

I made all KV indices torch.int64; they are later converted to torch.int32 when building the page_table, in order to support flash_attn_with_kvcache.

device=device,
)
self.clear()
self._kvcache.register_mapping(weakref.proxy(self.full_to_swa_index_mapping))
Collaborator
@hanming-lu Jun 27, 2025

same question as gemini, better to explain the reason for weakref

Comment on lines 220 to 221
f"#token: global={used_full}, swa={used_swa}, "
f"token usage: global={used_full / self.size_full:.2f}, "
Collaborator

let's be consistent with naming, either full or global

Contributor Author

I will double-check the naming. Thank you so much for pointing out this problem.


def log_usage(self, evictable_size: int = 0):
used_full = self.size_full - (self.full_available_size() + evictable_size)
used_swa = self.size_swa - self.swa_available_size()
Collaborator

should pass in both swa_evictable_size and full_evictable_size. For SWAChunkCache, this value is always 0, but the logic here is cleaner.

* self.attention_chunk_size
/ self.model_config.context_len
)
self.local_max_total_num_tokens = (
Collaborator

consistent naming please, either local or swa

self.local_max_total_num_tokens = (
4 * self.max_total_num_tokens * temp_ratio // (3 * temp_ratio + 1)
)
self.max_total_num_tokens = (
Collaborator

please add full or global for consistent naming

@@ -852,6 +859,39 @@ def profile_max_num_token(self, total_gpu_memory: int):
max_num_token = int(rest_memory * (1 << 30) // cell_size)
return max_num_token

def get_num_token_hybrid(self):
Collaborator

Suggested change
def get_num_token_hybrid(self):
def set_num_token_hybrid(self):


if self.token_to_kv_pool_allocator is None:
if self.page_size == 1:
if self.is_hybrid is None:
Collaborator

better to do if self.is_hybrid:, easier for future additions

@@ -61,6 +61,7 @@ class ServerArgs:
is_embedding: bool = False
enable_multimodal: Optional[bool] = None
revision: Optional[str] = None
hybrid_kvcache_ratio: Optional[float] = None
Collaborator

didn't find usage of it

Contributor Author

It is used to read the mix ratio from the server argument parser.
hybrid_kvcache_ratio == 0: pure uniform: swa_size / full_size = 1.
hybrid_kvcache_ratio == 1.0: pure hybrid: swa_size / full_size = local_attention_size / context_length.
It is used in model_config.py around line 280.

Collaborator

I missed it.

Contributor Author

I have modified my code according to your comments. Thank you very much for your helpful suggestions.

@@ -63,3 +66,32 @@ def dec_lock_ref(self, node: Any):

def pretty_print(self):
return ""


class SWAChunkCache(ChunkCache):
Collaborator

what does SWA mean?

Contributor Author

I tried to keep my naming consistent with the names in #7367.
SWA means sliding window attention (I guess...).

@@ -431,6 +436,136 @@ def move_kv_cache(self, tgt_loc: torch.Tensor, src_loc: torch.Tensor):
)


class SWAKVPool(KVCache):
Collaborator
@CatherineSue Jun 27, 2025

Can we add a docstring for this to indicate its usage and meaning?

Contributor Author

added

Collaborator
@hanming-lu left a comment

Looks great! Thanks for addressing all the comments.

@@ -29,6 +29,7 @@
from sglang.srt.custom_op import CustomOp
from sglang.srt.distributed import get_tensor_model_parallel_rank
from sglang.srt.distributed.parallel_state import GroupCoordinator, graph_capture
from sglang.srt.layers.attention.flashattention_backend import FlashAttentionBackend
Member

We shouldn't import this on non-nv devices

Contributor Author

I rewrote this part.

@zhyncs merged commit eb6c2c1 into sgl-project:main on Jun 28, 2025
102 of 132 checks passed