Describe the bug
We evaluated the hierarchical KV cache (HiCache) performance on SGLang v0.5.10, comparing an MLA model and a GQA model across different cache scenarios. The metrics collected are TPOT, TTFT, and end-to-end latency.
- GQA model: MiniMax M2.7
- MLA model: GLM-4.7-Flash
We tested the following three scenarios:
1. Full Recompute: HiCache disabled (i.e., without --enable-hierarchical-cache).
2. HBM Hit: HiCache enabled (L1/L2 only, no L3 backend), with a 90% hit rate.
3. L3 Backend Hit: HiCache enabled with a configured L3 storage backend, with a 90% hit rate. We tested two backends: hifile and UCM (https://github.com/ModelEngine-Group/unified-cache-management).
Test Result:
- For the GQA model (MiniMax M2.7): The results match expectations. TPOT remains normal across all scenarios. In scenarios 2 and 3, we observed clear TTFT improvements under long-input, high-concurrency workloads.
Input Length: 8192, Output: 512, Concurrency: 16
| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
| --- | --- | --- | --- |
| Full Recompute | 16.45 | 1725.09 | 10132.96 |
| HBM Hit | 14.13 | 481.69 | 7701.72 |
| Hifile Hit | 16.05 | 2450.24 | 10652.99 |
| UCM Hit | 15.54 | 1447.09 | 9290.44 |
Input Length: 131072, Output: 512, Concurrency: 1
| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
| --- | --- | --- | --- |
| Full Recompute | 9.21 | 5117.95 | 9825.68 |
| HBM Hit | 9.07 | 1027.18 | 5663.57 |
| Hifile Hit | 9.25 | 5249.94 | 9975.92 |
| UCM Hit | 9.04 | 2697 | 7315.72 |
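As a sanity check on how the three metrics relate, the sketch below tests whether E2E is approximately TTFT + (output length - 1) × TPOT for the GQA rows above. It assumes TPOT is the mean inter-token latency over the tokens after the first one; the benchmark tool's exact definition is not stated in this report, and the names `rows`, `predicted_e2e`, and `relative_error` are illustrative.

```python
# Rows copied from the GQA tables above: (label, TPOT ms, TTFT ms, E2E ms).
OUTPUT_LEN = 512

rows = [
    ("8192 / Full Recompute", 16.45, 1725.09, 10132.96),
    ("8192 / HBM Hit", 14.13, 481.69, 7701.72),
    ("8192 / Hifile Hit", 16.05, 2450.24, 10652.99),
    ("8192 / UCM Hit", 15.54, 1447.09, 9290.44),
    ("131072 / Full Recompute", 9.21, 5117.95, 9825.68),
    ("131072 / HBM Hit", 9.07, 1027.18, 5663.57),
    ("131072 / Hifile Hit", 9.25, 5249.94, 9975.92),
    ("131072 / UCM Hit", 9.04, 2697.0, 7315.72),
]


def predicted_e2e(tpot_ms, ttft_ms, output_len=OUTPUT_LEN):
    """Estimate E2E, assuming TPOT averages the (output_len - 1) decode steps."""
    return ttft_ms + (output_len - 1) * tpot_ms


def relative_error(predicted, measured):
    """Absolute difference as a fraction of the measured value."""
    return abs(predicted - measured) / measured


errors = {
    label: relative_error(predicted_e2e(tpot, ttft), e2e)
    for label, tpot, ttft, e2e in rows
}

for label, err in errors.items():
    print(f"{label:>24}: {err:.2%}")
```

Under that assumption every GQA row agrees with the measured E2E to within 2%, so differences in E2E can be read directly as differences in TTFT and TPOT.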
- For the MLA model (GLM-4.7-Flash): The results are unexpected. TPOT is acceptable only in Scenario 1 (Full Recompute, i.e., without --enable-hierarchical-cache). With HiCache enabled (Scenarios 2 and 3), TPOT degrades significantly, and the degradation grows with input length.
Input Length: 8192, Output: 512, Concurrency: 16
| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
| --- | --- | --- | --- |
| Full Recompute | 12.53 | 1114.29 | 7515.15 |
| HBM Hit | 17.79 | 420.55 | 9510.62 |
| Hifile Hit | 18.95 | 2586.86 | 12270.21 |
| UCM Hit | 18.89 | 1065.07 | 10716.84 |
Input Length: 32768, Output: 512, Concurrency: 4
| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
| --- | --- | --- | --- |
| Full Recompute | 10.94 | 1557.5 | 7147.77 |
| HBM Hit | 42.28 | 378.8 | 21986.12 |
| Hifile Hit | 40.24 | 4595.19 | 25155.9 |
| UCM Hit | 40.07 | 1302.31 | 21778.21 |
Input Length: 131072, Output: 512, Concurrency: 1
| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
| --- | --- | --- | --- |
| Full Recompute | 8.71 | 5263.76 | 9715.19 |
| HBM Hit | 136.13 | 1099.65 | 70663.87 |
| Hifile Hit | 135.64 | 12790.47 | 82102.65 |
| UCM Hit | 136.8 | 3295.84 | 73201.72 |
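To make the trend in the MLA results easier to see, the following sketch computes the TPOT slowdown of each HiCache scenario relative to Full Recompute at each input length. The values are copied from the MLA tables above; the dictionary layout and the function name `tpot_slowdown` are illustrative only.

```python
# TPOT (ms) for GLM-4.7-Flash, keyed by input length, copied from the MLA tables.
mla_tpot_ms = {
    8192: {"Full Recompute": 12.53, "HBM Hit": 17.79, "Hifile Hit": 18.95, "UCM Hit": 18.89},
    32768: {"Full Recompute": 10.94, "HBM Hit": 42.28, "Hifile Hit": 40.24, "UCM Hit": 40.07},
    131072: {"Full Recompute": 8.71, "HBM Hit": 136.13, "Hifile Hit": 135.64, "UCM Hit": 136.8},
}


def tpot_slowdown(tpot_by_scenario, baseline="Full Recompute"):
    """Return TPOT of each non-baseline scenario divided by the baseline TPOT."""
    base = tpot_by_scenario[baseline]
    return {
        name: value / base
        for name, value in tpot_by_scenario.items()
        if name != baseline
    }


slowdowns = {length: tpot_slowdown(tpot) for length, tpot in mla_tpot_ms.items()}

for length, by_scenario in slowdowns.items():
    formatted = ", ".join(f"{name}: {ratio:.2f}x" for name, ratio in by_scenario.items())
    print(f"input {length:>6}: {formatted}")
```

For HBM Hit this gives roughly 1.4x, 3.9x, and 15.6x at 8192, 32768, and 131072 input tokens. Each configuration holds the same total of 131072 input tokens across its concurrent requests (8192 × 16, 32768 × 4, 131072 × 1), so in these runs the slowdown grows with per-request input length while the total stays fixed.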
Reproduction
Full Recompute:
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800
HBM Hit:
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800 \
  --enable-hierarchical-cache \
  --hicache-mem-layout page_first \
  --hicache-write-policy write_through
Hifile Hit:
export SGLANG_HICACHE_FILE_BACKEND_STORAGE_DIR=/mnt/test
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800 \
  --enable-hierarchical-cache \
  --hicache-mem-layout page_first \
  --hicache-write-policy write_through \
  --hicache-storage-backend file \
  --hicache-storage-prefetch-policy wait_complete
UCM Hit:
HICACHE_CONFIG='{
  "backend_name": "unifiedcache",
  "module_path": "ucm.integration.sglang.unifiedcache_store",
  "class_name": "UnifiedCacheStore",
  "interface_v1": 1,
  "kv_connector_extra_config": {
    "ucm_connector_name": "UcmPipelineStore",
    "ucm_connector_config": {
      "storage_backends": "/mnt/test",
      "posix_io_engine": "aio"
    }
  }
}'
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800 \
  --enable-hierarchical-cache \
  --hicache-mem-layout page_first \
  --hicache-write-policy write_through \
  --hicache-storage-backend dynamic \
  --hicache-storage-prefetch-policy wait_complete \
  --hicache-storage-backend-extra-config "$HICACHE_CONFIG"
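Hand-written JSON inside a shell variable is easy to break with a stray quote or comma. As a hedged alternative, the sketch below builds the same extra-config string in Python and shell-quotes it. The keys and values are copied from the command above; the variable names are illustrative, and nothing here is part of the SGLang or UCM API.

```python
import json
import shlex

# Keys and values copied from the HICACHE_CONFIG used in the UCM Hit command.
hicache_config = {
    "backend_name": "unifiedcache",
    "module_path": "ucm.integration.sglang.unifiedcache_store",
    "class_name": "UnifiedCacheStore",
    "interface_v1": 1,
    "kv_connector_extra_config": {
        "ucm_connector_name": "UcmPipelineStore",
        "ucm_connector_config": {
            "storage_backends": "/mnt/test",
            "posix_io_engine": "aio",
        },
    },
}

# Serialize compactly so the value is a single-line JSON string.
config_json = json.dumps(hicache_config, separators=(",", ":"))

# Round-trip to confirm the string parses back to the same structure.
if json.loads(config_json) != hicache_config:
    raise ValueError("serialized config does not round-trip")

# shlex.quote makes the JSON safe to paste as one shell argument.
print(f"--hicache-storage-backend-extra-config {shlex.quote(config_json)}")
```

The printed line can be appended to the `sglang serve` command in place of the `"$HICACHE_CONFIG"` argument.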
Environment
Python: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 570.195.03
PyTorch: 2.9.1+cu129
sglang: 0.5.10
sglang-kernel: 0.4.1
flashinfer_python: 0.6.7.post2
flashinfer_cubin: 0.6.7.post2
flashinfer_jit_cache: 0.6.7.post2+cu129
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.5
fastapi: 0.135.3
huggingface_hub: 1.9.0
interegular: 0.3.3
modelscope: 1.35.3
orjson: 3.11.8
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.22
pyzmq: 27.1.0
uvicorn: 0.43.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.32
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.89.0
litellm: Module Not Found
torchcodec: 0.9.1