[Bug] GLM-4.7-Flash: TPOT performance severely degraded when --enable-hierarchical-cache is on #26305

### Checklist

- [x] I searched related issues but found no solution.
- [x] The bug persists in the latest version.
- [x] Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- [x] If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- [x] Please use English. Otherwise, it will be closed.

### Describe the bug

We evaluated hierarchical KV cache (HiCache) performance on SGLang v0.5.10, comparing an MLA model and a GQA model across different cache scenarios. The metrics collected are TPOT (time per output token), TTFT (time to first token), and end-to-end (E2E) latency.
- GQA model: MiniMax M2.7
- MLA model: GLM-4.7-Flash
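
As a sanity check on how the three metrics relate, the sketch below assumes the usual definitions: TTFT covers the first output token and TPOT is the average gap between subsequent tokens, so E2E should be close to TTFT + (output tokens - 1) × TPOT. That relation is an assumption about the benchmark client, not something this report states. The helper name `expected_e2e_ms` is hypothetical; the numbers are copied from the GLM-4.7-Flash table for input length 131072 later in this report.

```python
def expected_e2e_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Estimate end-to-end latency from TTFT and TPOT.

    Assumes TTFT covers the first token and TPOT is the mean
    inter-token latency of the remaining (output_tokens - 1) tokens.
    """
    return ttft_ms + (output_tokens - 1) * tpot_ms


OUTPUT_TOKENS = 512

# (scenario, TPOT ms, TTFT ms, measured E2E ms) for GLM-4.7-Flash,
# input length 131072, concurrency 1, copied from the results table.
rows = [
    ("Full Recompute", 8.71, 5263.76, 9715.19),
    ("HBM Hit", 136.13, 1099.65, 70663.87),
]

relative_errors = {}
for scenario, tpot, ttft, measured in rows:
    estimate = expected_e2e_ms(ttft, tpot, OUTPUT_TOKENS)
    relative_errors[scenario] = abs(estimate - measured) / measured
    print(f"{scenario}: estimated {estimate:.2f} ms, measured {measured:.2f} ms")
```

If the estimate tracks the measured E2E closely, the E2E regression in the HiCache scenarios is explained by the decode phase (TPOT) rather than by prefill or cache loading.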

We tested the following three scenarios:

1. **Full Recompute:** HiCache disabled (i.e., without --enable-hierarchical-cache).
2. **HBM Hit:** HiCache enabled (L1/L2 only, no L3 backend), with a 90% hit rate.
3. **L3 Backend Hit:** HiCache enabled with a configured L3 storage backend, with a 90% hit rate. We tested two backends: hifile and [UCM](https://github.com/ModelEngine-Group/unified-cache-management).

Test results:

- For the GQA model (MiniMax M2.7): The results largely match expectations. TPOT stays close to the Full Recompute baseline in all scenarios. TTFT improves clearly with HBM Hit and UCM Hit, while Hifile Hit shows no TTFT gain over Full Recompute in these runs.

#### Input Length: 8192, Output: 512, Concurrency: 16

| Scenario | TPOT(ms) | TTFT(ms) | E2E(ms) |
|----------|----------|----------|---------|
| Full Recompute | 16.45 | 1725.09 | 10132.96 |
| HBM Hit | 14.13 | 481.69 | 7701.72 |
| Hifile Hit | 16.05 | 2450.24 | 10652.99 |
| UCM Hit | 15.54 | 1447.09 | 9290.44 |

#### Input Length: 131072, Output: 512, Concurrency: 1

| Scenario | TPOT(ms) | TTFT(ms) | E2E(ms) |
|----------|----------|----------|---------|
| Full Recompute | 9.21 | 5117.95 | 9825.68 |
| HBM Hit | 9.07 | 1027.18 | 5663.57 |
| Hifile Hit | 9.25 | 5249.94 | 9975.92 |
| UCM Hit | 9.04 | 2697 | 7315.72 |

- For the MLA model (GLM-4.7-Flash): The results are unexpected. TPOT is acceptable only in Scenario 1 (Full Recompute, i.e., without --enable-hierarchical-cache). With HiCache enabled (Scenarios 2 and 3), TPOT degrades sharply, and the degradation grows with input length: from roughly 1.4x the Full Recompute TPOT at 8192 input tokens to roughly 15.6x at 131072 input tokens (HBM Hit).

#### Input Length: 8192, Output: 512, Concurrency: 16

| Scenario | TPOT(ms) | TTFT(ms) | E2E(ms) |
|----------|----------|----------|---------|
| Full Recompute | 12.53 | 1114.29 | 7515.15 |
| HBM Hit | 17.79 | 420.55 | 9510.62 |
| Hifile Hit | 18.95 | 2586.86 | 12270.21 |
| UCM Hit | 18.89 | 1065.07 | 10716.84 |

#### Input Length: 32768, Output: 512, Concurrency: 4

| Scenario | TPOT(ms) | TTFT(ms) | E2E(ms) |
|----------|----------|----------|---------|
| Full Recompute | 10.94 | 1557.5 | 7147.77 |
| HBM Hit | **42.28** | 378.8 | 21986.12 |
| Hifile Hit | **40.24** | 4595.19 | 25155.9 |
| UCM Hit | **40.07** | 1302.31 | 21778.21 |

#### Input Length: 131072, Output: 512, Concurrency: 1

| Scenario | TPOT(ms) | TTFT(ms) | E2E(ms) |
|----------|----------|----------|---------|
| Full Recompute | 8.71 | 5263.76 | 9715.19 |
| HBM Hit | **136.13** | 1099.65 | 70663.87 |
| Hifile Hit | **135.64** | 12790.47 | 82102.65 |
| UCM Hit | **136.8** | 3295.84 | 73201.72 |
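
To make the trend easier to read, the following sketch computes the TPOT slowdown of each HiCache scenario relative to Full Recompute, using only the GLM-4.7-Flash numbers from the tables above. The function name `tpot_slowdown` is illustrative. Note that input length × concurrency is 131072 tokens in all three configurations, so the total number of prompt tokens in flight is constant while the per-request context grows.

```python
# TPOT values (ms) copied from the GLM-4.7-Flash tables above.
# Keys are (input_length, concurrency).
glm_tpot_ms = {
    (8192, 16): {"Full Recompute": 12.53, "HBM Hit": 17.79,
                 "Hifile Hit": 18.95, "UCM Hit": 18.89},
    (32768, 4): {"Full Recompute": 10.94, "HBM Hit": 42.28,
                 "Hifile Hit": 40.24, "UCM Hit": 40.07},
    (131072, 1): {"Full Recompute": 8.71, "HBM Hit": 136.13,
                  "Hifile Hit": 135.64, "UCM Hit": 136.8},
}


def tpot_slowdown(table: dict) -> dict:
    """Return each HiCache scenario's TPOT divided by the Full Recompute TPOT."""
    baseline = table["Full Recompute"]
    return {
        name: round(value / baseline, 2)
        for name, value in table.items()
        if name != "Full Recompute"
    }


slowdowns = {config: tpot_slowdown(table) for config, table in glm_tpot_ms.items()}
for (input_len, concurrency), ratios in slowdowns.items():
    print(f"input={input_len} concurrency={concurrency}: {ratios}")
```

The slowdown is nearly identical across HBM Hit, Hifile Hit, and UCM Hit within each configuration, which suggests the regression is tied to enabling HiCache itself rather than to a particular L3 backend.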


### Reproduction

**Full Recompute:**
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800

**HBM Hit:**
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800 \
  --enable-hierarchical-cache \
  --hicache-mem-layout page_first \
  --hicache-write-policy write_through

**Hifile Hit:**
export SGLANG_HICACHE_FILE_BACKEND_STORAGE_DIR=/mnt/test
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800 \
  --enable-hierarchical-cache \
  --hicache-mem-layout page_first \
  --hicache-write-policy write_through \
  --hicache-storage-backend file \
  --hicache-storage-prefetch-policy wait_complete

**UCM Hit:**
HICACHE_CONFIG='{
  "backend_name":"unifiedcache",
  "module_path":"ucm.integration.sglang.unifiedcache_store",
  "class_name":"UnifiedCacheStore",
  "interface_v1":1,
  "kv_connector_extra_config":{
    "ucm_connector_name":"UcmPipelineStore",
    "ucm_connector_config":{
      "storage_backends":"/mnt/test",
      "posix_io_engine": "aio"
    }
  }
}'
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800 \
  --enable-hierarchical-cache \
  --hicache-mem-layout page_first \
  --hicache-write-policy write_through \
  --hicache-storage-backend dynamic \
  --hicache-storage-prefetch-policy wait_complete \
  --hicache-storage-backend-extra-config "$HICACHE_CONFIG"
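
Because the UCM configuration is hand-written JSON passed through a shell variable, a stray tab or quote can silently change what the server receives. The sketch below is one way to generate and validate the same string; the keys and values are copied from the command above, and using Python to build the string is our suggestion rather than part of the original reproduction.

```python
import json
import shlex

# Keys and values copied from the UCM reproduction command in this report.
hicache_config = {
    "backend_name": "unifiedcache",
    "module_path": "ucm.integration.sglang.unifiedcache_store",
    "class_name": "UnifiedCacheStore",
    "interface_v1": 1,
    "kv_connector_extra_config": {
        "ucm_connector_name": "UcmPipelineStore",
        "ucm_connector_config": {
            "storage_backends": "/mnt/test",
            "posix_io_engine": "aio",
        },
    },
}

# Compact single-line JSON, free of tabs and newlines.
config_json = json.dumps(hicache_config, separators=(",", ":"))

# Round-trip to confirm the string parses back to the same structure.
assert json.loads(config_json) == hicache_config

# shlex.quote makes the value safe to paste into a POSIX shell command line.
print("--hicache-storage-backend-extra-config", shlex.quote(config_json))
```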

### Environment

Python: 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 570.195.03
PyTorch: 2.9.1+cu129
sglang: 0.5.10
sglang-kernel: 0.4.1
flashinfer_python: 0.6.7.post2
flashinfer_cubin: 0.6.7.post2
flashinfer_jit_cache: 0.6.7.post2+cu129
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.5
fastapi: 0.135.3
huggingface_hub: 1.9.0
interegular: 0.3.3
modelscope: 1.35.3
orjson: 3.11.8
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.22
pyzmq: 27.1.0
uvicorn: 0.43.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.32
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.89.0
litellm: Module Not Found
torchcodec: 0.9.1
