[Bug] GLM-4.7-Flash: TPOT performance severely degraded when --enable-hierarchical-cache is on #26305

@lc1314555

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

We evaluated the hierarchical KV cache (HiCache) performance on SGLang v0.5.10, comparing an MLA model and a GQA model across different cache scenarios. The metrics collected are TPOT, TTFT, and end-to-end latency.
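As a sanity check on how the three metrics relate, here is a minimal sketch. It assumes TPOT is the mean time per output token after the first one, so that E2E is roughly TTFT + (output_tokens - 1) * TPOT. The function name `estimate_e2e_ms` is only illustrative; the numbers are the GLM-4.7-Flash figures reported in this issue for input length 131072, output 512, concurrency 1.

```python
def estimate_e2e_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: first token, then (n - 1) decode steps.

    Assumption: TPOT excludes the first token (mean inter-token time).
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be >= 1")
    return ttft_ms + (output_tokens - 1) * tpot_ms


# scenario -> (TPOT ms, TTFT ms, reported E2E ms), copied from this issue
rows = {
    "Full Recompute": (8.71, 5263.76, 9715.19),
    "HBM Hit": (136.13, 1099.65, 70663.87),
}

for name, (tpot, ttft, reported_e2e) in rows.items():
    estimated = estimate_e2e_ms(ttft, tpot, 512)
    print(f"{name}: estimated {estimated:.2f} ms, reported {reported_e2e:.2f} ms")
```

For both rows the estimate agrees closely with the reported E2E, which suggests the E2E regression on GLM-4.7-Flash is explained by the per-token decode time rather than by TTFT.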

  • GQA model: MiniMax M2.7
  • MLA model: GLM-4.7-Flash

We tested the following three scenarios:
  1. Full Recompute: HiCache disabled (i.e., without --enable-hierarchical-cache).
  2. HBM Hit: HiCache enabled (L1/L2 only, no L3 backend), with a 90% hit rate.
  3. L3 Backend Hit: HiCache enabled with a configured L3 storage backend, with a 90% hit rate. We tested two backends: hifile (--hicache-storage-backend file) and UCM (https://github.com/ModelEngine-Group/unified-cache-management).
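This report does not spell out how the 90% hit rate was produced. One plausible way to build such a workload, shown here only as a sketch, is to give every request a shared prefix covering about 90% of the input and a unique random suffix. The helper `build_prompts`, the `vocab_size` value, and the use of raw token IDs are all assumptions for illustration. The prefix is rounded down to a multiple of the page size (128 in the commands below), on the assumption that prefix reuse is page-granular.

```python
import random


def build_prompts(num_prompts, input_len, hit_rate, page_size=128,
                  vocab_size=32000, seed=0):
    """Return (prompts, shared_len): token-ID prompts sharing a common prefix.

    Hypothetical helper: vocab_size is a placeholder, not the model's real vocab.
    """
    rng = random.Random(seed)
    # Round the shared prefix down to a whole number of pages.
    shared_len = int(input_len * hit_rate) // page_size * page_size
    prefix = [rng.randrange(vocab_size) for _ in range(shared_len)]
    prompts = []
    for _ in range(num_prompts):
        # Each request gets its own random suffix, which cannot be served from cache.
        suffix = [rng.randrange(vocab_size) for _ in range(input_len - shared_len)]
        prompts.append(prefix + suffix)
    return prompts, shared_len


prompts, shared_len = build_prompts(num_prompts=16, input_len=8192, hit_rate=0.9)
print(f"shared prefix: {shared_len} tokens "
      f"({shared_len / 8192:.1%} of each prompt)")
```

Because of the page rounding, the effective shared fraction is slightly below the nominal 90%. The first request warms the cache, so only subsequent requests should count as hits.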

Test results:

  • For the GQA model (MiniMax M2.7): The results match expectations. TPOT stays close to the Full Recompute baseline in all scenarios. With long inputs, HBM Hit and UCM Hit show clear TTFT improvements. Hifile Hit does not: its TTFT is higher than Full Recompute in both workloads below.

Input Length: 8192, Output: 512, Concurrency: 16

| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
|---|---|---|---|
| Full Recompute | 16.45 | 1725.09 | 10132.96 |
| HBM Hit | 14.13 | 481.69 | 7701.72 |
| Hifile Hit | 16.05 | 2450.24 | 10652.99 |
| UCM Hit | 15.54 | 1447.09 | 9290.44 |

Input Length: 131072, Output: 512, Concurrency: 1

| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
|---|---|---|---|
| Full Recompute | 9.21 | 5117.95 | 9825.68 |
| HBM Hit | 9.07 | 1027.18 | 5663.57 |
| Hifile Hit | 9.25 | 5249.94 | 9975.92 |
| UCM Hit | 9.04 | 2697 | 7315.72 |
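
  • For the MLA model (GLM-4.7-Flash): The results are unexpected. TPOT is acceptable only in Scenario 1 (Full Recompute, i.e., without --enable-hierarchical-cache). With HiCache enabled (Scenarios 2 and 3), TPOT degrades in every test, and the degradation grows with input length: from 12.53 ms to about 18-19 ms at 8192 tokens, from 10.94 ms to about 40-42 ms at 32768 tokens, and from 8.71 ms to about 136 ms at 131072 tokens (roughly 15.6x). TTFT for HBM Hit and UCM Hit still improves, so the regression shows up in the decode phase.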

Input Length: 8192, Output: 512, Concurrency: 16

| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
|---|---|---|---|
| Full Recompute | 12.53 | 1114.29 | 7515.15 |
| HBM Hit | 17.79 | 420.55 | 9510.62 |
| Hifile Hit | 18.95 | 2586.86 | 12270.21 |
| UCM Hit | 18.89 | 1065.07 | 10716.84 |

Input Length: 32768, Output: 512, Concurrency: 4

| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
|---|---|---|---|
| Full Recompute | 10.94 | 1557.5 | 7147.77 |
| HBM Hit | 42.28 | 378.8 | 21986.12 |
| Hifile Hit | 40.24 | 4595.19 | 25155.9 |
| UCM Hit | 40.07 | 1302.31 | 21778.21 |

Input Length: 131072, Output: 512, Concurrency: 1

| Scenario | TPOT (ms) | TTFT (ms) | E2E (ms) |
|---|---|---|---|
| Full Recompute | 8.71 | 5263.76 | 9715.19 |
| HBM Hit | 136.13 | 1099.65 | 70663.87 |
| Hifile Hit | 135.64 | 12790.47 | 82102.65 |
| UCM Hit | 136.8 | 3295.84 | 73201.72 |
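To make the trend easier to see, the following sketch computes the TPOT slowdown of each HiCache scenario relative to Full Recompute from the GLM-4.7-Flash numbers above. The dictionary layout and the `tpot_slowdown` helper are only illustrative. Note that concurrency differs between the three workloads, so this shows a correlation with input length, not a controlled comparison.

```python
BASELINE = "Full Recompute"

# (input_len, concurrency) -> {scenario: TPOT in ms}, copied from the tables above
glm_tpot_ms = {
    (8192, 16): {"Full Recompute": 12.53, "HBM Hit": 17.79,
                 "Hifile Hit": 18.95, "UCM Hit": 18.89},
    (32768, 4): {"Full Recompute": 10.94, "HBM Hit": 42.28,
                 "Hifile Hit": 40.24, "UCM Hit": 40.07},
    (131072, 1): {"Full Recompute": 8.71, "HBM Hit": 136.13,
                  "Hifile Hit": 135.64, "UCM Hit": 136.8},
}


def tpot_slowdown(tpot_by_scenario, baseline=BASELINE):
    """Return {scenario: TPOT / baseline TPOT} for every non-baseline scenario."""
    base = tpot_by_scenario[baseline]
    return {name: tpot / base
            for name, tpot in tpot_by_scenario.items()
            if name != baseline}


for (input_len, concurrency), row in glm_tpot_ms.items():
    ratios = tpot_slowdown(row)
    summary = ", ".join(f"{name} {ratio:.2f}x" for name, ratio in ratios.items())
    print(f"input={input_len} concurrency={concurrency}: {summary}")
```

All three HiCache scenarios slow down by a similar factor at each input length, whether or not an L3 backend is configured. That suggests the cause is tied to enabling HiCache itself for this model rather than to a particular storage backend.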

Reproduction

Full Recompute:
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800

HBM Hit:
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800 \
  --enable-hierarchical-cache \
  --hicache-mem-layout page_first \
  --hicache-write-policy write_through

Hifile Hit:
export SGLANG_HICACHE_FILE_BACKEND_STORAGE_DIR=/mnt/test
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800 \
  --enable-hierarchical-cache \
  --hicache-mem-layout page_first \
  --hicache-write-policy write_through \
  --hicache-storage-backend file \
  --hicache-storage-prefetch-policy wait_complete

UCM Hit:
HICACHE_CONFIG='{
  "backend_name": "unifiedcache",
  "module_path": "ucm.integration.sglang.unifiedcache_store",
  "class_name": "UnifiedCacheStore",
  "interface_v1": 1,
  "kv_connector_extra_config": {
    "ucm_connector_name": "UcmPipelineStore",
    "ucm_connector_config": {
      "storage_backends": "/mnt/test",
      "posix_io_engine": "aio"
    }
  }
}'
sglang serve \
  --model-path /models/GLM-4.7-Flash \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --page-size 128 \
  --trust-remote-code \
  --port 7800 \
  --enable-hierarchical-cache \
  --hicache-mem-layout page_first \
  --hicache-write-policy write_through \
  --hicache-storage-backend dynamic \
  --hicache-storage-prefetch-policy wait_complete \
  --hicache-storage-backend-extra-config "$HICACHE_CONFIG"

Environment

Python: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 570.195.03
PyTorch: 2.9.1+cu129
sglang: 0.5.10
sglang-kernel: 0.4.1
flashinfer_python: 0.6.7.post2
flashinfer_cubin: 0.6.7.post2
flashinfer_jit_cache: 0.6.7.post2+cu129
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.5
fastapi: 0.135.3
huggingface_hub: 1.9.0
interegular: 0.3.3
modelscope: 1.35.3
orjson: 3.11.8
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.22
pyzmq: 27.1.0
uvicorn: 0.43.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.32
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.89.0
litellm: Module Not Found
torchcodec: 0.9.1
