
[Feature] Add SiMM as sglang HiCache Storage backend #18016

Open

hhu-scitix wants to merge 53 commits into sgl-project:main from scitix:feature/simm

Conversation

hhu-scitix commented Jan 31, 2026

Description

SiMM (Scitix In-Memory Middleware) is a distributed, high-performance, elastic cache-acceleration layer for AI workloads.

Features

  1. Supports the 'page_first' and 'page_first_direct' memory layouts for zero-copy cache transfer.
  2. Supports NUMA-aware RDMA NIC selection (see the sketch after this list).
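
As an illustration of feature 2, here is a minimal sketch of NUMA-aware NIC selection: match a GPU's NUMA node against the NUMA node that sysfs reports for each RDMA device. This is an assumption about the general technique, not SiMM's actual implementation; the pick_nic_for_gpu helper and its fallback policy are invented for this example.

import glob
import os


def numa_node_of(sysfs_path: str) -> int:
    """Read the NUMA node reported by sysfs for a device; -1 means unknown."""
    try:
        with open(os.path.join(sysfs_path, "device", "numa_node")) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return -1


def pick_nic_for_gpu(gpu_numa_node: int) -> str:
    """Prefer an RDMA device on the same NUMA node as the GPU, else fall back."""
    nics = sorted(glob.glob("/sys/class/infiniband/*"))
    for nic in nics:
        if numa_node_of(nic) == gpu_numa_node:
            return os.path.basename(nic)
    # No NUMA-local NIC found: fall back to the first device, if any.
    return os.path.basename(nics[0]) if nics else ""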

Benchmarking and Profiling

Tested DeepSeek R1 on 8x H200 141GB GPUs with an 8x400Gbps RoCE network (MT4129 NICs).

GPU Driver: 570.86.15, CUDA version: 12.9

Benchmark script: benchmark/hicache/bench_multiturn.py.

client:

python3 bench_multiturn.py \
  --num-clients 128 --max-parallel * \
  --request-length 8000 --output-length 200 \
  --host 127.0.0.1 --port 8080 --disable-auto-run \
  --disable-random-sample \
  --model-path /models/preset/deepseek-ai/DeepSeek-R1/v1.0 \
  --num-rounds=* --seed=* \
  --ready-queue-policy=fifo \
  --sub-question-input-length=128

server:

python3 -m sglang.launch_server \
  --model /models/preset/deepseek-ai/DeepSeek-R1/v1.0 \
  --trust-remote-code \
  --tp 8 --mem-fraction-static 0.75 \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 1.1 \
  --hicache-mem-layout page_first_direct \
  --hicache-io-backend direct \
  --hicache-storage-backend simm \
  --hicache-write-policy write_through \
  --hicache-storage-prefetch-policy timeout \
  --hicache-storage-backend-extra-config '{"manager_address":"0.0.0.0:30001"}' \
  --port 8080
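
SiMM-specific settings reach the backend through --hicache-storage-backend-extra-config, which is passed as a JSON string. Below is a minimal, hypothetical sketch of how such a blob could be turned into backend settings; the SimmConfig name and the default address are assumptions, and only the manager_address key comes from the command above.

import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimmConfig:
    # Hypothetical default; only manager_address appears in the command above.
    manager_address: str = "127.0.0.1:30001"


def parse_extra_config(extra_config: Optional[str]) -> SimmConfig:
    """Parse the JSON string from --hicache-storage-backend-extra-config."""
    if not extra_config:
        return SimmConfig()
    raw = json.loads(extra_config)
    return SimmConfig(
        manager_address=raw.get("manager_address", SimmConfig.manager_address)
    )


if __name__ == "__main__":
    print(parse_extra_config('{"manager_address": "0.0.0.0:30001"}'))
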
| rounds | parallel | Req throughput (req/s), SiMM | Req throughput (req/s), GPU | Input throughput (token/s), SiMM | Input throughput (token/s), GPU | Output throughput (token/s), SiMM | Output throughput (token/s), GPU | TTFT (s), SiMM | E2E latency (s), SiMM | TTFT (s), GPU | E2E latency (s), GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 0.81 | 0.80 | 6856.36 | 6810.02 | 161.14 | 159.83 | 0.43 | 3.20 | 0.50 | 3.45 |
| 3 | 8 | 0.97 | 0.90 | 8250.51 | 7670.94 | 193.62 | 180.09 | 0.47 | 3.84 | 0.56 | 4.05 |
| 3 | 16 | 0.97 | 1.02 | 8236.79 | 8653.88 | 193.49 | 203.30 | 0.49 | 3.73 | 0.63 | 4.92 |
| 10 | 4 | 0.88 | 0.80 | 8153.91 | 7767.07 | 168.62 | 160.61 | 0.41 | 3.11 | 0.56 | 3.57 |
| 10 | 8 | 0.97 | 0.91 | 9353.14 | 8819.79 | 193.55 | 182.40 | 0.43 | 3.65 | 0.66 | 4.51 |
| 10 | 16 | 1.01 | 0.98 | 9733.29 | 9459.86 | 201.32 | 195.47 | 0.50 | 4.08 | 0.72 | 5.23 |

'GPU' denotes the baseline that uses only the GPU cache (HiCache disabled).

Checklist

