
[Feature] Add SiMM as sglang HiCache Storage backend #18016

Open

hhu-scitix wants to merge 53 commits into sgl-project:main from scitix:feature/simm

Conversation

hhu-scitix commented Jan 31, 2026

Description

SiMM (Scitix In-Memory Middleware) is a distributed, high-performance, elastic cache-acceleration layer for AI workloads.

Features

  1. Supports the 'page_first' and 'page_first_direct' memory layouts for zero-copy cache transfer.
  2. Supports NUMA-aware RDMA NIC selection (see the sketch after this list).
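
As an illustration of feature 2, here is a minimal sketch of NUMA-aware NIC selection: match a GPU's NUMA node against the NUMA node that sysfs reports for each RDMA device. This is an assumption about the general technique, not SiMM's actual implementation; the pick_nic_for_gpu helper and its fallback policy are invented for this example.

import glob
import os


def numa_node_of(sysfs_path: str) -> int:
    """Read the NUMA node reported by sysfs for a device; -1 means unknown."""
    try:
        with open(os.path.join(sysfs_path, "device", "numa_node")) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return -1


def pick_nic_for_gpu(gpu_numa_node: int) -> str:
    """Prefer an RDMA device on the same NUMA node as the GPU, else fall back."""
    nics = sorted(glob.glob("/sys/class/infiniband/*"))
    for nic in nics:
        if numa_node_of(nic) == gpu_numa_node:
            return os.path.basename(nic)
    # No NUMA-local NIC found: fall back to the first device, if any.
    return os.path.basename(nics[0]) if nics else ""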

Benchmarking and Profiling

Tested DeepSeek R1 on 8x H200 141GB GPUs with an 8x400Gbps RoCE network (MT4129 NICs).

GPU Driver: 570.86.15, CUDA version: 12.9

Benchmark script: benchmark/hicache/bench_multiturn.py.

client:

python3 bench_multiturn.py \
  --num-clients 128 --max-parallel * \
  --request-length 8000 --output-length 200 \
  --host 127.0.0.1 --port 8080 --disable-auto-run \
  --disable-random-sample \
  --model-path /models/preset/deepseek-ai/DeepSeek-R1/v1.0 \
  --num-rounds=* --seed=* \
  --ready-queue-policy=fifo \
  --sub-question-input-length=128

server:

python3 -m sglang.launch_server \
  --model /models/preset/deepseek-ai/DeepSeek-R1/v1.0 \
  --trust-remote-code \
  --tp 8 --mem-fraction-static 0.75 \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 1.1 \
  --hicache-mem-layout page_first_direct \
  --hicache-io-backend direct \
  --hicache-storage-backend simm \
  --hicache-write-policy write_through \
  --hicache-storage-prefetch-policy timeout \
  --hicache-storage-backend-extra-config '{"manager_address":"0.0.0.0:30001"}' \
  --port 8080
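
SiMM-specific settings reach the backend through --hicache-storage-backend-extra-config, which is passed as a JSON string. Below is a minimal, hypothetical sketch of how such a blob could be turned into backend settings; the SimmConfig name and the default address are assumptions, and only the manager_address key comes from the command above.

import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimmConfig:
    # Hypothetical default; only manager_address appears in the command above.
    manager_address: str = "127.0.0.1:30001"


def parse_extra_config(extra_config: Optional[str]) -> SimmConfig:
    """Parse the JSON string from --hicache-storage-backend-extra-config."""
    if not extra_config:
        return SimmConfig()
    raw = json.loads(extra_config)
    return SimmConfig(
        manager_address=raw.get("manager_address", SimmConfig.manager_address)
    )


if __name__ == "__main__":
    print(parse_extra_config('{"manager_address": "0.0.0.0:30001"}'))
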
| rounds | parallel | Req throughput (req/s), SiMM | Req throughput (req/s), GPU | Input throughput (token/s), SiMM | Input throughput (token/s), GPU | Output throughput (token/s), SiMM | Output throughput (token/s), GPU | TTFT (s), SiMM | E2E latency (s), SiMM | TTFT (s), GPU | E2E latency (s), GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 0.81 | 0.80 | 6856.36 | 6810.02 | 161.14 | 159.83 | 0.43 | 3.20 | 0.50 | 3.45 |
| 3 | 8 | 0.97 | 0.90 | 8250.51 | 7670.94 | 193.62 | 180.09 | 0.47 | 3.84 | 0.56 | 4.05 |
| 3 | 16 | 0.97 | 1.02 | 8236.79 | 8653.88 | 193.49 | 203.30 | 0.49 | 3.73 | 0.63 | 4.92 |
| 10 | 4 | 0.88 | 0.80 | 8153.91 | 7767.07 | 168.62 | 160.61 | 0.41 | 3.11 | 0.56 | 3.57 |
| 10 | 8 | 0.97 | 0.91 | 9353.14 | 8819.79 | 193.55 | 182.40 | 0.43 | 3.65 | 0.66 | 4.51 |
| 10 | 16 | 1.01 | 0.98 | 9733.29 | 9459.86 | 201.32 | 195.47 | 0.50 | 4.08 | 0.72 | 5.23 |

'GPU' denotes the baseline that uses only the GPU cache (HiCache disabled).

Checklist

