Commit b8f1f8f

start setup
1 parent 8968f36 commit b8f1f8f

3 files changed

Lines changed: 132 additions & 5 deletions

benchmarks_and_experiments/coding_vs_vllm/README.md

Lines changed: 5 additions & 5 deletions
@@ -41,13 +41,13 @@ start one server, benchmark it (save with `--out`), stop it, start the other,
 benchmark it — then combine the two saved files into the side-by-side report
 with `--compare` (no server needed for that step).
 
-**Step 1 — kvboost.** Start its server:
+**Step 1 — kvboost.** Start its server (best setup — see `start_kvboost.sh`):
 ```bash
-python -m kvboost.server --model Qwen/Qwen2.5-3B-Instruct --dtype float16 \
-    --recompute-strategy cacheblend_sparse --kv-cache-bits 8 \
-    --max-cache-bytes 4e9 --max-batch-size 1 --port 9000
+./start_kvboost.sh   # MODEL=... PORT=... MAX_CACHE_BYTES=... to override
 ```
-Then in another shell:
+That runs kvboost with `cacheblend_sparse` (faithful selective recompute),
+int8 KV, and the OOM planner — the features the benchmark measures. Then in
+another shell:
 ```bash
 python bench_coding.py --backend kvboost --url http://localhost:9000 \
     --model Qwen/Qwen2.5-3B-Instruct \
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+#!/usr/bin/env bash
+# Launch kvboost in its BEST setup for the coding benchmark — showcases the
+# features the benchmark measures: KV reuse (faster TTFT) + OOM recovery, with
+# the recent correctness/perf fixes all active.
+#
+# Run this, then in another shell:
+#   python bench_coding.py --backend kvboost --url http://localhost:9000 \
+#       --model "$MODEL" --mode both --out kvboost.json
+# Stop it (Ctrl-C) before launching vLLM — one model fits the GPU at a time.
+#
+# Override via env: MODEL=... PORT=... MAX_CACHE_BYTES=... ./start_kvboost.sh
+
+set -euo pipefail
+
+MODEL="${MODEL:-Qwen/Qwen2.5-3B-Instruct}"
+PORT="${PORT:-9000}"
+# KV-cache budget for cross-request chunk reuse. Size to (free VRAM after
+# weights). On a 14.6 GiB card with a 3B fp16 model (~6 GiB) → ~4 GiB leaves
+# headroom for prefill activations + the live request. Lower for the OOM-
+# stress run to make the planner's adaptation more visible (e.g. 1.5e9).
+MAX_CACHE_BYTES="${MAX_CACHE_BYTES:-4e9}"
+SAFETY_MARGIN="${SAFETY_MARGIN:-0.15}"
+
+echo "kvboost (best setup)"
+echo " model: $MODEL"
+echo " port: $PORT"
+echo " recompute: cacheblend_sparse (faithful selective recompute)"
+echo " kv-cache-bits: 8 (int8 KV → 2× reuse capacity)"
+echo " max-cache-bytes: $MAX_CACHE_BYTES"
+echo " oom planning: on (safety_margin=$SAFETY_MARGIN)"
+echo
+
+# Why each flag:
+#   --recompute-strategy cacheblend_sparse
+#       Faithful CacheBlend: recompute only high-deviation tokens layer-by-
+#       layer (paper's 2.2-3.3× TTFT), not the full-forward variant. This is
+#       the "faster TTFT on reused context" feature. Falls back to plain
+#       cacheblend automatically on unsupported architectures.
+#   --kv-cache-bits 8
+#       int8 KV cache: ~2× the cached-chunk capacity (more cross-request
+#       reuse) and lower memory pressure, negligible quality cost.
+#   --max-cache-bytes
+#       Cross-request chunk-cache budget — bigger = more reuse, bounded by VRAM.
+#   OOM planner (on by default) + --planner-safety-margin
+#       Per-request peak prediction → picks chunk_size/kv_bits that fit, or a
+#       clean HTTP 413. This is the "OOM recovery" feature. Add --auto-truncate
+#       to truncate-and-complete oversized prompts instead of 413.
+#   --max-batch-size 1
+#       The benchmark replays sequentially (single GPU worker); 1 avoids
+#       pointless batch-window latency. Raise for concurrent throughput tests.
+#   (automatic, no flag: O(n) incremental detok, chunked CacheBlend forward,
+#    streaming usage emission for input-throughput, planner cost probe.)
+exec python -m kvboost.server \
+    --model "$MODEL" \
+    --dtype float16 \
+    --recompute-strategy cacheblend_sparse \
+    --kv-cache-bits 8 \
+    --max-cache-bytes "$MAX_CACHE_BYTES" \
+    --planner-safety-margin "$SAFETY_MARGIN" \
+    --max-batch-size 1 \
+    --host 0.0.0.0 \
+    --port "$PORT"
+
+# ── Optional add-ons (uncomment to enable) ───────────────────────────────────
+# Speculative decoding to lift DECODE throughput (where vLLM's continuous
+# batching otherwise leads). Needs a small same-family draft model and ~1 GiB
+# extra VRAM; --speculative-tree turns on the SpecBlock-inspired tree variant
+# with cost-aware per-request mode selection:
+#   --speculative-draft-model Qwen/Qwen2.5-0.5B-Instruct \
+#   --speculative-tree \
+#
+# Oversized-prompt policy for the OOM ramp: complete-by-truncation instead of
+# a clean 413 reject:
+#   --auto-truncate
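
The `--kv-cache-bits 8` note in the kvboost launch script claims roughly 2× cached-chunk capacity. One rough way to see where that factor comes from is to count KV bytes per token and divide the cache budget by that figure. The sketch below is not part of the commit. The helper names are hypothetical, and the model shape (36 layers, 2 KV heads, head dimension 128) is an assumption for Qwen2.5-3B that should be checked against the model's `config.json`. It also ignores quantization scales and any per-chunk metadata kvboost may store, so the token counts are best read as upper bounds.

```bash
#!/usr/bin/env bash
# Hypothetical sizing helpers (illustration only, not part of kvboost).

# Bytes of KV cache per token: one K and one V vector per layer per KV head.
# Args: layers kv_heads head_dim bits_per_element
kv_bytes_per_token() {
  echo $(( 2 * $1 * $2 * $3 * $4 / 8 ))
}

# Whole tokens that fit in a cache budget (integer division).
# Args: budget_bytes bytes_per_token
cache_capacity_tokens() {
  echo $(( $1 / $2 ))
}

BUDGET=4000000000          # the script's 4e9 default, written as an integer
# ASSUMED model shape; verify against the model's config.json.
LAYERS=36
KV_HEADS=2
HEAD_DIM=128

fp16=$(kv_bytes_per_token "$LAYERS" "$KV_HEADS" "$HEAD_DIM" 16)
int8=$(kv_bytes_per_token "$LAYERS" "$KV_HEADS" "$HEAD_DIM" 8)
echo "fp16: $fp16 B/token, $(cache_capacity_tokens "$BUDGET" "$fp16") tokens"
echo "int8: $int8 B/token, $(cache_capacity_tokens "$BUDGET" "$int8") tokens"
```

Under these assumptions the int8 cache needs half the bytes per token, so the same `MAX_CACHE_BYTES` holds about twice as many cached tokens (roughly 217k versus 108k here). Any overhead for quantization scales would shave a little off that ratio.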
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Launch vLLM in its USUAL serving setup for the coding benchmark — the
+# standard OpenAI server with prefix caching (vLLM's cross-request reuse) and
+# continuous batching (its default). Matched model + dtype to kvboost so the
+# comparison is fair.
+#
+# Run this AFTER stopping the kvboost server (one model fits the GPU at a
+# time), then in another shell:
+#   python bench_coding.py --backend vllm --url http://localhost:8001 \
+#       --model "$MODEL" --mode both --out vllm.json
+#   # ... use the SAME --dataset/--n/--n-files/--contexts/--corpus-size as the
+#   # kvboost run so both backends see identical prompts.
+#
+# Override via env: MODEL=... PORT=... GPU_MEM_UTIL=... MAX_MODEL_LEN=...
+
+set -euo pipefail
+
+MODEL="${MODEL:-Qwen/Qwen2.5-3B-Instruct}"
+PORT="${PORT:-8001}"
+# vLLM pre-allocates this fraction of total VRAM for weights + its paged KV
+# pool. 0.85 is the common production value.
+GPU_MEM_UTIL="${GPU_MEM_UTIL:-0.85}"
+# Max admitted context. 32768 covers the throughput/TTFT workload. For the OOM
+# ramp: a HIGH value (e.g. 131072) admits long prompts so they hit the runtime
+# KV ceiling (real OOM); a LOW value makes vLLM reject over-length prompts with
+# a graceful 400 instead (the benchmark scores that as success, not failure).
+MAX_MODEL_LEN="${MAX_MODEL_LEN:-32768}"
+
+echo "vLLM (usual setup)"
+echo " model: $MODEL"
+echo " port: $PORT"
+echo " prefix caching: on (vLLM cross-request reuse)"
+echo " gpu-memory-utilization: $GPU_MEM_UTIL"
+echo " max-model-len: $MAX_MODEL_LEN"
+echo
+
+# Why each flag:
+#   --enable-prefix-caching    vLLM's reuse mechanism — the matched counterpart
+#                              to kvboost's chunk-reuse/CacheBlend (reuses an
+#                              exact shared *prefix* across requests).
+#   --gpu-memory-utilization   standard memory budget; matched to leave the same
+#                              class of headroom kvboost gets.
+#   --max-model-len            admitted context length (see note above re: OOM).
+#   --dtype float16            matched to kvboost.
+# Continuous batching is vLLM's default and stays on — it's why vLLM usually
+# leads raw decode throughput; the benchmark reports that honestly.
+exec vllm serve "$MODEL" \
+    --dtype float16 \
+    --enable-prefix-caching \
+    --gpu-memory-utilization "$GPU_MEM_UTIL" \
+    --max-model-len "$MAX_MODEL_LEN" \
+    --host 0.0.0.0 \
+    --port "$PORT"
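
The vLLM launch script notes that `--gpu-memory-utilization` pre-allocates a fraction of total VRAM for weights plus the paged KV pool. A back-of-the-envelope estimate of what is left for that pool can help when judging whether the two backends get a comparable memory budget. This is a sketch, not part of the commit: `vllm_kv_pool_gib` is a hypothetical helper, and the 14.6 GiB card and ~6 GiB weight figures are borrowed from the kvboost script's comment. It ignores activation memory and other runtime overhead, so the real pool will be somewhat smaller.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only, not part of vLLM or the commit).
# Upper-bound estimate of the paged KV pool, in GiB:
#   total_vram * gpu_memory_utilization - weights
# Args: total_gib utilization weights_gib
vllm_kv_pool_gib() {
  awk -v t="$1" -v u="$2" -v w="$3" 'BEGIN {
    pool = t * u - w
    if (pool < 0) pool = 0   # weights alone exceed the memory budget
    printf "%.2f\n", pool
  }'
}

# Example with the figures quoted in the kvboost script's comment.
vllm_kv_pool_gib 14.6 0.85 6
```

With those inputs the estimate comes out at roughly 6.4 GiB before activations, against kvboost's default 4e9-byte chunk cache. The gap is worth keeping in mind when reading the side-by-side report, or when adjusting `GPU_MEM_UTIL` or `MAX_CACHE_BYTES`.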
