@@ -33,51 +33,55 @@ Default dataset `openai_humaneval` (small, no auth, real Python). For
 long-context coding agents, point `--dataset` at a repo-level set
 (e.g. `repobench`); the adapter pulls code text from common field names.
 
-## One GPU → one backend at a time
+## One GPU → run each backend separately, then compare
 
 vLLM's `--gpu-memory-utilization` pre-allocates most of the VRAM, so **both
-servers cannot be resident at once** on a single GPU. The script handles this:
-when you pass each backend's launch command, it **starts kvboost, benchmarks
-it, kills it, waits for the VRAM to actually free, then starts vLLM** — only
-one model is ever on the GPU.
-
-### Recommended: let the script manage server lifecycle
-
-Pass the full launch command for each backend (quote it). The script does the
-launch → bench → teardown → wait-for-VRAM-free → next sequence:
+servers can't be resident at once**. So you bench them in **separate runs** —
+start one server, benchmark it (save with `--out`), stop it, start the other,
+benchmark it — then combine the two saved files into the side-by-side report
+with `--compare` (no server needed for that step).
 
+**Step 1 — kvboost.** Start its server:
 ```bash
-python benchmarks_and_experiments/coding_vs_vllm/bench_coding.py \
-  --dataset openai_humaneval --mode both --n 10 \
-  --contexts 2000 8000 16000 32000 64000 96000 \
-  --gpu-free-mb 2000 \
-  --kvboost-cmd "python -m kvboost.server --model Qwen/Qwen2.5-3B-Instruct \
-  --dtype float16 --recompute-strategy cacheblend_sparse --kv-cache-bits 8 \
-  --max-cache-bytes 4e9 --max-batch-size 1 --port 9000" \
-  --vllm-cmd "vllm serve Qwen/Qwen2.5-3B-Instruct --dtype float16 \
-  --enable-prefix-caching --gpu-memory-utilization 0.85 \
-  --max-model-len 32768 --port 8001"
+python -m kvboost.server --model Qwen/Qwen2.5-3B-Instruct --dtype float16 \
+  --recompute-strategy cacheblend_sparse --kv-cache-bits 8 \
+  --max-cache-bytes 4e9 --max-batch-size 1 --port 9000
 ```
+Then in another shell:
+```bash
+python bench_coding.py --backend kvboost --url http://localhost:9000 \
+  --model Qwen/Qwen2.5-3B-Instruct \
+  --dataset openai_humaneval --mode both --n 10 \
+  --contexts 2000 8000 16000 32000 64000 96000 \
+  --out kvboost.json
+```
+Stop the kvboost server when it finishes (frees the GPU).
 
-- Server stdout/stderr go to `./bench_server_logs/<backend>_server.log`.
-- `--gpu-free-mb` is the VRAM-used threshold (MiB) treated as "freed" between
-  backends — set it a bit above your idle/driver baseline (2000 is safe on a
-  dedicated card; raise if other processes share the GPU).
-- `--ready-timeout` (default 600s) is how long to wait for a server to load.
-
-### Alternative: start one server yourself, run once per backend
-
-If you'd rather manage servers manually (also one-at-a-time), omit the
-`--*-cmd` flags and use `--only`. Start kvboost, run:
+**Step 2 — vLLM.** Start its server:
 ```bash
-python bench_coding.py --only kvboost --out kvboost.json --mode both
+vllm serve Qwen/Qwen2.5-3B-Instruct --dtype float16 \
+  --enable-prefix-caching --gpu-memory-utilization 0.85 \
+  --max-model-len 32768 --port 8001
 ```
-stop kvboost, start vLLM, run:
+Then run the **same** workload flags (so prompts match) against it:
 ```bash
-python bench_coding.py --only vllm --out vllm.json --mode both
+python bench_coding.py --backend vllm --url http://localhost:8001 \
+  --model Qwen/Qwen2.5-3B-Instruct \
+  --dataset openai_humaneval --mode both --n 10 \
+  --contexts 2000 8000 16000 32000 64000 96000 \
+  --out vllm.json
 ```
-Each run with `--only` benchmarks just that backend (the other is never
-loaded). `--out` saves raw outcomes so you can diff the two JSONs.
+
+**Step 3 — compare** (no GPU/server needed):
+```bash
+python bench_coding.py --compare kvboost.json vllm.json
+```
+
+Each single run also prints its own (single-backend) report immediately, so
+you get kvboost's numbers before vLLM is even started. Keep
+`--dataset / --n / --n-files / --contexts / --corpus-size` identical across
+the two runs — the prompts are deterministic from the dataset, so matching
+those flags guarantees both backends saw the exact same inputs.
 
 `--mode ttft` or `--mode oom` to run a single axis.
 
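The new README asks the reader to keep the workload flags identical across the two runs by hand. One way to avoid drift is to build both benchmark commands from a single shared flag list. The sketch below is not part of this change: the names `SHARED_FLAGS`, `BACKENDS`, and `bench_command` are hypothetical, and it uses only the flags, URLs, and file names that appear in the Step 1 and Step 2 commands.

```python
import shlex

# Flags that must match across both runs (copied from the README commands).
SHARED_FLAGS = [
    "--model", "Qwen/Qwen2.5-3B-Instruct",
    "--dataset", "openai_humaneval",
    "--mode", "both",
    "--n", "10",
    "--contexts", "2000", "8000", "16000", "32000", "64000", "96000",
]

# Per-backend settings: (server URL, output file), also from the README.
BACKENDS = {
    "kvboost": ("http://localhost:9000", "kvboost.json"),
    "vllm": ("http://localhost:8001", "vllm.json"),
}


def bench_command(backend):
    """Return the argv list for one single-backend benchmark run."""
    url, out = BACKENDS[backend]
    return [
        "python", "bench_coding.py",
        "--backend", backend,
        "--url", url,
        *SHARED_FLAGS,
        "--out", out,
    ]


if __name__ == "__main__":
    # Print each command; run it only while the matching server is up.
    for name in BACKENDS:
        print(shlex.join(bench_command(name)))
```

Run the printed commands one at a time, each against its own running server, then combine the results with `python bench_coding.py --compare kvboost.json vllm.json` as in Step 3.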