Commit 8968f36

fix the benchmark
1 parent 719cec7 commit 8968f36

2 files changed

Lines changed: 155 additions & 220 deletions

benchmarks_and_experiments/coding_vs_vllm/README.md

Lines changed: 39 additions & 35 deletions
@@ -33,51 +33,55 @@ Default dataset `openai_humaneval` (small, no auth, real Python). For
 long-context coding agents, point `--dataset` at a repo-level set
 (e.g. `repobench`); the adapter pulls code text from common field names.
 
-## One GPU → one backend at a time
+## One GPU → run each backend separately, then compare
 
 vLLM's `--gpu-memory-utilization` pre-allocates most of the VRAM, so **both
-servers cannot be resident at once** on a single GPU. The script handles this:
-when you pass each backend's launch command, it **starts kvboost, benchmarks
-it, kills it, waits for the VRAM to actually free, then starts vLLM** — only
-one model is ever on the GPU.
-
-### Recommended: let the script manage server lifecycle
-
-Pass the full launch command for each backend (quote it). The script does the
-launch → bench → teardown → wait-for-VRAM-free → next sequence:
+servers can't be resident at once**. So you bench them in **separate runs**
+start one server, benchmark it (save with `--out`), stop it, start the other,
+benchmark it — then combine the two saved files into the side-by-side report
+with `--compare` (no server needed for that step).
 
+**Step 1 — kvboost.** Start its server:
 ```bash
-python benchmarks_and_experiments/coding_vs_vllm/bench_coding.py \
---dataset openai_humaneval --mode both --n 10 \
---contexts 2000 8000 16000 32000 64000 96000 \
---gpu-free-mb 2000 \
---kvboost-cmd "python -m kvboost.server --model Qwen/Qwen2.5-3B-Instruct \
---dtype float16 --recompute-strategy cacheblend_sparse --kv-cache-bits 8 \
---max-cache-bytes 4e9 --max-batch-size 1 --port 9000" \
---vllm-cmd "vllm serve Qwen/Qwen2.5-3B-Instruct --dtype float16 \
---enable-prefix-caching --gpu-memory-utilization 0.85 \
---max-model-len 32768 --port 8001"
+python -m kvboost.server --model Qwen/Qwen2.5-3B-Instruct --dtype float16 \
+--recompute-strategy cacheblend_sparse --kv-cache-bits 8 \
+--max-cache-bytes 4e9 --max-batch-size 1 --port 9000
 ```
+Then in another shell:
+```bash
+python bench_coding.py --backend kvboost --url http://localhost:9000 \
+--model Qwen/Qwen2.5-3B-Instruct \
+--dataset openai_humaneval --mode both --n 10 \
+--contexts 2000 8000 16000 32000 64000 96000 \
+--out kvboost.json
+```
+Stop the kvboost server when it finishes (frees the GPU).
 
-- Server stdout/stderr go to `./bench_server_logs/<backend>_server.log`.
-- `--gpu-free-mb` is the VRAM-used threshold (MiB) treated as "freed" between
-backends — set it a bit above your idle/driver baseline (2000 is safe on a
-dedicated card; raise if other processes share the GPU).
-- `--ready-timeout` (default 600s) is how long to wait for a server to load.
-
-### Alternative: start one server yourself, run once per backend
-
-If you'd rather manage servers manually (also one-at-a-time), omit the
-`--*-cmd` flags and use `--only`. Start kvboost, run:
+**Step 2 — vLLM.** Start its server:
 ```bash
-python bench_coding.py --only kvboost --out kvboost.json --mode both
+vllm serve Qwen/Qwen2.5-3B-Instruct --dtype float16 \
+--enable-prefix-caching --gpu-memory-utilization 0.85 \
+--max-model-len 32768 --port 8001
 ```
-stop kvboost, start vLLM, run:
+Then run the **same** workload flags (so prompts match) against it:
 ```bash
-python bench_coding.py --only vllm --out vllm.json --mode both
+python bench_coding.py --backend vllm --url http://localhost:8001 \
+--model Qwen/Qwen2.5-3B-Instruct \
+--dataset openai_humaneval --mode both --n 10 \
+--contexts 2000 8000 16000 32000 64000 96000 \
+--out vllm.json
 ```
-Each run with `--only` benchmarks just that backend (the other is never
-loaded). `--out` saves raw outcomes so you can diff the two JSONs.
+
+**Step 3 — compare** (no GPU/server needed):
+```bash
+python bench_coding.py --compare kvboost.json vllm.json
+```
+
+Each single run also prints its own (single-backend) report immediately, so
+you get kvboost's numbers before vLLM is even started. Keep
+`--dataset / --n / --n-files / --contexts / --corpus-size` identical across
+the two runs — the prompts are deterministic from the dataset, so matching
+those flags guarantees both backends saw the exact same inputs.
 
 `--mode ttft` or `--mode oom` to run a single axis.

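Reviewer note: the new Step 3 merges two saved `--out` files offline. The diff does not show the JSON schema or how `--compare` is implemented, so the following is only a minimal sketch of the idea under assumed names. It treats each file as a JSON list of records with hypothetical `context` and `ttft_s` fields, and the helper functions are illustrative rather than taken from `bench_coding.py`.

```python
import json
from statistics import median


def load_outcomes(path):
    """Load one run's saved outcomes (assumed schema: a JSON list of records)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def median_ttft_by_context(records):
    """Group records by context length and take the median TTFT per group.

    The `context` and `ttft_s` keys are assumptions, not the real schema.
    """
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["context"], []).append(rec["ttft_s"])
    return {ctx: median(vals) for ctx, vals in grouped.items()}


def compare(a_records, b_records):
    """Return (context, a_ttft, b_ttft, b/a ratio) rows for shared contexts."""
    a = median_ttft_by_context(a_records)
    b = median_ttft_by_context(b_records)
    rows = []
    for ctx in sorted(a.keys() & b.keys()):
        ratio = b[ctx] / a[ctx] if a[ctx] > 0 else float("nan")
        rows.append((ctx, a[ctx], b[ctx], ratio))
    return rows


def unmatched_contexts(a_records, b_records):
    """Contexts present in only one run, which hints that workload flags differed."""
    a = {rec["context"] for rec in a_records}
    b = {rec["context"] for rec in b_records}
    return sorted(a ^ b)


def format_report(rows, a_name="kvboost", b_name="vllm"):
    """Render the comparison rows as a plain-text side-by-side table."""
    header = f"{'context':>8}  {a_name + ' (s)':>12}  {b_name + ' (s)':>12}  {'ratio':>6}"
    lines = [header]
    for ctx, a_val, b_val, ratio in rows:
        lines.append(f"{ctx:>8}  {a_val:>12.3f}  {b_val:>12.3f}  {ratio:>6.2f}")
    return "\n".join(lines)
```

With files in that assumed shape, `print(format_report(compare(load_outcomes("kvboost.json"), load_outcomes("vllm.json"))))` would print one row per context length found in both runs. A non-empty `unmatched_contexts(...)` result suggests the two runs did not use the same `--contexts` values.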