Commit 000a7d1
Remove vLLM max-num-seqs=5 bottleneck and align WVA thresholds to defaults
The benchmark deployed vLLM with --max-num-seqs=5, capping each pod at 5
concurrent requests. This produced 2-3% KV cache utilization and ~1 RPS
instead of the expected 60-100% KV cache utilization and ~9 RPS. Removing
the flag lets vLLM use its default (256), matching the colleague's
benchmark configuration.

Also aligns the WVA saturation thresholds (kvSpareTrigger, queueSpareTrigger)
with the chart defaults (0.1 and 3, respectively) to match the colleague's setup.
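
The effect of the concurrency cap can be illustrated with Little's law (throughput is at most concurrency divided by latency). The sketch below is only a rough model, not part of the benchmark: the function name and the 5-second average request latency are assumptions, with the latency picked so that a cap of 5 yields the ~1 RPS observed above. Real latency changes with batch size, so the second figure is a ceiling on what the scheduler admits, not a prediction of achieved throughput.

```python
def max_throughput_rps(max_concurrent: int, avg_latency_s: float) -> float:
    """Upper bound on requests/second from Little's law: concurrency / latency."""
    if avg_latency_s <= 0:
        raise ValueError("avg_latency_s must be positive")
    return max_concurrent / avg_latency_s


# Assumption: hypothetical average end-to-end latency per request, in seconds.
ASSUMED_LATENCY_S = 5.0

# With --max-num-seqs=5, each pod admits at most 5 requests at a time.
capped_rps = max_throughput_rps(5, ASSUMED_LATENCY_S)

# With the default of 256, the admission cap is far above the ~9 RPS target.
default_cap_rps = max_throughput_rps(256, ASSUMED_LATENCY_S)

print(f"cap=5:   at most {capped_rps:.1f} RPS per pod")
print(f"cap=256: at most {default_cap_rps:.1f} RPS per pod")
```

Under this assumption the cap of 5 is the binding limit at roughly 1 RPS, whereas with 256 the limit moves to GPU compute and KV cache capacity, which is where the saturation thresholds are meant to act.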

Made-with: Cursor

1 parent e64fe7a
1 file changed, 8 insertions(+), 9 deletions(-)
Diff body not preserved. Hunk ranges: @@ -287,8 +287,8 @@, @@ -302,8 +302,8 @@, @@ -550,12 +550,11 @@, @@ -589,8 +588,8 @@