Commit 950bc96
gemma-4-31B-it: cap max-running-requests at 64
Restores --max-running-requests 64. v0.0.194 dropped the flag, so sglang
fell back to its default of 2048. That inflated CUDA-graph/buffer
pre-allocation (boot headroom fell from 20.05 GB to 16.41 GB) and admitted
too much concurrent prefill, causing CUDA OOM errors (HTTP 500s) under load
on gpu11. Keeps the --context-length removal from v0.0.194.

1 parent d8364e0 commit 950bc96
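As a rough illustration of the change described in the message, the sketch below assembles an sglang launch command with the restored cap and computes the headroom drop quoted above. It is not the file touched by this commit: `MODEL_PATH`, `build_launch_args`, and `headroom_drop_gb` are hypothetical names, and the real launch configuration is assumed to carry other flags that are not shown here.

```python
import shlex

# Hypothetical placeholder: the real model path is not shown in this commit.
MODEL_PATH = "<model-path-for-gemma-4-31B-it>"


def build_launch_args(max_running_requests=64):
    """Return an sglang server launch command as an argument list.

    Only the flag discussed in the commit message is set explicitly;
    --context-length is deliberately left out, matching the message.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        # Cap concurrent requests instead of relying on the default.
        "--max-running-requests", str(max_running_requests),
    ]


def headroom_drop_gb(before_gb, after_gb):
    """Boot headroom lost, in GB, rounded to two decimals."""
    return round(before_gb - after_gb, 2)


if __name__ == "__main__":
    print(shlex.join(build_launch_args()))
    # Figures taken from the commit message: 20.05 GB before, 16.41 GB after.
    print(headroom_drop_gb(20.05, 16.41))
```

Passing the cap explicitly keeps the setting visible in the launch configuration, so a later edit that drops the flag shows up in review rather than silently reverting to the default.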
1 file changed
Lines changed: 1 addition & 0 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 515 | 515 | |
| 516 | 516 | |
| 517 | 517 | |
| | 518 | + |
| 518 | 519 | |
| 519 | 520 | |
| 520 | 521 | |