Commit 950bc96

gemma-4-31B-it: cap max-running-requests at 64
Restores --max-running-requests 64. v0.0.194 dropped the flag, which let sglang default to 2048. That inflated CUDA-graph/buffer pre-allocation (boot headroom fell from 20.05 GB to 16.41 GB) and admitted too much concurrent prefill, causing CUDA OOM 500s under load on gpu11. Keeps the --context-length removal from v0.0.194.
1 parent d8364e0 commit 950bc96

1 file changed

Lines changed: 1 addition & 0 deletions

small-models.yaml
@@ -515,6 +515,7 @@ services:
 --reasoning-parser gemma4
 --tool-call-parser gemma4
 --mem-fraction-static 0.85
+--max-running-requests 64
 --chunked-prefill-size 8192
 --num-continuous-decode-steps 5
 --enable-mixed-chunk
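
Because the regression came from the flag silently disappearing, a small guard over the launch arguments can catch a repeat. The following is a minimal sketch, not part of this commit: `flag_value` and `check_request_cap` are hypothetical helper names, the argument string is copied from the hunk above, and the limit of 64 is the value this commit restores. It also reproduces the headroom drop quoted in the commit message.

```python
import shlex


def flag_value(args, flag):
    """Return the value given for `flag` in `args`, or None if the flag is absent.

    Handles both `--flag value` and `--flag=value` spellings.
    """
    for i, tok in enumerate(args):
        if tok == flag and i + 1 < len(args):
            return args[i + 1]
        if tok.startswith(flag + "="):
            return tok.split("=", 1)[1]
    return None


def check_request_cap(command, limit=64):
    """True only if --max-running-requests is present and does not exceed `limit`."""
    raw = flag_value(shlex.split(command), "--max-running-requests")
    if raw is None:
        # Flag missing: the server would fall back to its own default.
        return False
    return int(raw) <= limit


# Launch arguments as they appear in the hunk above.
launch_args = """
  --reasoning-parser gemma4
  --tool-call-parser gemma4
  --mem-fraction-static 0.85
  --max-running-requests 64
  --chunked-prefill-size 8192
  --num-continuous-decode-steps 5
  --enable-mixed-chunk
"""

# Boot headroom lost when the cap was dropped, from the commit message figures.
headroom_drop_gb = round(20.05 - 16.41, 2)

if __name__ == "__main__":
    print("cap ok:", check_request_cap(launch_args))
    print("headroom drop (GB):", headroom_drop_gb)
```

A check like this could run in CI against the rendered service command, so that removing the flag fails fast instead of surfacing as OOM errors under load.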
