gemma-4-31B-it: cap max-running-requests at 64

Evrard-Nil · Evrard-Nil · commit 950bc967a6c8 · 2026-05-27T10:46:58.000+02:00
Restores --max-running-requests 64. v0.0.194 dropped the flag, so sglang
fell back to its default of 2048. That inflated CUDA-graph/buffer
pre-allocation (boot headroom fell from 20.05 GB to 16.41 GB) and
admitted too much concurrent prefill, causing CUDA OOM 500s under load
on gpu11. Keeps the --context-length removal from v0.0.194.
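
Because the regression came from the flag silently disappearing in a release, a small guard could catch it before deploy. The following is a minimal sketch, not part of this commit: the helper name max_running_requests and the FLAGS string are hypothetical, and the flags in FLAGS are copied from the hunk in this commit with the rest of the launch command omitted. It also works out the boot headroom lost while the flag was missing.

```python
import shlex
from typing import Optional


def max_running_requests(command: str) -> Optional[int]:
    """Return the --max-running-requests value in a launch command, or None if absent.

    Hypothetical helper for illustration; accepts both "--flag value" and "--flag=value".
    """
    args = shlex.split(command)  # newlines count as whitespace, so multi-line commands work
    for i, arg in enumerate(args):
        if arg == "--max-running-requests" and i + 1 < len(args):
            return int(args[i + 1])
        if arg.startswith("--max-running-requests="):
            return int(arg.split("=", 1)[1])
    return None


# Flags copied from the hunk in this commit; the remainder of the command is omitted.
FLAGS = """
--mem-fraction-static 0.85
--max-running-requests 64
--chunked-prefill-size 8192
"""

# Boot headroom lost while the flag was missing, from the figures in the message above.
HEADROOM_DROP_GB = round(20.05 - 16.41, 2)

if __name__ == "__main__":
    cap = max_running_requests(FLAGS)
    if cap is None:
        raise SystemExit("--max-running-requests is missing; sglang would use its default")
    print(f"cap={cap}, headroom drop={HEADROOM_DROP_GB} GB")
```

Run against the real launch command in CI, a check like this would fail loudly if a later release drops the cap again.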
diff --git a/small-models.yaml b/small-models.yaml
@@ -515,6 +515,7 @@ services:
         --reasoning-parser gemma4
         --tool-call-parser gemma4
         --mem-fraction-static 0.85
+        --max-running-requests 64
         --chunked-prefill-size 8192
         --num-continuous-decode-steps 5
         --enable-mixed-chunk