Set Ray Serve min_replicas to 1 to avoid cold starts

rajsinghtech · rajsinghtech · commit 5fcd1f62c868 · 2025-11-03T07:46:56.000-06:00
Change the Ray Serve deployment's min_replicas from 0 to 1 so that one
replica always stays loaded with the model. This eliminates cold-start
latency when querying the service.

With min_replicas: 0, the service could scale to zero, and the first
request would then time out while the 20 GB model was downloaded and
loaded (~3-5 minutes). Keeping one replica active means requests are
served without waiting for a model load.
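
As a rough illustration of why the floor matters, the sketch below models
the replica-count decision as "ongoing requests divided by
target_ongoing_requests, clamped to [min_replicas, max_replicas]". This is
a simplified assumption for illustration, not Ray Serve's actual
implementation; the function name `desired_replicas` is hypothetical, and
delays such as upscale_delay_s are ignored.

```python
import math


def desired_replicas(ongoing_requests, target_ongoing_requests,
                     min_replicas, max_replicas):
    """Simplified model of an autoscaling decision (illustrative only)."""
    # Replicas needed to keep each one at or below its target load.
    raw = math.ceil(ongoing_requests / target_ongoing_requests)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# Idle service (no ongoing requests), target_ongoing_requests: 1, max_replicas: 2
before = desired_replicas(0, 1, min_replicas=0, max_replicas=2)
after = desired_replicas(0, 1, min_replicas=1, max_replicas=2)
print(f"min_replicas=0 -> {before} replicas when idle")
print(f"min_replicas=1 -> {after} replicas when idle")
```

Under this model, an idle service with min_replicas: 0 drops to zero
replicas, so the next request has to wait for a fresh replica to load the
model; with min_replicas: 1 the floor keeps one replica warm.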
diff --git a/clusters/k3s-stpetersburg/apps/ai/deepseek-ocr/rayservice.yaml b/clusters/k3s-stpetersburg/apps/ai/deepseek-ocr/rayservice.yaml
@@ -19,7 +19,7 @@ spec:
         deployments:
           - name: deepseek-ocr
             autoscaling_config:
-              min_replicas: 0
+              min_replicas: 1
               max_replicas: 2
               target_ongoing_requests: 1
               upscale_delay_s: 30