Open
Description
In our infrastructure, we test many different models and often need to scale some of them down to 0 replicas to free up capacity for new ones, while still being able to bring replicas back quickly when needed.
At the moment, the minimum number of replicas allowed for a NIMService is 1; it would be useful to be able to scale it down to 0 as well.
With older NIM Operator versions, we could work around this limitation by setting the number of replicas to 0 at the Deployment level (kubectl scale deployment mistral-small-3-2-24b-instruct-2506 --replicas 0). With newer NIM Operator versions, this floods the NIM Operator logs with the following messages, repeated every 5 seconds:
nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"info","ts":"2025-11-04T07:48:46Z","msg":"Reconciling","controller":"nimservice","controllerGroup":"apps.nvidia.com","controllerKind":"NIMService","NIMService":{"name":"mistral-small-3-2-24b-instruct-2506","namespace":"test-ns"},"namespace":"test-ns","name":"mistral-small-3-2-24b-instruct-2506","reconcileID":"5cc948e0-5599-4696-bfd8-1ce32cf6dc1e","NIMService":"mistral-small-3-2-24b-instruct-2506"}
nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"info","ts":"2025-11-04T07:48:46Z","logger":"controllers.NIMService","msg":"Reconciling NIMService instance","nimservice":"mistral-small-3-2-24b-instruct-2506"}
nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"error","ts":"2025-11-04T07:48:46Z","msg":"GET request failed","controller":"nimservice","controllerGroup":"apps.nvidia.com","controllerKind":"NIMService","NIMService":{"name":"mistral-small-3-2-24b-instruct-2506","namespace":"test-ns"},"namespace":"test-ns","name":"mistral-small-3-2-24b-instruct-2506","reconcileID":"5cc948e0-5599-4696-bfd8-1ce32cf6dc1e","url":"http://10.150.100.224:8000/v1/models","error":"Get \"http://10.150.100.224:8000/v1/models\": dial tcp 10.150.100.224:8000: connect: network is unreachable","stacktrace":"github.com/NVIDIA/k8s-nim-operator/internal/nimmodels.doGetRequest\n\t/workspace/internal/nimmodels/models.go:86\ngithub.com/NVIDIA/k8s-nim-operator/internal/nimmodels.ListModelsV1\n\t/workspace/internal/nimmodels/models.go:110\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).getNIMModelName\n\t/workspace/internal/controller/platform/standalone/nimservice.go:636\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).updateModelStatus\n\t/workspace/internal/controller/platform/standalone/nimservice.go:605\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).reconcileNIMService\n\t/workspace/internal/controller/platform/standalone/nimservice.go:511\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*Standalone).Sync\n\t/workspace/internal/controller/platform/standalone/standalone.go:119\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller.(*NIMServiceReconciler).Reconcile\n\t/workspace/internal/controller/nimservice_controller.go:179\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/
vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:334\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:255"}
nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"error","ts":"2025-11-04T07:48:46Z","msg":"Failed to list models","controller":"nimservice","controllerGroup":"apps.nvidia.com","controllerKind":"NIMService","NIMService":{"name":"mistral-small-3-2-24b-instruct-2506","namespace":"test-ns"},"namespace":"test-ns","name":"mistral-small-3-2-24b-instruct-2506","reconcileID":"5cc948e0-5599-4696-bfd8-1ce32cf6dc1e","endpoint":"10.150.100.224:8000","error":"Get \"http://10.150.100.224:8000/v1/models\": dial tcp 10.150.100.224:8000: connect: network is unreachable","stacktrace":"github.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).getNIMModelName\n\t/workspace/internal/controller/platform/standalone/nimservice.go:638\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).updateModelStatus\n\t/workspace/internal/controller/platform/standalone/nimservice.go:605\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).reconcileNIMService\n\t/workspace/internal/controller/platform/standalone/nimservice.go:511\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*Standalone).Sync\n\t/workspace/internal/controller/platform/standalone/standalone.go:119\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller.(*NIMServiceReconciler).Reconcile\n\t/workspace/internal/controller/nimservice_controller.go:179\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:334\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/interna
l/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:255"}
nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"info","ts":"2025-11-04T07:48:46Z","msg":"WARN: Model status update failed, will retry in 5 seconds","controller":"nimservice","controllerGroup":"apps.nvidia.com","controllerKind":"NIMService","NIMService":{"name":"mistral-small-3-2-24b-instruct-2506","namespace":"test-ns"},"namespace":"test-ns","name":"mistral-small-3-2-24b-instruct-2506","reconcileID":"5cc948e0-5599-4696-bfd8-1ce32cf6dc1e","error":"Get \"http://10.150.100.224:8000/v1/models\": dial tcp 10.150.100.224:8000: connect: network is unreachable"}
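Judging from the stack traces above, the reconcile loop keeps calling the /v1/models endpoint even though no pod is running. To illustrate the behaviour we are asking for, here is a minimal sketch in Python. It is not the operator's code: DeploymentState, should_probe_models and requeue_after_seconds are hypothetical names made up for this example, and the only value taken from the logs is the 5 second retry interval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeploymentState:
    """Hypothetical view of the Deployment behind a NIMService."""
    desired_replicas: int  # what was requested, e.g. 0 after scaling down
    ready_replicas: int    # pods that are currently ready


def should_probe_models(state: DeploymentState) -> bool:
    """Probe /v1/models only when a ready pod could actually answer."""
    return state.desired_replicas > 0 and state.ready_replicas > 0


def requeue_after_seconds(state: DeploymentState, probe_ok: bool) -> Optional[int]:
    """Return a retry delay in seconds, or None when no retry is needed."""
    if state.desired_replicas == 0:
        # Scaled to zero on purpose: nothing to probe, nothing to retry.
        return None
    if not should_probe_models(state) or not probe_ok:
        # Pods are expected but not ready yet, or the probe failed:
        # retry after 5 seconds, the interval seen in the logs.
        return 5
    return None
```

With such a guard, a NIMService scaled to 0 would be treated as a valid idle state instead of a failed model status update, while a service that is still starting up would keep its current retry behaviour.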
Thank you very much in advance for your support!
Labels: enhancement (New feature or request)