Add support for scaling down NIMServices to 0 replicas #710

@tgdfool2

Description

In our infrastructure, we are testing different models and often need to scale some of them down to 0 replicas to free up capacity for new ones, while still being able to bring replicas back up quickly when needed.

At the moment, the minimum number of replicas allowed for a NIMService is 1; it would be nice to be able to scale it down to 0 as well.

With older NIM Operator versions, we could work around this by setting the number of replicas to 0 at the Deployment level (kubectl scale deployment mistral-small-3-2-24b-instruct-2506 --replicas 0). With newer NIM Operator versions, however, this produces a lot of noise in the NIM Operator logs, with the following entries repeated every 5 seconds:

nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"info","ts":"2025-11-04T07:48:46Z","msg":"Reconciling","controller":"nimservice","controllerGroup":"apps.nvidia.com","controllerKind":"NIMService","NIMService":{"name":"mistral-small-3-2-24b-instruct-2506","namespace":"test-ns"},"namespace":"test-ns","name":"mistral-small-3-2-24b-instruct-2506","reconcileID":"5cc948e0-5599-4696-bfd8-1ce32cf6dc1e","NIMService":"mistral-small-3-2-24b-instruct-2506"}
nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"info","ts":"2025-11-04T07:48:46Z","logger":"controllers.NIMService","msg":"Reconciling NIMService instance","nimservice":"mistral-small-3-2-24b-instruct-2506"}
nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"error","ts":"2025-11-04T07:48:46Z","msg":"GET request failed","controller":"nimservice","controllerGroup":"apps.nvidia.com","controllerKind":"NIMService","NIMService":{"name":"mistral-small-3-2-24b-instruct-2506","namespace":"test-ns"},"namespace":"test-ns","name":"mistral-small-3-2-24b-instruct-2506","reconcileID":"5cc948e0-5599-4696-bfd8-1ce32cf6dc1e","url":"http://10.150.100.224:8000/v1/models","error":"Get \"http://10.150.100.224:8000/v1/models\": dial tcp 10.150.100.224:8000: connect: network is unreachable","stacktrace":"github.com/NVIDIA/k8s-nim-operator/internal/nimmodels.doGetRequest\n\t/workspace/internal/nimmodels/models.go:86\ngithub.com/NVIDIA/k8s-nim-operator/internal/nimmodels.ListModelsV1\n\t/workspace/internal/nimmodels/models.go:110\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).getNIMModelName\n\t/workspace/internal/controller/platform/standalone/nimservice.go:636\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).updateModelStatus\n\t/workspace/internal/controller/platform/standalone/nimservice.go:605\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).reconcileNIMService\n\t/workspace/internal/controller/platform/standalone/nimservice.go:511\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*Standalone).Sync\n\t/workspace/internal/controller/platform/standalone/standalone.go:119\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller.(*NIMServiceReconciler).Reconcile\n\t/workspace/internal/controller/nimservice_controller.go:179\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:334\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:255"}
nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"error","ts":"2025-11-04T07:48:46Z","msg":"Failed to list models","controller":"nimservice","controllerGroup":"apps.nvidia.com","controllerKind":"NIMService","NIMService":{"name":"mistral-small-3-2-24b-instruct-2506","namespace":"test-ns"},"namespace":"test-ns","name":"mistral-small-3-2-24b-instruct-2506","reconcileID":"5cc948e0-5599-4696-bfd8-1ce32cf6dc1e","endpoint":"10.150.100.224:8000","error":"Get \"http://10.150.100.224:8000/v1/models\": dial tcp 10.150.100.224:8000: connect: network is unreachable","stacktrace":"github.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).getNIMModelName\n\t/workspace/internal/controller/platform/standalone/nimservice.go:638\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).updateModelStatus\n\t/workspace/internal/controller/platform/standalone/nimservice.go:605\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).reconcileNIMService\n\t/workspace/internal/controller/platform/standalone/nimservice.go:511\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*Standalone).Sync\n\t/workspace/internal/controller/platform/standalone/standalone.go:119\ngithub.com/NVIDIA/k8s-nim-operator/internal/controller.(*NIMServiceReconciler).Reconcile\n\t/workspace/internal/controller/nimservice_controller.go:179\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:334\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:255"}
nim-operator-k8s-nim-operator-5fcd5f57b-jvf4k manager {"level":"info","ts":"2025-11-04T07:48:46Z","msg":"WARN: Model status update failed, will retry in 5 seconds","controller":"nimservice","controllerGroup":"apps.nvidia.com","controllerKind":"NIMService","NIMService":{"name":"mistral-small-3-2-24b-instruct-2506","namespace":"test-ns"},"namespace":"test-ns","name":"mistral-small-3-2-24b-instruct-2506","reconcileID":"5cc948e0-5599-4696-bfd8-1ce32cf6dc1e","error":"Get \"http://10.150.100.224:8000/v1/models\": dial tcp 10.150.100.224:8000: connect: network is unreachable"}

Thank you very much in advance for your support!

Metadata

Labels

enhancement (New feature or request)
