@@ -103,7 +103,8 @@ spec:
     type: array
   draResources:
     description: DRAResources is the list of DRA resource claims
-      to be used for the NIMService deployment.
+      to be used for the NIMService deployment or leader worker
+      set.
     items:
       description: |-
         DRAResource references exactly one ResourceClaim, either directly
@@ -829,6 +830,42 @@ spec:
             type: string
         type: object
     type: object
+  multiNode:
+    description: NimServiceMultiNodeConfig defines the configuration
+      for multi-node NIMService.
+    properties:
+      backendType:
+        default: lws
+        description: BackendType specifies the backend type
+          for the multi-node NIMService. Currently only LWS
+          is supported.
+        enum:
+        - lws
+        type: string
+      gpusPerPod:
+        default: 1
+        description: GPUSPerPod specifies the number of GPUs
+          for each instance. In most cases, this should match
+          `resources.limits.nvidia.com/gpu`.
+        type: integer
+      mpi:
+        description: MPI config for NIMService using LeaderWorkerSet
+        properties:
+          mpiStartTimeout:
+            default: 300
+            description: MPIStartTimeout specifies the timeout
+              in seconds for starting the cluster.
+            type: integer
+        required:
+        - mpiStartTimeout
+        type: object
+      size:
+        default: 1
+        description: Size specifies the number of pods to create
+          for the multi-node NIMService.
+        minimum: 1
+        type: integer
+    type: object
   nodeSelector:
     additionalProperties:
       type: string
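
For illustration, the `multiNode` block added in this hunk might be populated as in the fragment below. This is a sketch rather than something taken from the PR: only the key names, the defaults, and the `lws` enum value come from the schema above. The values `size: 2` and `gpusPerPod: 8`, and the matching GPU limit under `resources`, are assumptions chosen for the example.

```yaml
# Hypothetical fragment of a NIMService spec; all values are illustrative.
multiNode:
  backendType: lws        # the only value the enum allows; also the schema default
  size: 2                 # number of pods to create; schema minimum is 1
  gpusPerPod: 8           # assumed value; schema default is 1
  mpi:
    mpiStartTimeout: 300  # seconds; same as the schema default
resources:
  limits:
    nvidia.com/gpu: 8     # kept equal to gpusPerPod, as the description suggests
```

Keeping `gpusPerPod` and the `nvidia.com/gpu` limit equal follows the hint in the `gpusPerPod` description. The schema shown here does not appear to enforce that equality.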
@@ -1372,7 +1409,7 @@ spec:
     type: integer
   resources:
     description: |-
-      Resources is the resource requirements for the NIMService deployment.
+      Resources is the resource requirements for the NIMService deployment or leader worker set.
 
       Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
       Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2381,6 +2418,11 @@ spec:
           - authSecret
           - image
           type: object
+          x-kubernetes-validations:
+          - message: autoScaling must be nil or disabled when multiNode
+              is set
+            rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+              && self.scale.enabled)'
       type: object
     type: array
 type: object
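
As a sketch of how the CEL rule in this hunk would evaluate, consider the three hypothetical spec fragments below. Only the `multiNode`, `scale`, and `scale.enabled` paths come from the rule itself. The concrete values are assumptions, and the outcomes assume that no defaulting fills in `scale` when it is omitted.

```yaml
# Rejected: multiNode is set and scale.enabled is true,
# so the negated conjunction evaluates to false.
multiNode:
  size: 2
scale:
  enabled: true
---
# Accepted: multiNode is set, but scale.enabled is false.
multiNode:
  size: 2
scale:
  enabled: false
---
# Accepted: multiNode is set and scale is omitted,
# so has(self.scale) is false and the rule passes.
multiNode:
  size: 2
```

The `has()` guards mean the rule only inspects `scale.enabled` when the field is actually present, which is why an omitted `scale` block passes validation.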