Commit df68759

shengnuo authored and shivamerla committed
Update manifests
Signed-off-by: Sheng Lin <[email protected]>
1 parent 5cc745f commit df68759

8 files changed

Lines changed: 265 additions & 13 deletions

bundle/manifests/apps.nvidia.com_nimpipelines.yaml

Lines changed: 44 additions & 2 deletions
@@ -103,7 +103,8 @@ spec:
  type: array
  draResources:
  description: DRAResources is the list of DRA resource claims
- to be used for the NIMService deployment.
+ to be used for the NIMService deployment or leader worker
+ set.
  items:
  description: |-
  DRAResource references exactly one ResourceClaim, either directly
@@ -829,6 +830,42 @@ spec:
  type: string
  type: object
  type: object
+ multiNode:
+ description: NimServiceMultiNodeConfig defines the configuration
+ for multi-node NIMService.
+ properties:
+ backendType:
+ default: lws
+ description: BackendType specifies the backend type
+ for the multi-node NIMService. Currently only LWS
+ is supported.
+ enum:
+ - lws
+ type: string
+ gpusPerPod:
+ default: 1
+ description: GPUSPerPod specifies the number of GPUs
+ for each instance. In most cases, this should match
+ `resources.limits.nvidia.com/gpu`.
+ type: integer
+ mpi:
+ description: MPI config for NIMService using LeaderWorkerSet
+ properties:
+ mpiStartTimeout:
+ default: 300
+ description: MPIStartTimeout specifies the timeout
+ in seconds for starting the cluster.
+ type: integer
+ required:
+ - mpiStartTimeout
+ type: object
+ size:
+ default: 1
+ description: Size specifies the number of pods to create
+ for the multi-node NIMService.
+ minimum: 1
+ type: integer
+ type: object
  nodeSelector:
  additionalProperties:
  type: string
@@ -1372,7 +1409,7 @@ spec:
  type: integer
  resources:
  description: |-
- Resources is the resource requirements for the NIMService deployment.
+ Resources is the resource requirements for the NIMService deployment or leader worker set.

  Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
  Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2381,6 +2418,11 @@ spec:
  - authSecret
  - image
  type: object
+ x-kubernetes-validations:
+ - message: autoScaling must be nil or disabled when multiNode
+ is set
+ rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+ && self.scale.enabled)'
  type: object
  type: array
  type: object

bundle/manifests/apps.nvidia.com_nimservices.yaml

Lines changed: 40 additions & 2 deletions
@@ -64,7 +64,7 @@ spec:
  type: array
  draResources:
  description: DRAResources is the list of DRA resource claims to be
- used for the NIMService deployment.
+ used for the NIMService deployment or leader worker set.
  items:
  description: |-
  DRAResource references exactly one ResourceClaim, either directly
@@ -775,6 +775,40 @@ spec:
  type: string
  type: object
  type: object
+ multiNode:
+ description: NimServiceMultiNodeConfig defines the configuration for
+ multi-node NIMService.
+ properties:
+ backendType:
+ default: lws
+ description: BackendType specifies the backend type for the multi-node
+ NIMService. Currently only LWS is supported.
+ enum:
+ - lws
+ type: string
+ gpusPerPod:
+ default: 1
+ description: GPUSPerPod specifies the number of GPUs for each
+ instance. In most cases, this should match `resources.limits.nvidia.com/gpu`.
+ type: integer
+ mpi:
+ description: MPI config for NIMService using LeaderWorkerSet
+ properties:
+ mpiStartTimeout:
+ default: 300
+ description: MPIStartTimeout specifies the timeout in seconds
+ for starting the cluster.
+ type: integer
+ required:
+ - mpiStartTimeout
+ type: object
+ size:
+ default: 1
+ description: Size specifies the number of pods to create for the
+ multi-node NIMService.
+ minimum: 1
+ type: integer
+ type: object
  nodeSelector:
  additionalProperties:
  type: string
@@ -1309,7 +1343,7 @@ spec:
  type: integer
  resources:
  description: |-
- Resources is the resource requirements for the NIMService deployment.
+ Resources is the resource requirements for the NIMService deployment or leader worker set.

  Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
  Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2295,6 +2329,10 @@ spec:
  - authSecret
  - image
  type: object
+ x-kubernetes-validations:
+ - message: autoScaling must be nil or disabled when multiNode is set
+ rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+ && self.scale.enabled)'
  status:
  description: NIMServiceStatus defines the observed state of NIMService.
  properties:
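
The `x-kubernetes-validations` rule added in this hunk rejects a spec that sets `multiNode` while `scale.enabled` is true. A minimal sketch of a manifest that would be expected to fail the check follows. The `v1alpha1` API version and the resource name are assumptions, and required fields such as `image` and `authSecret` are omitted for brevity:

```yaml
# Hypothetical manifest: expected to be rejected by the CEL rule above.
apiVersion: apps.nvidia.com/v1alpha1   # assumed API version
kind: NIMService
metadata:
  name: example-multinode              # hypothetical name
spec:
  multiNode:
    size: 2
    gpusPerPod: 1
  scale:
    enabled: true                      # conflicts with multiNode
```

Because the rule only fails when every `has(...)` term holds and `scale.enabled` is true, setting `scale.enabled: false` or dropping the `scale` block should satisfy it.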

config/crd/bases/apps.nvidia.com_nimpipelines.yaml

Lines changed: 44 additions & 2 deletions
@@ -103,7 +103,8 @@ spec:
  type: array
  draResources:
  description: DRAResources is the list of DRA resource claims
- to be used for the NIMService deployment.
+ to be used for the NIMService deployment or leader worker
+ set.
  items:
  description: |-
  DRAResource references exactly one ResourceClaim, either directly
@@ -829,6 +830,42 @@ spec:
  type: string
  type: object
  type: object
+ multiNode:
+ description: NimServiceMultiNodeConfig defines the configuration
+ for multi-node NIMService.
+ properties:
+ backendType:
+ default: lws
+ description: BackendType specifies the backend type
+ for the multi-node NIMService. Currently only LWS
+ is supported.
+ enum:
+ - lws
+ type: string
+ gpusPerPod:
+ default: 1
+ description: GPUSPerPod specifies the number of GPUs
+ for each instance. In most cases, this should match
+ `resources.limits.nvidia.com/gpu`.
+ type: integer
+ mpi:
+ description: MPI config for NIMService using LeaderWorkerSet
+ properties:
+ mpiStartTimeout:
+ default: 300
+ description: MPIStartTimeout specifies the timeout
+ in seconds for starting the cluster.
+ type: integer
+ required:
+ - mpiStartTimeout
+ type: object
+ size:
+ default: 1
+ description: Size specifies the number of pods to create
+ for the multi-node NIMService.
+ minimum: 1
+ type: integer
+ type: object
  nodeSelector:
  additionalProperties:
  type: string
@@ -1372,7 +1409,7 @@ spec:
  type: integer
  resources:
  description: |-
- Resources is the resource requirements for the NIMService deployment.
+ Resources is the resource requirements for the NIMService deployment or leader worker set.

  Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
  Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2381,6 +2418,11 @@ spec:
  - authSecret
  - image
  type: object
+ x-kubernetes-validations:
+ - message: autoScaling must be nil or disabled when multiNode
+ is set
+ rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+ && self.scale.enabled)'
  type: object
  type: array
  type: object

config/crd/bases/apps.nvidia.com_nimservices.yaml

Lines changed: 40 additions & 2 deletions
@@ -64,7 +64,7 @@ spec:
  type: array
  draResources:
  description: DRAResources is the list of DRA resource claims to be
- used for the NIMService deployment.
+ used for the NIMService deployment or leader worker set.
  items:
  description: |-
  DRAResource references exactly one ResourceClaim, either directly
@@ -775,6 +775,40 @@ spec:
  type: string
  type: object
  type: object
+ multiNode:
+ description: NimServiceMultiNodeConfig defines the configuration for
+ multi-node NIMService.
+ properties:
+ backendType:
+ default: lws
+ description: BackendType specifies the backend type for the multi-node
+ NIMService. Currently only LWS is supported.
+ enum:
+ - lws
+ type: string
+ gpusPerPod:
+ default: 1
+ description: GPUSPerPod specifies the number of GPUs for each
+ instance. In most cases, this should match `resources.limits.nvidia.com/gpu`.
+ type: integer
+ mpi:
+ description: MPI config for NIMService using LeaderWorkerSet
+ properties:
+ mpiStartTimeout:
+ default: 300
+ description: MPIStartTimeout specifies the timeout in seconds
+ for starting the cluster.
+ type: integer
+ required:
+ - mpiStartTimeout
+ type: object
+ size:
+ default: 1
+ description: Size specifies the number of pods to create for the
+ multi-node NIMService.
+ minimum: 1
+ type: integer
+ type: object
  nodeSelector:
  additionalProperties:
  type: string
@@ -1309,7 +1343,7 @@ spec:
  type: integer
  resources:
  description: |-
- Resources is the resource requirements for the NIMService deployment.
+ Resources is the resource requirements for the NIMService deployment or leader worker set.

  Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
  Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2295,6 +2329,10 @@ spec:
  - authSecret
  - image
  type: object
+ x-kubernetes-validations:
+ - message: autoScaling must be nil or disabled when multiNode is set
+ rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+ && self.scale.enabled)'
  status:
  description: NIMServiceStatus defines the observed state of NIMService.
  properties:

config/rbac/role.yaml

Lines changed: 7 additions & 0 deletions
@@ -49,6 +49,13 @@ rules:
  - pods/log
  verbs:
  - get
+ - apiGroups:
+ - apiextensions.k8s.io
+ resources:
+ - customresourcedefinitions
+ verbs:
+ - get
+ - list
  - apiGroups:
  - apps
  resources:

config/samples/nim/llm/nimservice.yaml

Lines changed: 6 additions & 1 deletion
@@ -5,7 +5,7 @@ metadata:
  spec:
  image:
  repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
- tag: 1.3.3
+ tag: "1.8"
  pullPolicy: IfNotPresent
  pullSecrets:
  - ngc-secret
@@ -22,3 +22,8 @@ spec:
  service:
  type: ClusterIP
  port: 8000
+ multiNode:
+ workers: 2
+ gpusPerWorker: 1
+ # mpi:
+ # clusterStartTimeout: 300
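
The field names in this sample (`workers`, `gpusPerWorker`, `clusterStartTimeout`) do not appear in the `multiNode` schema this commit adds to the CRDs, which defines `size`, `gpusPerPod`, and `mpi.mpiStartTimeout`. A sketch of the same block written with the schema's field names follows. It assumes `workers` corresponds to `size` and `gpusPerWorker` to `gpusPerPod`; that mapping is a guess and is not confirmed by the commit:

```yaml
# Fragment of spec, using the field names from the CRD schema in this commit.
multiNode:
  size: 2          # assumed equivalent of `workers`
  gpusPerPod: 1    # assumed equivalent of `gpusPerWorker`
  # mpi:
  #   mpiStartTimeout: 300   # seconds; 300 is the schema default
```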

deployments/helm/k8s-nim-operator/crds/apps.nvidia.com_nimpipelines.yaml

Lines changed: 44 additions & 2 deletions
@@ -103,7 +103,8 @@ spec:
  type: array
  draResources:
  description: DRAResources is the list of DRA resource claims
- to be used for the NIMService deployment.
+ to be used for the NIMService deployment or leader worker
+ set.
  items:
  description: |-
  DRAResource references exactly one ResourceClaim, either directly
@@ -829,6 +830,42 @@ spec:
  type: string
  type: object
  type: object
+ multiNode:
+ description: NimServiceMultiNodeConfig defines the configuration
+ for multi-node NIMService.
+ properties:
+ backendType:
+ default: lws
+ description: BackendType specifies the backend type
+ for the multi-node NIMService. Currently only LWS
+ is supported.
+ enum:
+ - lws
+ type: string
+ gpusPerPod:
+ default: 1
+ description: GPUSPerPod specifies the number of GPUs
+ for each instance. In most cases, this should match
+ `resources.limits.nvidia.com/gpu`.
+ type: integer
+ mpi:
+ description: MPI config for NIMService using LeaderWorkerSet
+ properties:
+ mpiStartTimeout:
+ default: 300
+ description: MPIStartTimeout specifies the timeout
+ in seconds for starting the cluster.
+ type: integer
+ required:
+ - mpiStartTimeout
+ type: object
+ size:
+ default: 1
+ description: Size specifies the number of pods to create
+ for the multi-node NIMService.
+ minimum: 1
+ type: integer
+ type: object
  nodeSelector:
  additionalProperties:
  type: string
@@ -1372,7 +1409,7 @@ spec:
  type: integer
  resources:
  description: |-
- Resources is the resource requirements for the NIMService deployment.
+ Resources is the resource requirements for the NIMService deployment or leader worker set.

  Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
  Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2381,6 +2418,11 @@ spec:
  - authSecret
  - image
  type: object
+ x-kubernetes-validations:
+ - message: autoScaling must be nil or disabled when multiNode
+ is set
+ rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+ && self.scale.enabled)'
  type: object
  type: array
  type: object
