Commit dea62aa

Merge branch 'main' into rag_dependencies
2 parents 1fce9c0 + 4dcedf0 commit dea62aa

2,048 files changed, +739480 -6185 lines changed

api/apps/v1alpha1/nimservice_types.go

Lines changed: 715 additions & 9 deletions
Large diffs are not rendered by default.

api/apps/v1alpha1/zz_generated.deepcopy.go

Lines changed: 40 additions & 0 deletions
Some generated files are not rendered by default.

bundle/manifests/apps.nvidia.com_nimpipelines.yaml

Lines changed: 44 additions & 2 deletions
@@ -103,7 +103,8 @@ spec:
 type: array
 draResources:
 description: DRAResources is the list of DRA resource claims
-to be used for the NIMService deployment.
+to be used for the NIMService deployment or leader worker
+set.
 items:
 description: |-
 DRAResource references exactly one ResourceClaim, either directly
@@ -829,6 +830,42 @@ spec:
 type: string
 type: object
 type: object
+multiNode:
+description: NimServiceMultiNodeConfig defines the configuration
+for multi-node NIMService.
+properties:
+backendType:
+default: lws
+description: BackendType specifies the backend type
+for the multi-node NIMService. Currently only LWS
+is supported.
+enum:
+- lws
+type: string
+gpusPerPod:
+default: 1
+description: GPUSPerPod specifies the number of GPUs
+for each instance. In most cases, this should match
+`resources.limits.nvidia.com/gpu`.
+type: integer
+mpi:
+description: MPI config for NIMService using LeaderWorkerSet
+properties:
+mpiStartTimeout:
+default: 300
+description: MPIStartTimeout specifies the timeout
+in seconds for starting the cluster.
+type: integer
+required:
+- mpiStartTimeout
+type: object
+size:
+default: 1
+description: Size specifies the number of pods to create
+for the multi-node NIMService.
+minimum: 1
+type: integer
+type: object
 nodeSelector:
 additionalProperties:
 type: string
@@ -1372,7 +1409,7 @@ spec:
 type: integer
 resources:
 description: |-
-Resources is the resource requirements for the NIMService deployment.
+Resources is the resource requirements for the NIMService deployment or leader worker set.
 
 Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
 Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2381,6 +2418,11 @@ spec:
 - authSecret
 - image
 type: object
+x-kubernetes-validations:
+- message: autoScaling must be nil or disabled when multiNode
+is set
+rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+&& self.scale.enabled)'
 type: object
 type: array
 type: object
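
The defaults and constraints that the new `multiNode` schema declares can be sketched in Go as follows. This is a minimal illustration, not the operator's code: the type and function names (`MultiNodeConfig`, `MPIConfig`, `withDefaults`, `validate`) are hypothetical, and treating a zero value as "unset" is a simplification of how the API server applies schema defaults. In the schema, `mpiStartTimeout` is nested under `mpi`, so its default is modeled as applying only when an `mpi` block is present.

```go
package main

import (
	"errors"
	"fmt"
)

// MPIConfig and MultiNodeConfig are hypothetical stand-ins for the multiNode
// schema shown in the diff; they are not the operator's actual Go types.
type MPIConfig struct {
	MPIStartTimeout int // timeout in seconds for starting the cluster
}

type MultiNodeConfig struct {
	BackendType string
	GPUsPerPod  int
	Size        int
	MPI         *MPIConfig // nil when the mpi block is omitted
}

// withDefaults fills zero-valued fields with the defaults from the schema.
// Treating zero as "unset" is an assumption made for this sketch.
func withDefaults(c MultiNodeConfig) MultiNodeConfig {
	if c.BackendType == "" {
		c.BackendType = "lws"
	}
	if c.GPUsPerPod == 0 {
		c.GPUsPerPod = 1
	}
	if c.Size == 0 {
		c.Size = 1
	}
	if c.MPI != nil && c.MPI.MPIStartTimeout == 0 {
		mpi := *c.MPI // copy so the caller's value is not mutated
		mpi.MPIStartTimeout = 300
		c.MPI = &mpi
	}
	return c
}

// validate mirrors the schema's enum (only "lws") and minimum (size >= 1).
func validate(c MultiNodeConfig) error {
	if c.BackendType != "lws" {
		return fmt.Errorf("unsupported backendType %q", c.BackendType)
	}
	if c.Size < 1 {
		return errors.New("size must be at least 1")
	}
	return nil
}

func main() {
	c := withDefaults(MultiNodeConfig{Size: 2, GPUsPerPod: 8, MPI: &MPIConfig{}})
	fmt.Println(c.BackendType, c.GPUsPerPod, c.Size, c.MPI.MPIStartTimeout, validate(c) == nil)
}
```

In a real cluster the API server enforces these rules from the CRD itself, so a helper like this would only be useful for local checks or tests.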

bundle/manifests/apps.nvidia.com_nimservices.yaml

Lines changed: 40 additions & 2 deletions
@@ -64,7 +64,7 @@ spec:
 type: array
 draResources:
 description: DRAResources is the list of DRA resource claims to be
-used for the NIMService deployment.
+used for the NIMService deployment or leader worker set.
 items:
 description: |-
 DRAResource references exactly one ResourceClaim, either directly
@@ -775,6 +775,40 @@ spec:
 type: string
 type: object
 type: object
+multiNode:
+description: NimServiceMultiNodeConfig defines the configuration for
+multi-node NIMService.
+properties:
+backendType:
+default: lws
+description: BackendType specifies the backend type for the multi-node
+NIMService. Currently only LWS is supported.
+enum:
+- lws
+type: string
+gpusPerPod:
+default: 1
+description: GPUSPerPod specifies the number of GPUs for each
+instance. In most cases, this should match `resources.limits.nvidia.com/gpu`.
+type: integer
+mpi:
+description: MPI config for NIMService using LeaderWorkerSet
+properties:
+mpiStartTimeout:
+default: 300
+description: MPIStartTimeout specifies the timeout in seconds
+for starting the cluster.
+type: integer
+required:
+- mpiStartTimeout
+type: object
+size:
+default: 1
+description: Size specifies the number of pods to create for the
+multi-node NIMService.
+minimum: 1
+type: integer
+type: object
 nodeSelector:
 additionalProperties:
 type: string
@@ -1309,7 +1343,7 @@ spec:
 type: integer
 resources:
 description: |-
-Resources is the resource requirements for the NIMService deployment.
+Resources is the resource requirements for the NIMService deployment or leader worker set.
 
 Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
 Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2295,6 +2329,10 @@ spec:
 - authSecret
 - image
 type: object
+x-kubernetes-validations:
+- message: autoScaling must be nil or disabled when multiNode is set
+rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+&& self.scale.enabled)'
 status:
 description: NIMServiceStatus defines the observed state of NIMService.
 properties:
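
The `x-kubernetes-validations` rule added above rejects a spec that sets `multiNode` while autoscaling is enabled. A rough Go equivalent of the CEL expression is sketched below. The struct names (`Spec`, `MultiNode`, `Scale`) and the function `passesRule` are hypothetical, and CEL's `has(...)` is modeled with nil pointers; the operator's real types may differ.

```go
package main

import "fmt"

// Hypothetical, trimmed-down spec types. Pointers stand in for optional
// fields so that "has(self.x)" in CEL maps to "x != nil" here.
type MultiNode struct{}

type Scale struct {
	Enabled *bool
}

type Spec struct {
	MultiNode *MultiNode
	Scale     *Scale
}

// passesRule mirrors the CEL rule:
//   !(has(self.multiNode) && has(self.scale) && has(self.scale.enabled) && self.scale.enabled)
func passesRule(s Spec) bool {
	return !(s.MultiNode != nil && s.Scale != nil && s.Scale.Enabled != nil && *s.Scale.Enabled)
}

func main() {
	on, off := true, false
	fmt.Println(passesRule(Spec{MultiNode: &MultiNode{}, Scale: &Scale{Enabled: &on}}))  // rejected case
	fmt.Println(passesRule(Spec{MultiNode: &MultiNode{}, Scale: &Scale{Enabled: &off}})) // autoscaling disabled
	fmt.Println(passesRule(Spec{Scale: &Scale{Enabled: &on}}))                           // no multiNode
}
```

The rule only fails when all four conditions hold, so a spec with `multiNode` set and `scale` either absent or disabled is accepted.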

bundle/manifests/k8s-nim-operator.clusterserviceversion.yaml

Lines changed: 12 additions & 0 deletions
@@ -1249,6 +1249,18 @@ spec:
 - get
 - list
 - watch
+- apiGroups:
+- leaderworkerset.x-k8s.io
+resources:
+- leaderworkersets
+verbs:
+- create
+- get
+- list
+- watch
+- delete
+- patch
+- update
 deployments:
 - name: k8s-nim-operator
 spec:
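
To make the effect of the new RBAC rule concrete, the sketch below models it with a small stand-in struct and checks which requests it would permit. `PolicyRule` here is a simplified, hypothetical type (not `k8s.io/api/rbac/v1`), and the matching ignores wildcards, resource names, and subresources, which real Kubernetes RBAC also considers.

```go
package main

import "fmt"

// PolicyRule is a simplified stand-in for an RBAC rule; it only models
// the three lists that appear in the diff above.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// lwsRule copies the values added to the ClusterServiceVersion.
var lwsRule = PolicyRule{
	APIGroups: []string{"leaderworkerset.x-k8s.io"},
	Resources: []string{"leaderworkersets"},
	Verbs:     []string{"create", "get", "list", "watch", "delete", "patch", "update"},
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// allows reports whether the rule lists the group, resource, and verb
// exactly. Wildcard handling is deliberately omitted in this sketch.
func allows(r PolicyRule, group, resource, verb string) bool {
	return contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb)
}

func main() {
	fmt.Println(allows(lwsRule, "leaderworkerset.x-k8s.io", "leaderworkersets", "patch"))
	fmt.Println(allows(lwsRule, "leaderworkerset.x-k8s.io", "leaderworkersets", "deletecollection"))
}
```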

cmd/main.go

Lines changed: 6 additions & 0 deletions
@@ -21,9 +21,12 @@ import (
 	"flag"
 	"os"
 
+	kservev1beta1 "github.com/kserve/kserve/pkg/apis/serving/v1beta1"
 	monitoring "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
+
 	// Import all Kubernetes client auth plugins (e.g. Azure, GCP, OIDC, etc.)
 	// to ensure that exec-entrypoint and run can make use of them.
+
 	"k8s.io/apimachinery/pkg/runtime"
 	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
 	discovery "k8s.io/client-go/discovery"
@@ -35,6 +38,7 @@ import (
 	"sigs.k8s.io/controller-runtime/pkg/metrics/filters"
 	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
 	"sigs.k8s.io/controller-runtime/pkg/webhook"
+	lws "sigs.k8s.io/lws/api/leaderworkerset/v1"
 
 	appsv1alpha1 "github.com/NVIDIA/k8s-nim-operator/api/apps/v1alpha1"
 	"github.com/NVIDIA/k8s-nim-operator/internal/conditions"
@@ -55,6 +59,8 @@ func init() {
 	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
 	utilruntime.Must(appsv1alpha1.AddToScheme(scheme))
 	utilruntime.Must(monitoring.AddToScheme(scheme))
+	utilruntime.Must(lws.AddToScheme(scheme))
+	utilruntime.Must(kservev1beta1.AddToScheme(scheme))
 	// +kubebuilder:scaffold:scheme
 }

config/crd/bases/apps.nvidia.com_nimpipelines.yaml

Lines changed: 44 additions & 2 deletions
@@ -103,7 +103,8 @@ spec:
 type: array
 draResources:
 description: DRAResources is the list of DRA resource claims
-to be used for the NIMService deployment.
+to be used for the NIMService deployment or leader worker
+set.
 items:
 description: |-
 DRAResource references exactly one ResourceClaim, either directly
@@ -829,6 +830,42 @@ spec:
 type: string
 type: object
 type: object
+multiNode:
+description: NimServiceMultiNodeConfig defines the configuration
+for multi-node NIMService.
+properties:
+backendType:
+default: lws
+description: BackendType specifies the backend type
+for the multi-node NIMService. Currently only LWS
+is supported.
+enum:
+- lws
+type: string
+gpusPerPod:
+default: 1
+description: GPUSPerPod specifies the number of GPUs
+for each instance. In most cases, this should match
+`resources.limits.nvidia.com/gpu`.
+type: integer
+mpi:
+description: MPI config for NIMService using LeaderWorkerSet
+properties:
+mpiStartTimeout:
+default: 300
+description: MPIStartTimeout specifies the timeout
+in seconds for starting the cluster.
+type: integer
+required:
+- mpiStartTimeout
+type: object
+size:
+default: 1
+description: Size specifies the number of pods to create
+for the multi-node NIMService.
+minimum: 1
+type: integer
+type: object
 nodeSelector:
 additionalProperties:
 type: string
@@ -1372,7 +1409,7 @@ spec:
 type: integer
 resources:
 description: |-
-Resources is the resource requirements for the NIMService deployment.
+Resources is the resource requirements for the NIMService deployment or leader worker set.
 
 Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
 Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2381,6 +2418,11 @@ spec:
 - authSecret
 - image
 type: object
+x-kubernetes-validations:
+- message: autoScaling must be nil or disabled when multiNode
+is set
+rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+&& self.scale.enabled)'
 type: object
 type: array
 type: object

config/crd/bases/apps.nvidia.com_nimservices.yaml

Lines changed: 40 additions & 2 deletions
@@ -64,7 +64,7 @@ spec:
 type: array
 draResources:
 description: DRAResources is the list of DRA resource claims to be
-used for the NIMService deployment.
+used for the NIMService deployment or leader worker set.
 items:
 description: |-
 DRAResource references exactly one ResourceClaim, either directly
@@ -775,6 +775,40 @@ spec:
 type: string
 type: object
 type: object
+multiNode:
+description: NimServiceMultiNodeConfig defines the configuration for
+multi-node NIMService.
+properties:
+backendType:
+default: lws
+description: BackendType specifies the backend type for the multi-node
+NIMService. Currently only LWS is supported.
+enum:
+- lws
+type: string
+gpusPerPod:
+default: 1
+description: GPUSPerPod specifies the number of GPUs for each
+instance. In most cases, this should match `resources.limits.nvidia.com/gpu`.
+type: integer
+mpi:
+description: MPI config for NIMService using LeaderWorkerSet
+properties:
+mpiStartTimeout:
+default: 300
+description: MPIStartTimeout specifies the timeout in seconds
+for starting the cluster.
+type: integer
+required:
+- mpiStartTimeout
+type: object
+size:
+default: 1
+description: Size specifies the number of pods to create for the
+multi-node NIMService.
+minimum: 1
+type: integer
+type: object
 nodeSelector:
 additionalProperties:
 type: string
@@ -1309,7 +1343,7 @@ spec:
 type: integer
 resources:
 description: |-
-Resources is the resource requirements for the NIMService deployment.
+Resources is the resource requirements for the NIMService deployment or leader worker set.
 
 Note: Only traditional resources like cpu/memory and custom device plugin resources are supported here.
 Any DRA claim references are ignored. Use DRAResources instead for those.
@@ -2295,6 +2329,10 @@ spec:
 - authSecret
 - image
 type: object
+x-kubernetes-validations:
+- message: autoScaling must be nil or disabled when multiNode is set
+rule: '!(has(self.multiNode) && has(self.scale) && has(self.scale.enabled)
+&& self.scale.enabled)'
 status:
 description: NIMServiceStatus defines the observed state of NIMService.
 properties:
