
Cleanup Multi Node API to avoid redundant GPU input#631

Closed
visheshtanksale wants to merge 19 commits into NVIDIA:main from visheshtanksale:cleanup-gpu-per-node

Conversation

@visheshtanksale
Collaborator

@visheshtanksale visheshtanksale commented Aug 22, 2025

  • Removed the field spec.multiNode.GPUSPerPod
  • GPUs per node are now determined by the GPU count specified in spec.Resources
  • Added webhook validation to check that a GPU value is set in spec.Resources
  • TODO: if no GPU is specified in spec.Resources, auto-determine it as tp*pp/(.spec.multiNode.size)
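The TODO above reduces to simple arithmetic. A minimal sketch, assuming tp (tensor parallelism) times pp (pipeline parallelism) gives the model's total GPU count, spread evenly across .spec.multiNode.size pods; the function name and validation rules are illustrative, not the operator's actual code:

```go
package main

import "fmt"

// gpusPerPod derives the per-pod GPU count from parallelism settings when
// spec.Resources carries no explicit nvidia.com/gpu value. Hypothetical
// helper; errors signal configurations the webhook should reject.
func gpusPerPod(tp, pp, size int) (int, error) {
	if size <= 0 {
		return 0, fmt.Errorf("multiNode.size must be positive, got %d", size)
	}
	total := tp * pp
	if total%size != 0 {
		return 0, fmt.Errorf("tp*pp (%d) is not divisible by multiNode.size (%d)", total, size)
	}
	return total / size, nil
}

func main() {
	n, _ := gpusPerPod(8, 2, 2) // 16 GPUs spread over 2 pods
	fmt.Println(n)
}
```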

@copy-pr-bot

copy-pr-bot bot commented Aug 22, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


// GetMultiNodeGPUsPerPod returns the number of GPUs per pod for the multi-node NIMService.
func (n *NIMService) GetMultiNodeGPUsPerPod() int {
gpuQuantity, ok := n.Spec.Resources.Requests["nvidia.com/gpu"]
Collaborator

do we need to support DRA here?

Collaborator Author

Added DRA support. I am assuming, for ResourceClaim and ResourceClaimTemplate, that there might be only one request for GPUs under Spec.Devices.Requests. Is this a correct assumption?
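Rather than assuming a single request, the claim spec can be walked and every GPU request summed. A sketch using hypothetical trimmed-down mirrors of the resource.k8s.io claim types (the real operator code would walk the actual ResourceClaimSpec):

```go
package main

import "fmt"

// Hypothetical stand-ins for the resource.k8s.io claim types.
type DeviceRequest struct {
	DeviceClassName string
	Count           int64
}

type ClaimSpec struct {
	Requests []DeviceRequest
}

// countGPUDevices does not assume a single request: it sums the count of
// every request whose device class is the well-known gpu.nvidia.com, which
// also handles claims mixing GPU and non-GPU (e.g. RDMA) devices.
func countGPUDevices(spec ClaimSpec) int64 {
	var total int64
	for _, req := range spec.Requests {
		if req.DeviceClassName == "gpu.nvidia.com" {
			total += req.Count
		}
	}
	return total
}

func main() {
	spec := ClaimSpec{Requests: []DeviceRequest{
		{DeviceClassName: "gpu.nvidia.com", Count: 2},
		{DeviceClassName: "rdma.nvidia.com", Count: 1},
		{DeviceClassName: "gpu.nvidia.com", Count: 2},
	}}
	fmt.Println(countGPUDevices(spec))
}
```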

// Constants for GPU device detection.
const (
// NVIDIA identifiers used to detect GPU devices.
NVIDIAIdentifier = "nvidia"
Collaborator

These might be used for non-GPU devices too, e.g. rdma.nvidia.com. We should just check for the well-known device class gpu.nvidia.com, or if custom ones are set, make sure the CEL expression exists in those classes:

      expression: "device.driver == 'gpu.nvidia.com' && device.attributes['gpu.nvidia.com'].type == 'gpu'"

Collaborator Author

Fixed this to look for gpu.nvidia.com

Comment on lines +263 to +276
switch {
case resource.ResourceClaimName != nil:
gpuCount, err = GetGPUDeviceCountForClaim(ctx, client, *resource.ResourceClaimName, namespace)
case resource.ResourceClaimTemplateName != nil:
gpuCount, err = GetGPUDeviceCountForClaimTemplate(ctx, client, *resource.ResourceClaimTemplateName, namespace)
case resource.ClaimCreationSpec != nil:
gpuCount, err = GetGPUDeviceCountForClaimCreationSpec(ctx, client, *resource.ClaimCreationSpec, namespace)
}
Collaborator

I am wondering, do we have a use case for users to specify two of (resourceClaimName, resourceClaimTemplateName, claimCreationSpec)?

I thought users would only specify one of the three?
If that's the case, we should consider adding this to the validation webhook.

Collaborator

@shengnuo this is possible across the spec.draResources[] array. 1 entry may have a resourceClaimName while another one may have a claimCreationSpec. We would need to aggregate them

Collaborator Author

It's possible for users to specify all three resource specification types.
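Since entries in spec.draResources[] may each use a different reference kind, the counts need to be aggregated across the array, as suggested above. A sketch with hypothetical stand-in types; the three lookup funcs stand in for the real GetGPUDeviceCountForClaim* helpers, which query the API server:

```go
package main

import "fmt"

// Hypothetical trimmed stand-in for a spec.draResources[] entry; each
// entry sets exactly one of these three references.
type DRAResource struct {
	ResourceClaimName         *string
	ResourceClaimTemplateName *string
	ClaimCreationSpec         *string
}

// totalGPUs sums GPU counts across every entry: one entry may carry a
// resourceClaimName while another carries a claimCreationSpec.
func totalGPUs(resources []DRAResource,
	byClaim, byTemplate, byCreation func(string) (int, error)) (int, error) {
	total := 0
	for i, r := range resources {
		var n int
		var err error
		switch {
		case r.ResourceClaimName != nil:
			n, err = byClaim(*r.ResourceClaimName)
		case r.ResourceClaimTemplateName != nil:
			n, err = byTemplate(*r.ResourceClaimTemplateName)
		case r.ClaimCreationSpec != nil:
			n, err = byCreation(*r.ClaimCreationSpec)
		default:
			err = fmt.Errorf("draResources[%d]: no claim reference set", i)
		}
		if err != nil {
			return 0, err // a partial sum would report a wrong GPU count
		}
		total += n
	}
	return total, nil
}

func main() {
	claim, creation := "claim-a", "inline-b"
	four := func(string) (int, error) { return 4, nil }
	got, err := totalGPUs([]DRAResource{
		{ResourceClaimName: &claim},
		{ClaimCreationSpec: &creation},
	}, four, four, four)
	fmt.Println(got, err)
}
```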

// TODO auto determine base on tp*pp/(.spec.multiNode.size)
if nimService.Spec.MultiNode != nil {
gpuQuantity, err = apiResource.ParseQuantity(fmt.Sprintf("%d", nimService.Spec.MultiNode.GPUSPerPod))
gpuQuantity, err = apiResource.ParseQuantity(fmt.Sprintf("%d", multiNodeGPUsPerPod))
Collaborator

I'm confused why this change is needed. multiNodeGPUsPerPod is computed from the nvidia.com/gpu resource requests/limits. It doesn't make sense to do it here, which only executes when resource requests/limits are missing.

Collaborator Author

Removed this

// GetGPUCountPerPod returns the number of GPUs per pod for the NIMService.
func GetGPUCountPerPod(ctx context.Context, client client.Client, nimService *appsv1alpha1.NIMService) (int, error) {

if nimService.Spec.DRAResources == nil {
Collaborator

nit

Suggested change
if nimService.Spec.DRAResources == nil {
if len(nimService.Spec.DRAResources) == 0 {

Collaborator Author

Fixed


if nimService.Spec.DRAResources == nil {
if nimService.Spec.Resources == nil {
return 0, nil
Collaborator

We should return an error when no GPUs are specified and fail the NIMService for multiNode.

Collaborator Author

Fixed

Comment on lines +34 to +38
gpuQuantity, ok := nimService.Spec.Resources.Requests["nvidia.com/gpu"]
if !ok {
// return 0 if no GPU limit is specified; auto-determining based on tp*pp/(.spec.multiNode.size) is a TODO
return 0, nil
}
Collaborator

can you also check for resource limits here?

Collaborator Author

Fixed
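The requested requests-then-limits fallback can be sketched as below; ResourceList here is a hypothetical stand-in for corev1.ResourceList using plain ints:

```go
package main

import "fmt"

// ResourceList mimics corev1.ResourceList with plain ints for this sketch.
type ResourceList map[string]int

// gpuCount checks requests first and falls back to limits. Kubernetes
// defaults requests from limits when only limits are set, so either map
// may carry the nvidia.com/gpu count.
func gpuCount(requests, limits ResourceList) (int, bool) {
	if n, ok := requests["nvidia.com/gpu"]; ok {
		return n, true
	}
	if n, ok := limits["nvidia.com/gpu"]; ok {
		return n, true
	}
	return 0, false
}

func main() {
	n, ok := gpuCount(ResourceList{}, ResourceList{"nvidia.com/gpu": 2})
	fmt.Println(n, ok)
}
```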

}

hint := strings.ToLower(fmt.Sprint(dc.Spec))
return strings.Contains(hint, NVIDIAIdentifier) || strings.Contains(hint, NVIDIAComIdentifier), nil
Collaborator

nit: we should restrict to gpu.nvidia.com device class.

Collaborator Author

Done

}

// Validate that Claims must be empty
if spec.Resources != nil && spec.Resources.Claims != nil && len(spec.Resources.Claims) > 0 {
Collaborator

nit: if it's nil, then len(...) will return 0

Suggested change
if spec.Resources != nil && spec.Resources.Claims != nil && len(spec.Resources.Claims) > 0 {
if spec.Resources != nil && len(spec.Resources.Claims) > 0 {

Collaborator Author

Fixed

gpuResourceName := corev1.ResourceName("nvidia.com/gpu")

// Check if GPU requests are specified
if spec.Resources == nil || spec.Resources.Requests == nil {
Collaborator

nit: also check for resource limits

Collaborator Author

Fixed

errList = append(errList, validateMetricsConfiguration(&spec.Metrics, fldPath.Child("metrics"))...)
errList = append(errList, validateScaleConfiguration(&spec.Scale, fldPath.Child("scale"))...)
errList = append(errList, validateResourcesConfiguration(spec.Resources, fldPath.Child("resources"))...)
errList = append(errList, validateResourcesConfiguration(spec, fldPath.Child("resources"))...)
Collaborator

Suggested change
errList = append(errList, validateResourcesConfiguration(spec, fldPath.Child("resources"))...)
errList = append(errList, validateResourcesConfiguration(spec, fldPath)...)

nit: it's better to match the fieldPath with the input in the caller.

Collaborator Author

Fixed


logger.Info("Reconciling", "NIMService", nimService.Name)

if nimService.Spec.MultiNode != nil && nimService.Annotations != nil {
Collaborator

Annotations could be nil initially; we would need to initialize the map the first time.

Collaborator

Also, this should be done in the platform-specific reconciler where we render the LWS resource. That way we can also consider the case of auto GPU assignment based on the tp size in the optimized profile; otherwise there would be a mismatch and we would end up setting this annotation to 0.

Collaborator Author

Updated the annotation nil check.
Auto-assignment of GPUs is currently not supported for multi-node. We need to discuss this scenario.

// Get tensorParallelism from the profile
tensorParallelism, err := utils.GetTensorParallelismByProfileTags(profile.Config)
if err != nil {
logger.Error(err, "Failed to retrieve tensorParallelism")
Collaborator

Suggested change
logger.Error(err, "Failed to retrieve tensorParallelism")
logger.Error(err, "Missing nvidia.com/gpu resource request and unable to retrieve tensorParallelism for NIM profile")

Collaborator Author

Fixed


isGPU, err := isNVIDIAGPU(ctx, client, req.Exactly.DeviceClassName)
if err != nil {
// This allows partial success scenarios
Collaborator

We would need to fail in all cases here, as the GPU count might be incorrect, e.g. if there is an error getting the device class.

Collaborator Author

Fixed

_, hasRequests := spec.Resources.Requests[gpuResourceName]
_, hasLimits := spec.Resources.Limits[gpuResourceName]

if !hasRequests && !hasLimits {
Collaborator

this will fail when an optimized profile is selected and we auto-retrieve GPUs based on the tp size.

Collaborator Author

This check is only for multi-node NIMService. Currently we don't support automatically assigning GPU resources for multi-node NIMService here. Since I have removed .spec.multiNode.gpuPerWorker, the user has to provide GPU resource limits for multi-node.

Collaborator

But we auto-assign based on the tp size for the optimized profile as well, with the changes in the assignGPUResources call.

Comment on lines +577 to +595
// If no user-provided GPU resource is found, proceed with auto-assignment
// Get tensorParallelism from the profile
tensorParallelism, err := utils.GetTensorParallelismByProfileTags(profile.Config)
if err != nil {
logger.Error(err, "Missing nvidia.com/gpu resource request/limit and unable to retrieve tensorParallelism for NIM profile")
return nil, err
}
if tensorParallelism != "" {
gpuQuantity, err = apiResource.ParseQuantity(tensorParallelism)
Collaborator

The addGPUResources function looks exactly the same between the standalone and kserve controllers...

Consider adding it to internal/controller/shared

Collaborator

This can be done as a follow-up, as this applies to many functions between those two packages.

gpuResourceName := corev1.ResourceName("nvidia.com/gpu")

// Check if GPU requests or limits are specified
if spec.Resources == nil {
Collaborator

This should also consider spec.DRAResources

Collaborator Author

Fixed

@visheshtanksale visheshtanksale enabled auto-merge (squash) August 29, 2025 22:07
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
return errList
}

if spec.Resources != nil {
Collaborator

@visheshtanksale This could just include CPU/Memory requests, while spec.draResources can include GPU requests. First we need to check if draResources are specified then rely on spec.resources.
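The ordering suggested here (check draResources first, only then require nvidia.com/gpu under spec.resources) can be sketched with hypothetical trimmed-down stand-ins for the fields involved:

```go
package main

import "fmt"

// Hypothetical trimmed view of the fields involved in the check.
type Spec struct {
	DRAResources []string       // stand-in for spec.draResources entries
	Requests     map[string]int // stand-in for spec.resources.requests
	Limits       map[string]int // stand-in for spec.resources.limits
}

// validateMultiNodeGPUs: when draResources are specified they carry the
// GPU request, so the nvidia.com/gpu check on spec.resources applies
// only otherwise.
func validateMultiNodeGPUs(s Spec) error {
	if len(s.DRAResources) > 0 {
		return nil // GPUs come from DRA claims, validated elsewhere
	}
	if _, ok := s.Requests["nvidia.com/gpu"]; ok {
		return nil
	}
	if _, ok := s.Limits["nvidia.com/gpu"]; ok {
		return nil
	}
	return fmt.Errorf("multi-node NIMService needs nvidia.com/gpu in resources or draResources")
}

func main() {
	err := validateMultiNodeGPUs(Spec{Requests: map[string]int{"cpu": 8}})
	fmt.Println(err != nil)
}
```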

Collaborator Author (Sep 2, 2025)

This check is primarily on spec.resources, and we don't have any checks under spec.draResources because those would require an API request. Since this includes only checks under spec.resources, I feel it's better to have the implementation here.

Collaborator Author

Fixed


// +kubebuilder:default:=1
// Size specifies the number of pods to create for the multi-node NIMService.
// PipelineParallelism specifies the number of pods to create for the multi-node NIMService.
Collaborator

I am inclined to add a parallelism type for this in case we need to add additional methods in the future:

parallelism:
  pp: 1 (default/minimum)

Collaborator Author

Fixed

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Comment on lines +138 to +141
if nimService.Annotations == nil {
nimService.Annotations = map[string]string{}
}
if _, ok := nimService.Annotations[utils.GPUCountPerPodAnnotationKey]; !ok {
Collaborator

nit: can you use GetGPUCountPerPod here? It's already reading that annotation value.

// GPUSPerPod specifies the number of GPUs for each instance. In most cases, this should match `resources.limits.nvidia.com/gpu`.
GPUSPerPod int `json:"gpusPerPod,omitempty"`
// Parallelism specifies the parallelism strategy for the multi-node NIMService.
Parallelism Parallelism `json:"parallelism,omitempty"`
Collaborator

Suggested change
Parallelism Parallelism `json:"parallelism,omitempty"`
Parallelism *Parallelism `json:"parallelism"`

I'd expect this to be a required field

// +kubebuilder:default:=1
// PP specifies the number of pods to create for the multi-node NIMService.
// +kubebuilder:validation:Minimum=1
PP int `json:"pp,omitempty"`
Collaborator

Suggested change
PP int `json:"pp,omitempty"`
Pipeline *uint32 `json:"pp,omitempty"`

Can you call this Pipeline? PP doesn't sound good.

}

type Parallelism struct {
// +kubebuilder:default:=1
Collaborator

We shouldn't have defaults here. This wouldn't work for multi-node anyway.

}
}
return int(gpuQuantity.Value()), nil
} else {
Collaborator

style nit: you don't need an else block here.

Comment on lines +258 to +259
hint := strings.ToLower(fmt.Sprint(dc.Spec))
return strings.Contains(hint, NVIDIAGPUComIdentifier), nil
Collaborator

This is kind of flaky because the term may appear anywhere in the spec. It's safer to iterate over all the selectors and look only at the expressions.

As a follow-up, we can also be careful to only check whether it contains device.driver == "gpu.nvidia.com", has device.attributes["gpu.nvidia.com"], or has device.capacity["gpu.nvidia.com"].
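The selector-based check described here can be sketched as below; DeviceSelector is a hypothetical trimmed mirror of the resource.k8s.io DeviceClass selector carrying a CEL expression:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stand-in for a DeviceClass selector with a CEL expression.
type DeviceSelector struct {
	Expression string
}

// isNVIDIAGPUClass inspects only the selector expressions, rather than a
// stringified spec, for the three markers above. CEL accepts single or
// double quotes, so expressions are normalized first.
func isNVIDIAGPUClass(selectors []DeviceSelector) bool {
	markers := []string{
		`device.driver == "gpu.nvidia.com"`,
		`device.attributes["gpu.nvidia.com"]`,
		`device.capacity["gpu.nvidia.com"]`,
	}
	for _, sel := range selectors {
		expr := strings.ReplaceAll(sel.Expression, "'", `"`)
		for _, m := range markers {
			if strings.Contains(expr, m) {
				return true
			}
		}
	}
	return false
}

func main() {
	gpu := DeviceSelector{Expression: `device.driver == 'gpu.nvidia.com' && device.attributes['gpu.nvidia.com'].type == 'gpu'`}
	rdma := DeviceSelector{Expression: `device.driver == "rdma.nvidia.com"`}
	fmt.Println(isNVIDIAGPUClass([]DeviceSelector{gpu}), isNVIDIAGPUClass([]DeviceSelector{rdma}))
}
```

Substring matching on expressions is still a heuristic; a stricter follow-up could parse the CEL expression properly.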


// Validate that Claims must be empty
if spec.Resources != nil && len(spec.Resources.Claims) > 0 {
errList = append(errList, field.Forbidden(fldPath.Child("claims"), "must be empty"))
Collaborator

Suggested change
errList = append(errList, field.Forbidden(fldPath.Child("claims"), "must be empty"))
errList = append(errList, field.Forbidden(fldPath.Child("resources").Child("claims"), "must be empty"))

}

// validateGPURequirements ensures that GPU resources are properly configured for MultiNode deployments.
func validateGPURequirements(spec *appsv1alpha1.NIMServiceSpec, fldPath *field.Path) field.ErrorList {
Collaborator

nit: call this validateMultiNodeGPURequirements?

if hasRequests {
gpuRequests := spec.Resources.Requests[gpuResourceName]
if gpuRequests.IsZero() || gpuRequests.Value() <= 0 {
errList = append(errList, field.Invalid(fldPath.Child("requests").Child("nvidia.com/gpu"), gpuRequests.String(), "must be greater than 0"))
Collaborator

Suggested change
errList = append(errList, field.Invalid(fldPath.Child("requests").Child("nvidia.com/gpu"), gpuRequests.String(), "must be greater than 0"))
errList = append(errList, field.Invalid(fldPath.Child("requests").Key("nvidia.com/gpu"), gpuRequests.String(), "must be greater than 0"))

if hasLimits {
gpuLimits := spec.Resources.Limits[gpuResourceName]
if gpuLimits.IsZero() || gpuLimits.Value() <= 0 {
errList = append(errList, field.Invalid(fldPath.Child("limits").Child("nvidia.com/gpu"), gpuLimits.String(), "must be greater than 0"))
Collaborator

Suggested change
errList = append(errList, field.Invalid(fldPath.Child("limits").Child("nvidia.com/gpu"), gpuLimits.String(), "must be greater than 0"))
errList = append(errList, field.Invalid(fldPath.Child("limits").Key("nvidia.com/gpu"), gpuLimits.String(), "must be greater than 0"))

@shivamerla
Collaborator

Can we close this, as the API change was merged? We need a follow-up to add validations in the webhook.

@visheshtanksale
Collaborator Author

Closing this. Will follow up with webhook validations for the next release.

auto-merge was automatically disabled September 5, 2025 22:34

Pull request was closed
