
Adding support for building engine for buildable profiles#526

Merged
visheshtanksale merged 18 commits intoNVIDIA:mainfrom
visheshtanksale:buildable-profile
Jun 27, 2025

Conversation

@visheshtanksale
Collaborator

@visheshtanksale visheshtanksale commented Jun 9, 2025

Added support for the NIMBuild CRD to start a pod that can build an engine.

To Do

  • Add Unit test Coverage
  • Add support for adding details of the new local build engine on NIMCache Status

@copy-pr-bot

copy-pr-bot bot commented Jun 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@visheshtanksale visheshtanksale marked this pull request as draft June 13, 2025 00:14
@visheshtanksale visheshtanksale force-pushed the buildable-profile branch 3 times, most recently from d7acc65 to 04132a2 on June 13, 2025 08:11
@visheshtanksale visheshtanksale marked this pull request as ready for review June 13, 2025 08:13
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
@visheshtanksale visheshtanksale force-pushed the buildable-profile branch 3 times, most recently from 1e88505 to 5d58673 on June 17, 2025 00:05
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Collaborator

@varunrsekar varunrsekar left a comment


Some more minor comments...

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Collaborator

@varunrsekar varunrsekar left a comment


Follow-up: How do we account for a NIMService referencing a NIMCache for a buildable profile that is currently being built via a NIMBuild CR? In NIMService, we only check if the NIMCache is Ready before starting, but in this case we'd likely need more fine-grained checks on the profile status. And thinking crudely, we'd probably need the nimcache status to reflect the state of new profiles being built.

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
@visheshtanksale
Collaborator Author

Follow-up: How do we account for a NIMService referencing a NIMCache for a buildable profile that is currently being built via a NIMBuild CR? In NIMService, we only check if the NIMCache is Ready before starting, but in this case we'd likely need more fine-grained checks on the profile status. And thinking crudely, we'd probably need the nimcache status to reflect the state of new profiles being built.

We cannot block NIMService from using a NIMCache while a NIMBuild action is pending. A NIMCache is ready to use once it has downloaded the profiles, and there can be profiles in the cache that the user wants to run without building them. NIMBuild is an optional action performed on the cache. Adding details of currently running NIMBuilds would be mostly redundant, because the relationship between a NIMCache and its NIMBuilds can also be obtained by querying the NIMBuild resources. We add the details of successful builds because those profiles are part of the cache.
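The readiness rule described here can be modeled in a few lines. This is an illustrative sketch, not the operator's actual types: the cache is Ready as soon as its profiles are downloaded, pending NIMBuilds are intentionally ignored, and only completed builds land on the cache status.

```go
package main

import "fmt"

// CacheStatus is a toy model of the NIMCache status discussed above.
type CacheStatus struct {
	ProfilesDownloaded bool
	BuiltEngines       []string // engines recorded from completed NIMBuilds
}

// cacheReady returns whether the cache is usable. Pending builds do not
// gate readiness: NIMBuild is an optional action layered on top of an
// already-usable cache.
func cacheReady(s CacheStatus, pendingBuilds int) bool {
	_ = pendingBuilds // deliberately unused
	return s.ProfilesDownloaded
}

func main() {
	s := CacheStatus{ProfilesDownloaded: true}
	fmt.Println(cacheReady(s, 2)) // true even with builds in flight
}
```

Keeping the rule this simple is the design point: a NIMService consumer never has to reason about in-flight builds, only about what the cache already contains.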

@varunrsekar
Collaborator

Follow-up: How do we account for a NIMService referencing a NIMCache for a buildable profile that is currently being built via a NIMBuild CR? In NIMService, we only check if the NIMCache is Ready before starting, but in this case we'd likely need more fine-grained checks on the profile status. And thinking crudely, we'd probably need the nimcache status to reflect the state of new profiles being built.

We cannot block NIMService from using a NIMCache while a NIMBuild action is pending. A NIMCache is ready to use once it has downloaded the profiles, and there can be profiles in the cache that the user wants to run without building them. NIMBuild is an optional action performed on the cache. Adding details of currently running NIMBuilds would be mostly redundant, because the relationship between a NIMCache and its NIMBuilds can also be obtained by querying the NIMBuild resources. We add the details of successful builds because those profiles are part of the cache.

So I was thinking of this scenario:

  • A NIMService is created without specifying any model profiles, on a node that has no matching optimized profiles.
  • The NIM would automatically choose the buildable profile for the node's GPU type and attempt to build it; with insufficient resources, the NIMService would go to a Failed state.
  • In parallel, a NIMBuild is running to build the profile for the node's GPU type.
  • Once the NIMBuild completes, if the exact same NIMService spec is attempted, this time it will go to a Ready state.

I was thinking that this might cause confusion. But I agree that we shouldn't complicate the design and keep your current expectation. Thanks for the clarification.
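The scenario above can be sketched as a toy state function (names and parameters are illustrative, not operator code): the same NIMService spec yields Failed or Ready depending only on whether a matching engine exists yet for the node's GPU type, or the node can build one in place.

```go
package main

import "fmt"

// serviceOutcome models the reviewer's scenario: a NIMService using a
// buildable profile reaches Ready only if an engine for the node's GPU
// type already exists, or the node has resources to build it in place.
func serviceOutcome(builtEngines map[string]bool, gpuType string, canBuildInPlace bool) string {
	if builtEngines[gpuType] || canBuildInPlace {
		return "Ready"
	}
	return "Failed"
}

func main() {
	engines := map[string]bool{}
	// Before the parallel NIMBuild finishes, with insufficient resources:
	fmt.Println(serviceOutcome(engines, "H100", false)) // Failed
	// After the NIMBuild for the node's GPU type completes:
	engines["H100"] = true
	fmt.Println(serviceOutcome(engines, "H100", false)) // Ready
}
```

This makes the "same spec, different outcome" behavior explicit: the spec is unchanged between the two attempts; only the cache's set of built engines has grown.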

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
@visheshtanksale visheshtanksale merged commit ccff6e0 into NVIDIA:main Jun 27, 2025
9 checks passed
