
Adding support for building engine for buildable profiles#526

Merged
visheshtanksale merged 18 commits intoNVIDIA:mainfrom
visheshtanksale:buildable-profile
Jun 27, 2025

Conversation

@visheshtanksale
Collaborator

@visheshtanksale visheshtanksale commented Jun 9, 2025

Added support for the NIMBuild CRD to start a pod that can build an engine.

To Do

  • Add Unit test Coverage
  • Add support for adding details of the new local build engine on NIMCache Status

@copy-pr-bot

copy-pr-bot bot commented Jun 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@visheshtanksale visheshtanksale marked this pull request as draft June 13, 2025 00:14
@visheshtanksale visheshtanksale force-pushed the buildable-profile branch 3 times, most recently from d7acc65 to 04132a2 on June 13, 2025 08:11
@visheshtanksale visheshtanksale marked this pull request as ready for review June 13, 2025 08:13
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
@visheshtanksale visheshtanksale force-pushed the buildable-profile branch 3 times, most recently from 1e88505 to 5d58673 on June 17, 2025 00:05
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Collaborator

@varunrsekar varunrsekar left a comment


Some more minor comments...

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Collaborator

@varunrsekar varunrsekar left a comment


Follow-up: How do we account for a NIMService referencing a NIMCache for a buildable profile that is currently being built via a NIMBuild CR? In NIMService, we only check if the NIMCache is Ready before starting, but in this case we'd likely need more fine-grained checks on the profile status. And thinking crudely, we'd probably need the nimcache status to reflect the state of new profiles being built.

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
@visheshtanksale
Collaborator Author

Follow-up: How do we account for a NIMService referencing a NIMCache for a buildable profile that is currently being built via a NIMBuild CR? In NIMService, we only check if the NIMCache is Ready before starting, but in this case we'd likely need more fine-grained checks on the profile status. And thinking crudely, we'd probably need the nimcache status to reflect the state of new profiles being built.

We cannot block NIMService from using a NIMCache while a NIMBuild action is pending. A NIMCache is ready to use once it has downloaded the profiles, and there can be profiles in the cache that the user wants to run without building them. NIMBuild is an optional action performed on the cache. Adding details of currently running NIMBuilds would be mostly redundant, because the relationship between a NIMCache and its NIMBuilds can also be obtained by querying the NIMBuild resources. We add the details of successful builds because those profiles are part of the cache.
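The readiness rule described here can be modeled in a few lines. This is an illustrative sketch, not the operator's actual types: the cache is Ready as soon as its profiles are downloaded, pending NIMBuilds are intentionally ignored, and only completed builds land on the cache status.

```go
package main

import "fmt"

// CacheStatus is a toy model of the NIMCache status discussed above.
type CacheStatus struct {
	ProfilesDownloaded bool
	BuiltEngines       []string // engines recorded from completed NIMBuilds
}

// cacheReady returns whether the cache is usable. Pending builds do not
// gate readiness: NIMBuild is an optional action layered on top of an
// already-usable cache.
func cacheReady(s CacheStatus, pendingBuilds int) bool {
	_ = pendingBuilds // deliberately unused
	return s.ProfilesDownloaded
}

func main() {
	s := CacheStatus{ProfilesDownloaded: true}
	fmt.Println(cacheReady(s, 2)) // true even with builds in flight
}
```

Keeping the rule this simple is the design point: a NIMService consumer never has to reason about in-flight builds, only about what the cache already contains.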

@varunrsekar
Collaborator

Follow-up: How do we account for a NIMService referencing a NIMCache for a buildable profile that is currently being built via a NIMBuild CR? In NIMService, we only check if the NIMCache is Ready before starting, but in this case we'd likely need more fine-grained checks on the profile status. And thinking crudely, we'd probably need the nimcache status to reflect the state of new profiles being built.

We cannot block NIMService from using a NIMCache while a NIMBuild action is pending. A NIMCache is ready to use once it has downloaded the profiles, and there can be profiles in the cache that the user wants to run without building them. NIMBuild is an optional action performed on the cache. Adding details of currently running NIMBuilds would be mostly redundant, because the relationship between a NIMCache and its NIMBuilds can also be obtained by querying the NIMBuild resources. We add the details of successful builds because those profiles are part of the cache.

So I was thinking of this scenario:

  • A NIMService is created without specifying any model profiles, on a node that has no matching optimized profiles.
  • The NIM would automatically choose the buildable profile for the node's GPU type and attempt to build it; with insufficient resources, the NIMService would go to a Failed state.
  • In parallel, a NIMBuild is running to build the profile for the node's GPU type.
  • Once the NIMBuild completes, if the exact same NIMService spec is attempted, this time it will go to a Ready state.

I was thinking that this might cause confusion. But I agree that we shouldn't complicate the design and keep your current expectation. Thanks for the clarification.
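The scenario above can be sketched as a toy state function (names and parameters are illustrative, not operator code): the same NIMService spec yields Failed or Ready depending only on whether a matching engine exists yet for the node's GPU type, or the node can build one in place.

```go
package main

import "fmt"

// serviceOutcome models the reviewer's scenario: a NIMService using a
// buildable profile reaches Ready only if an engine for the node's GPU
// type already exists, or the node has resources to build it in place.
func serviceOutcome(builtEngines map[string]bool, gpuType string, canBuildInPlace bool) string {
	if builtEngines[gpuType] || canBuildInPlace {
		return "Ready"
	}
	return "Failed"
}

func main() {
	engines := map[string]bool{}
	// Before the parallel NIMBuild finishes, with insufficient resources:
	fmt.Println(serviceOutcome(engines, "H100", false)) // Failed
	// After the NIMBuild for the node's GPU type completes:
	engines["H100"] = true
	fmt.Println(serviceOutcome(engines, "H100", false)) // Ready
}
```

This makes the "same spec, different outcome" behavior explicit: the spec is unchanged between the two attempts; only the cache's set of built engines has grown.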

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
@visheshtanksale visheshtanksale merged commit ccff6e0 into NVIDIA:main Jun 27, 2025
9 checks passed
