Release workload on a non-MNNVL node in a CD #646

Merged
jgehrcke merged 10 commits into kubernetes-sigs:main from jgehrcke:jp/no-clique-update-cd-node-status
Oct 11, 2025
Conversation

Contributor

@jgehrcke jgehrcke commented Oct 7, 2025

Resolves #625.

Slightly less trivial than I thought.

  • Needs a special case for maxNodesPerIMEXDomain, covering the group of non-MNNVL nodes (which can become large)
  • Always needs injection of CD details into CD daemon container (via CDI)

The group of nodes that does not perform MNNVL may be large. With this change, all of these nodes will contribute to the update-with-retry-on-conflict race, which is important to consider. I added a few code comments along the way.

I tested this manually, with this patch to simulate the scenario:

--- a/cmd/compute-domain-kubelet-plugin/nvlib.go
+++ b/cmd/compute-domain-kubelet-plugin/nvlib.go
@@ -197,6 +197,9 @@ func (l deviceLib) enumerateComputeDomainDaemons(config *Config) (AllocatableDev
 }
 
 func (l deviceLib) getCliqueID() (string, error) {
+
+       return "", nil
+

CD daemon log excerpt, from start-to-shutdown:

± kubectl logs -n nvidia-dra-driver-gpu   imex-channel-injection-p7m8n-2ghhj
I1007 11:24:04.713698       1 main.go:200] config: &{gb-nvl-043-compute06 49426eaf-63f5-4610-8928-306a99de8777 imex-channel-injection default  192.168.35.115 imex-channel-injection-p7m8n-2ghhj nvidia-dra-driver-gpu 18}
I1007 11:24:04.713783       1 main.go:209] no cliqueID: register with ComputeDomain, but do not run IMEX daemon
I1007 11:24:04.714026       1 dnsnames.go:199] Created static nodes config file with 18 DNS names using format compute-domain-daemon-%d
[...]
I1007 11:24:04.715152       1 process.go:175] Start watchdog
I1007 11:24:04.715185       1 main.go:338] wait for nodes update
I1007 11:24:04.715330       1 reflector.go:358] "Starting reflector" type="*v1beta1.ComputeDomain" resyncPeriod="10m0s" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
I1007 11:24:04.715346       1 reflector.go:404] "Listing and watching" type="*v1beta1.ComputeDomain" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
[...]
I1007 11:24:04.916398       1 computedomain.go:266] CD status does not contain node name 'gb-nvl-043-compute06' yet, try to insert myself: &{gb-nvl-043-compute06   0 NotReady}
[...]
I1007 11:24:04.919998       1 computedomain.go:288] Successfully updated CD
[...]
I1007 11:24:04.920268       1 main.go:350] empty cliqueID: do not start IMEX daemon
[...]
I1007 11:24:06.092166       1 podmanager.go:194] Successfully updated node gb-nvl-043-compute06 status to Ready
[...]
I1007 11:58:36.795592       1 main.go:89] Received SIGTERM, initiate shutdown
I1007 11:58:36.795671       1 main.go:341] shutdown: stop IMEXDaemonUpdateLoopWithDNSNames
I1007 11:58:36.795696       1 main.go:278] Terminated: IMEX daemon update task
I1007 11:58:36.795690       1 process.go:179] Watchdog: context canceled, attempt to stop child process
E1007 11:58:36.795727       1 main.go:288] watch failed, initiate shutdown: watchdog: stop failed: pm: stop failed: not started
I1007 11:58:36.795737       1 main.go:291] Terminated: process manager
[...]
I1007 11:58:36.803227       1 main.go:257] Terminated: controller task
I1007 11:58:36.803241       1 main.go:297] Exiting

With the getCliqueID() patch removed, I saw a passing test suite:

15 tests, 0 failures in 118 seconds

}

// Update node info in ComputeDomain
// Why are we doing this (only) upon receiving another update?
Contributor Author

I think a question I had here really is:

If everyone only sends their own update after receiving an update -- who performs the initial update to start the cascade?

Contributor

onAddOrUpdate() gets called on any addition / update of the CD. When the informer first comes online it triggers an "add" on all of the existing objects that it receives from the API server. This is the initial callback that allows this component to insert its NodeInfo in the Status.
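
To illustrate the answer above, here is a toy model of that start-up behavior. It is not client-go: `toyInformer`, `object`, and `start` are hypothetical names made up for this sketch. It only shows that a handler receives one "add" per pre-existing object when the informer starts, so no other component has to write first.

```go
package main

import "fmt"

// object stands in for a ComputeDomain; only the name matters here.
type object struct{ name string }

// toyInformer is a hypothetical, heavily simplified stand-in for a
// shared informer. It is NOT the client-go API.
type toyInformer struct {
	existing []object     // objects already present when the informer starts
	onAdd    func(object) // handler invoked for each addition
}

// start mimics the initial list: every pre-existing object is reported
// to the handler as an "add". It returns the number of events delivered.
func (i *toyInformer) start() int {
	n := 0
	for _, o := range i.existing {
		i.onAdd(o)
		n++
	}
	return n
}

func main() {
	var seen []string
	inf := &toyInformer{
		existing: []object{{name: "imex-channel-injection"}},
		onAdd:    func(o object) { seen = append(seen, o.name) },
	}
	// The handler fires once although nothing "updated" the object.
	fmt.Println(inf.start(), seen)
}
```

In this model the first callback is what would give a daemon the chance to insert its own node entry; later updates by other nodes simply trigger further callbacks.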

Contributor Author

When the informer first comes online it triggers an "add" on all of the existing objects that it receives from the API server.

Thank you! (I now remember that you've described this before)

Contributor Author

Added a commit that removes this comment (I clarified this in the comment for onAddOrUpdate instead, so that I won't forget again so quickly).

Comment thread cmd/compute-domain-daemon/computedomain.go Outdated
// Try to write update. May be in conflict with updates to the same list
// coming in from other nodes (in which case this update will be retried
// later). TODO: review conflict resolution methodology for the
// 10000-nodes-CD use case.
Contributor Author

@jgehrcke jgehrcke Oct 7, 2025

How many update attempts overall does it take along the happy path for say N=1000 nodes each trying to insert themselves at about the same time? The individual node has to expect a large number of conflict errors E before its own update succeeds. If E=100 (just making something up, this could be larger) then we make N*E=100000 update attempts in total. Is that a correct consideration? If it is, we should think about finding a smarter strategy that scales better with N.

Contributor

After each attempt there is an exponential backoff with some random jitter to stagger attempts from different nodes. We also (re)read the CD object just before attempting to write, in order to make the window as small as possible where a conflict might slip in. We would need to measure it, but I don't think the number of average conflicts would be nearly as large as you are suggesting.
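
The retry strategy described above can be sketched with a toy, in-memory model. Everything here is an assumption made for illustration: `store` stands in for the API server, `insertSelf` for a node's update loop, and the base delay and cap are arbitrary. Unlike the real daemon, which reads from the informer cache (and may therefore see a lagging copy), this sketch re-reads directly from the store.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

var errConflict = errors.New("conflict: object was modified")

// store is a toy stand-in for the API server holding one CD status.
type store struct {
	mu       sync.Mutex
	version  int      // plays the role of a resource version
	nodes    []string // node names registered so far
	attempts int      // total write attempts, successful or not
}

// read returns the current version and a copy of the node list.
func (s *store) read() (int, []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.version, append([]string(nil), s.nodes...)
}

// update succeeds only if nobody wrote since readVersion was read.
func (s *store) update(readVersion int, nodes []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.attempts++
	if readVersion != s.version {
		return errConflict
	}
	s.version++
	s.nodes = nodes
	return nil
}

// insertSelf re-reads right before each write and, on conflict, sleeps
// for an exponentially growing delay plus random jitter.
func insertSelf(s *store, name string, base time.Duration) {
	backoff := base
	for {
		v, nodes := s.read()
		if err := s.update(v, append(nodes, name)); err == nil {
			return
		}
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		if backoff < time.Millisecond { // arbitrary cap for this sketch
			backoff *= 2
		}
	}
}

// simulate lets n nodes register concurrently and reports how many
// ended up in the list and how many write attempts that took overall.
func simulate(n int) (registered, attempts int) {
	s := &store{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			insertSelf(s, fmt.Sprintf("node-%d", i), 10*time.Microsecond)
		}(i)
	}
	wg.Wait()
	return len(s.nodes), s.attempts
}

func main() {
	registered, attempts := simulate(50)
	fmt.Printf("registered=%d attempts=%d\n", registered, attempts)
}
```

The attempt count varies from run to run. The numbers from such a toy say nothing about real API server behavior, which is why measuring remains the productive approach.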

Contributor Author

We would need to measure it

Yes, that's the only productive approach!

We also (re)read the CD object just before attempting to write

Yes! But we understand this doesn't necessarily yield the latest -- we read it from the informer stream/cache (which might be subject to lag, right?).

After each attempt there is an exponential backoff with some random jitter to stagger attempts from different nodes.

I understand :)

I also do think we agree, and I am not planning to change anything here or criticize what we have.

I am merely trying to rationally reason through current scaling behavior.

In terms of that scaling behavior, we want to understand not only how the number of global write attempts scales with the number of nodes N (that might be a measure for API server load generated).

But (maybe even more importantly) we also want to understand how the overall convergence duration D scales with N. And we probably eventually need to set ourselves a goal (as in: an upper bound that we find tolerable), and assess on that basis if we need to consider a more sophisticated approach.

nearly as large as you are suggesting.

I want to say: I did not suggest E=100, I was mainly suggesting somewhat linear scaling behavior with N. That might even be best case; the convergence duration might after all scale non-linearly (say, with O(N^2) or so) -- precisely because of the applied back-off during retrying.

We will see.

Supporting "Uber-sized Compute Domains" will be fun in that regard!
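
One way to make the scaling question concrete is a deliberately pessimistic lockstep model. This is an assumption for illustration only, not a measurement and not how the driver behaves with jitter and back-off: in every round all nodes that are still missing attempt one write, and exactly one of them succeeds. `lockstepAttempts` is a hypothetical helper name.

```go
package main

import "fmt"

// lockstepAttempts counts write attempts in the pessimistic toy model:
// each round, every node still missing from the list tries once and
// exactly one succeeds. That takes n rounds and n*(n+1)/2 attempts.
func lockstepAttempts(n int) int {
	total := 0
	for remaining := n; remaining > 0; remaining-- {
		total += remaining // everyone still missing tries once this round
	}
	return total
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("N=%d attempts=%d\n", n, lockstepAttempts(n))
	}
}
```

In this model the number of rounds grows linearly with N while the total number of attempts grows quadratically (500500 for N=1000). Staggered retries are meant to do better than that, but by how much is exactly what would need to be measured.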

Contributor Author

removed the comment as part of rebasing on top of head of main -- thanks for discussion

@jgehrcke jgehrcke self-assigned this Oct 7, 2025
@jgehrcke jgehrcke added the kind/bug Categorizes issue or PR as related to a bug. label Oct 7, 2025
@jgehrcke jgehrcke added this to the v25.8.0 milestone Oct 7, 2025
@jgehrcke jgehrcke requested a review from klueska October 7, 2025 12:07
@jgehrcke jgehrcke added the robustness issue/pr: edge cases & fault tolerance label Oct 7, 2025
Contributor Author

jgehrcke commented Oct 7, 2025

@klueska thanks for the review. I responded to comments and added commits. Can you have another look? Before merge, I would squash commits.

@jgehrcke jgehrcke force-pushed the jp/no-clique-update-cd-node-status branch 2 times, most recently from e6b364c to 1a259f4 Compare October 8, 2025 15:21
Contributor Author

jgehrcke commented Oct 8, 2025

I have rebased this branch on the current head of main, resolved conflicts, and also squashed commits.

To be sure the merge-to-main wouldn't be bad, I just tested this again manually with

--- a/cmd/compute-domain-kubelet-plugin/nvlib.go
+++ b/cmd/compute-domain-kubelet-plugin/nvlib.go
@@ -197,6 +197,7 @@ func (l deviceLib) enumerateComputeDomainDaemons(config *Config) (AllocatableDev
 }
 
 func (l deviceLib) getCliqueID() (string, error) {
+       return "", nil

Works (CD daemon log):

...
15:32:24 ± kubectl logs -n nvidia-dra-driver-gpu   imex-channel-injection-rdx24-k9qg5
I1008 15:32:22.570420       1 main.go:200] config: &{gb-nvl-043-compute07 91d00090-848c-4953-8a25-ffe23d409e90 imex-channel-injection default  192.168.34.186 imex-channel-injection-rdx24-k9qg5 nvidia-dra-driver-gpu 18}
I1008 15:32:22.570486       1 main.go:209] no cliqueID: register with ComputeDomain, but do not run IMEX daemon
...
I1008 15:32:24.173219       1 podmanager.go:201] Successfully updated node gb-nvl-043-compute07 status to Ready
...

Workload got released:

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
imex-channel-injection   1/1     Running   0          3m20s

This is of course half-broken because we didn't perform CDI-based injection:

$ kubectl  logs imex-channel-injection 
ls: cannot access '/dev/nvidia-caps-imex-channels': No such file or directory

Let's land this now (unless there's a rough problem in here). Let's do all non-functional changes in a follow-up. Specifically, let us sort out finding a better encapsulation for the non-imex-daemon-mode later.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
@jgehrcke jgehrcke force-pushed the jp/no-clique-update-cd-node-status branch from 1a259f4 to c5636d9 Compare October 10, 2025 11:29
Comment thread cmd/compute-domain-daemon/main.go Outdated
Comment on lines +214 to +216
if err := writeIMEXConfig(flags.podIP); err != nil {
return fmt.Errorf("writeIMEXConfig failed: %w", err)
}
Contributor

Add this to the function to avoid this else statement (and just always do this unconditionally):

// writeIMEXConfig renders the config template with the pod IP and writes it to the final config file.
func writeIMEXConfig(podIP string) error {
        // Ensure the directory exists
        dir := filepath.Dir(nodesConfigPath)
        if err := os.MkdirAll(dir, 0755); err != nil {
                return fmt.Errorf("failed to create directory %s: %w", dir, err)
        }
...
}

Contributor Author

Done, now calling os.MkdirAll() in writeIMEXConfig().

Contributor Author

After these changes, I once again needed to provoke and test that scenario (due diligence), by mocking the cliqueID acquisition function so that it returns an empty string.

Glad I did, because the change above resulted in:

Error: writeIMEXConfig failed: error parsing template file: open /etc/nvidia-imex/config.tmpl.cfg: no such file or directory

The problem with not mounting in /etc/nvidia-imex is not just "oh this dir doesn't exist".

The problem is: not mounting in the template to render the config file from.

But mounting that in requires writing it to the host filesystem in the first place. Otherwise:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/var/lib/kubelet/plugins/compute-domain.nvidia.com/domains/885e0f1b-f319-40dd-81c6-15ae6c690130" to rootfs at "/etc/nvidia-imex": create mount destination for /etc/nvidia-imex mount: bind mount source stat: stat /var/lib/kubelet/plugins/compute-domain.nvidia.com/domains/885e0f1b-f319-40dd-81c6-15ae6c690130: no such file or directory: unknown

I have now added more changes to this PR to always write this IMEX daemon configuration directory tree to the host, and to always mount it in (to the CD daemon). I think it makes sense!

Now, the only thing we skip in non-IMEX mode is the injection of IMEX-related device nodes.

So, with this change, we've created even more symmetry between non-IMEX and IMEX mode.

Specifically, we now always create

/var/lib/kubelet/plugins/compute-domain.nvidia.com/domains/<domainID>

on the host (that's the mount source).

One might argue that this host filesystem interaction is superfluous and only introduces more failure paths -- because if this fails, we can't even launch the CD daemon in 'pretend mode'. But it's very unlikely to ever hurt in practice.

I do believe that for now we should cherish the simplicity of "let's do everything except for actually spawning the IMEX daemon process".
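
The resulting symmetry between the two modes can be summarized in a small sketch. The step names and their order are paraphrased from this discussion and are assumptions, not identifiers from the code base; `stepsFor` is a hypothetical helper.

```go
package main

import "fmt"

// stepsFor lists what happens for a CD daemon depending on the clique
// ID. Step names are paraphrased from the PR discussion (assumption).
func stepsFor(cliqueID string) []string {
	// Common to both modes.
	steps := []string{
		"write IMEX config directory tree on the host",
		"mount config directory into the CD daemon container",
		"inject CD details (env) via CDI",
		"register node in ComputeDomain status",
	}
	// Only with a non-empty clique ID do we go all the way.
	if cliqueID != "" {
		steps = append(steps,
			"inject IMEX-related device nodes",
			"start IMEX daemon process",
		)
	}
	return steps
}

func main() {
	for _, id := range []string{"", "example-clique"} {
		fmt.Printf("cliqueID=%q:\n", id)
		for _, s := range stepsFor(id) {
			fmt.Println("  -", s)
		}
	}
}
```

Keeping the common steps unconditional is what makes the empty-cliqueID path a strict subset of the regular one.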

Contributor

@klueska klueska left a comment

Looking good, just a few small changes

Contributor Author

  Error: cmd/compute-domain-kubelet-plugin/device_state.go:496:14: `emtpy` is a misspelling of `empty` (misspell)
  	// ID being emtpy or not).
  	            ^
  1 issues:
  * misspell: 1

Of course. Time to get a spell check into my IDE, too.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

Break out of select/case, MkdirAll() before writing file

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

Rename 'domain' to 'domainID'

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

squash: review feedback

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

shorten comment

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
@jgehrcke jgehrcke force-pushed the jp/no-clique-update-cd-node-status branch from e534ed3 to 02ff3c0 Compare October 10, 2025 16:50
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

CD plugin: always prepare IMEX config on the host and mount it in

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
@jgehrcke jgehrcke force-pushed the jp/no-clique-update-cd-node-status branch from 02ff3c0 to f7e4a45 Compare October 10, 2025 17:15
Contributor Author

jgehrcke commented Oct 10, 2025

I've updated the PR with the changes discussed, and made more adjustments described here to further increase symmetry between "non-IMEX" and "actual-IMEX" modes (for lack of better words). I've also worked on cleaning up the commit history a bit (squashed what made sense). I performed another manual test for cliqueID == "" and saw the desired behavior. I think I'd still like to land this tonight or on the weekend so that we can do a bit of QAing before next week.

Comment thread cmd/compute-domain-kubelet-plugin/computedomain.go Outdated
fmt.Sprintf("CLIQUE_ID=%s", s.manager.cliqueID),
fmt.Sprintf("COMPUTE_DOMAIN_UUID=%s", cd.UID),
fmt.Sprintf("COMPUTE_DOMAIN_NAME=%s", cd.Name),
fmt.Sprintf("COMPUTE_DOMAIN_NAMESPACE=%s", cd.Namespace),
Contributor Author

note to self: we could have made the diff smaller by moving Env up again -- but that's okay.

}
// Always inject CD config details into the CD daemon (regardless of clique
// ID being empty or not).
edits, err := computeDomainDaemonSettings.GetCommonCDIContainerEdits(ctx)
Contributor Author

nit for symmetry, this could be called GetCDIContainerEditsCommon instead of GetCommonCDIContainerEdits so that we have:

GetCDIContainerEditsForImex
GetCDIContainerEditsCommon

Personally, I like grouping by prefix.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

make diff smaller, rename func

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
@jgehrcke jgehrcke force-pushed the jp/no-clique-update-cd-node-status branch from de09926 to 6e56823 Compare October 11, 2025 11:26
Contributor Author

Great discussion and feedback.

Post-approval, I took the liberty of adding this cosmetic cleanup patch (squashed into the last commit on this branch):

From de09926e03d39322c7de47ecafcc2da460671283 Mon Sep 17 00:00:00 2001
From: "Dr. Jan-Philip Gehrcke" <jgehrcke@nvidia.com>
Date: Sat, 11 Oct 2025 11:25:07 +0000
Subject: [PATCH] make diff smaller, rename func

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
---
 .../computedomain.go                             | 16 ++++++++--------
 .../device_state.go                              |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/cmd/compute-domain-kubelet-plugin/computedomain.go b/cmd/compute-domain-kubelet-plugin/computedomain.go
index 76baba8d..5ea94db1 100644
--- a/cmd/compute-domain-kubelet-plugin/computedomain.go
+++ b/cmd/compute-domain-kubelet-plugin/computedomain.go
@@ -154,10 +154,10 @@ func (m *ComputeDomainManager) GetComputeDomainChannelContainerEdits(devRoot str
 	}
 }
 
-// GetCommonCDIContainerEdits() returns the CDI spec edits always required for
+// GetCDIContainerEditsCommon() returns the CDI spec edits always required for
 // launching the CD Daemon (whether or not it tries to launch an IMEX daemon
 // internally).
-func (s *ComputeDomainDaemonSettings) GetCommonCDIContainerEdits(ctx context.Context) (*cdiapi.ContainerEdits, error) {
+func (s *ComputeDomainDaemonSettings) GetCDIContainerEditsCommon(ctx context.Context) (*cdiapi.ContainerEdits, error) {
 	cd, err := s.manager.GetComputeDomain(ctx, s.domainID)
 	if err != nil {
 		return nil, fmt.Errorf("error getting compute domain %s: %w", s.domainID, err)
@@ -168,6 +168,12 @@ func (s *ComputeDomainDaemonSettings) GetCommonCDIContainerEdits(ctx context.Con
 
 	edits := &cdiapi.ContainerEdits{
 		ContainerEdits: &cdispec.ContainerEdits{
+			Env: []string{
+				fmt.Sprintf("CLIQUE_ID=%s", s.manager.cliqueID),
+				fmt.Sprintf("COMPUTE_DOMAIN_UUID=%s", cd.UID),
+				fmt.Sprintf("COMPUTE_DOMAIN_NAME=%s", cd.Name),
+				fmt.Sprintf("COMPUTE_DOMAIN_NAMESPACE=%s", cd.Namespace),
+			},
 			Mounts: []*cdispec.Mount{
 				{
 					ContainerPath: "/etc/nvidia-imex",
@@ -175,12 +181,6 @@ func (s *ComputeDomainDaemonSettings) GetCommonCDIContainerEdits(ctx context.Con
 					Options:       []string{"rw", "nosuid", "nodev", "bind"},
 				},
 			},
-			Env: []string{
-				fmt.Sprintf("CLIQUE_ID=%s", s.manager.cliqueID),
-				fmt.Sprintf("COMPUTE_DOMAIN_UUID=%s", cd.UID),
-				fmt.Sprintf("COMPUTE_DOMAIN_NAME=%s", cd.Name),
-				fmt.Sprintf("COMPUTE_DOMAIN_NAMESPACE=%s", cd.Namespace),
-			},
 		},
 	}
 	return edits, nil
diff --git a/cmd/compute-domain-kubelet-plugin/device_state.go b/cmd/compute-domain-kubelet-plugin/device_state.go
index 9f82e866..d6acf3e4 100644
--- a/cmd/compute-domain-kubelet-plugin/device_state.go
+++ b/cmd/compute-domain-kubelet-plugin/device_state.go
@@ -506,7 +506,7 @@ func (s *DeviceState) applyComputeDomainDaemonConfig(ctx context.Context, config
 
 	// Always inject CD config details into the CD daemon (regardless of clique
 	// ID being empty or not).
-	edits, err := computeDomainDaemonSettings.GetCommonCDIContainerEdits(ctx)
+	edits, err := computeDomainDaemonSettings.GetCDIContainerEditsCommon(ctx)
 	if err != nil {
 		return nil, fmt.Errorf("error getting common container edits for ComputeDomain daemon '%s': %w", config.DomainID, err)
 	}

@jgehrcke jgehrcke merged commit a79a9fd into kubernetes-sigs:main Oct 11, 2025
7 checks passed
asymingt added a commit to asymingt/k8s-dra-driver-gpu that referenced this pull request Nov 14, 2025

    Merge pull request kubernetes-sigs#676 from jgehrcke/jp/curl-retry-tcp-rst

    build: retry TCP RST when curling bash source

commit 65cd2c5
Merge: b3f4e07 c40b44b
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 15 16:59:35 2025 +0200

    Merge pull request kubernetes-sigs#677 from jgehrcke/jp/test-abort-on-failure

    tests: abort suite on first failure, misc

commit c40b44b
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 15 11:24:06 2025 +0000

    tests: adjust readme

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 6e783bf
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 15 10:57:25 2025 +0000

    tests: rundir in /tmp (too much cruft in home dir)

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit dafa4f5
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 15 11:15:32 2025 +0000

    tests: merge two simple tests into one

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit c14c2ef
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 15 11:14:09 2025 +0000

    tests: add on_failure hook to emit debug info

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 89bb88a
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 15 11:12:22 2025 +0000

    tests: use new --abort flag for bats (fail suite fast)

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit f8ace2e
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 15 10:42:05 2025 +0000

    build: retry TCP RST when curling bash source

    Error seen:

    curl: (7) Failed to connect to mirror.cs.odu.edu port 443 after 306 ms: Connection refused

    By default, a TCP connection rejection (RST) is not treated
    by curl as a transient error, see

    https://curl.se/docs/manpage.html#--retry-connrefused

    It's a transient error in the sense that it's often
    a way to implement backpressure. We retry at a slow rate.

    `--retry-all-errors` is what we want here; it includes
    the behavior of `--retry-connrefused`.

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
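
The commit above changes a curl invocation in the build. As an illustration only, the same retry policy (treat "connection refused" as transient, retry slowly, give up on permanent errors) can be sketched in Go. All names here (`retry`, `isTransient`, `failNTimes`, `errRefused`, `errPermanent`) are hypothetical and not part of this repository.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
	"time"
)

// errRefused simulates what a dialer returns when the peer answers with RST.
var errRefused = fmt.Errorf("connect: %w", syscall.ECONNREFUSED)

// errPermanent simulates a failure that retrying cannot fix.
var errPermanent = errors.New("malformed URL")

// isTransient treats "connection refused" as retryable, similar in spirit
// to curl's --retry-connrefused.
func isTransient(err error) bool {
	return errors.Is(err, syscall.ECONNREFUSED)
}

// retry runs op up to maxAttempts times, pausing delay between attempts.
// It stops early on success or on a non-transient error and returns the
// number of attempts made together with the last error.
func retry(maxAttempts int, delay time.Duration, op func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		if err == nil || !isTransient(err) {
			return attempt, err
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
		}
	}
	return maxAttempts, err
}

// failNTimes returns an operation that fails n times with failWith and
// succeeds afterwards (a stand-in for a flaky download).
func failNTimes(n int, failWith error) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return failWith
		}
		return nil
	}
}

func main() {
	attempts, err := retry(5, 50*time.Millisecond, failNTimes(2, errRefused))
	fmt.Println(attempts, err) // 3 <nil>

	attempts, err = retry(5, 50*time.Millisecond, failNTimes(2, errPermanent))
	fmt.Println(attempts, err) // 1 malformed URL
}
```

The fixed delay keeps the retry rate low, which matters when the refusal is the remote side's way of applying backpressure.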

commit b3f4e07
Merge: ab5a2b3 4e5cdf2
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 15 12:31:08 2025 +0200

    Merge pull request kubernetes-sigs#669 from NVIDIA/dependabot/go_modules/main/google.golang.org/grpc-1.76.0

    build(deps): bump google.golang.org/grpc from 1.75.1 to 1.76.0

commit ab5a2b3
Merge: 23ccbd2 803a35a
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 14 19:56:45 2025 +0200

    Merge pull request kubernetes-sigs#675 from NVIDIA/dependabot/docker/deployments/devel/main/golang-1.25.3

    build(deps): bump golang from 1.25.2 to 1.25.3 in /deployments/devel

commit 803a35a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Oct 14 17:16:27 2025 +0000

    build(deps): bump golang from 1.25.2 to 1.25.3 in /deployments/devel

    Bumps golang from 1.25.2 to 1.25.3.

    ---
    updated-dependencies:
    - dependency-name: golang
      dependency-version: 1.25.3
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

commit 23ccbd2
Merge: 83b8249 9d02cea
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 14 17:30:58 2025 +0200

    Merge pull request kubernetes-sigs#672 from jgehrcke/jp/periodic-cleanup-partially-prepared-rcs

    CD kubelet plugin: add state reconciliation for partially prepared claims

commit 9d02cea
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Mon Oct 13 13:03:11 2025 +0000

    tests: cover cleanup for stale partially prepared claims

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit f7a3310
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Sun Oct 12 22:06:38 2025 +0000

    CD plugin: handle stale partially prepared claims

    Add a fundamentally required state reconciliation:

    Periodically, perform a self-initiated Unprepare() of previously
    partially prepared claims.

    Perform periodically:

    - Read checkpoint
    - Iterate through RCs in PrepareStarted state
    - For each: is the RC still known to the API server?
        If not:
          1) Initiate an Unprepare()
          2) Remove the RC from the checkpoint file if the Unprepare() was successful

    Relevance:

    Unpreparing any partially performed claim preparation might revert
    a state mutation that would otherwise be permanently inconsistent with
    API server state (e.g., this could remove a node label).

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
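
The reconciliation pass described in the commit message above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the plugin's code: the `checkpoint` map, the `claimState` constants, and the `knownToAPIServer` and `unprepare` callbacks are hypothetical stand-ins for the real checkpoint file, API server lookup, and Unprepare() call.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// claimState is a hypothetical, simplified view of a checkpointed claim state.
type claimState string

const (
	prepareStarted   claimState = "PrepareStarted"
	prepareCompleted claimState = "PrepareCompleted"
)

// checkpoint maps a ResourceClaim UID to its last recorded state.
type checkpoint map[string]claimState

// errUnprepare simulates a failing Unprepare().
var errUnprepare = errors.New("unprepare failed")

// reconcile performs one cleanup pass: every claim that is still in
// PrepareStarted state but no longer known to the API server is unprepared
// and, only if that succeeded, dropped from the checkpoint. It returns the
// sorted UIDs that were removed.
func reconcile(
	cp checkpoint,
	knownToAPIServer func(uid string) bool,
	unprepare func(uid string) error,
) []string {
	removed := []string{}
	for uid, state := range cp {
		if state != prepareStarted || knownToAPIServer(uid) {
			continue
		}
		if err := unprepare(uid); err != nil {
			// Keep the entry so that the next periodic pass retries.
			continue
		}
		delete(cp, uid)
		removed = append(removed, uid)
	}
	sort.Strings(removed)
	return removed
}

func main() {
	cp := checkpoint{
		"claim-a": prepareStarted,   // stale: gone from the API server
		"claim-b": prepareStarted,   // still known to the API server
		"claim-c": prepareCompleted, // fully prepared: not touched
		"claim-d": prepareStarted,   // stale, but Unprepare() fails
	}
	known := func(uid string) bool { return uid == "claim-b" || uid == "claim-c" }
	unprepare := func(uid string) error {
		if uid == "claim-d" {
			return errUnprepare
		}
		return nil
	}
	fmt.Println(reconcile(cp, known, unprepare), len(cp)) // [claim-a] 3
}
```

Keeping a claim in the checkpoint when its Unprepare() fails means the next periodic pass retries it.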

commit 83b8249
Merge: 5235bed e22cdba
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 14 15:11:35 2025 +0200

    Merge pull request kubernetes-sigs#674 from jgehrcke/jp/use-custom-config-dir-for-daemon

    CD daemon: /imexd instead of /etc/nvidia-imex

commit e22cdba
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 14 07:15:09 2025 +0000

    CD daemon: /imexd instead of /etc/nvidia-imex

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 5235bed
Merge: 7b5e2cd aa15924
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 14 12:50:05 2025 +0200

    Merge pull request kubernetes-sigs#658 from jgehrcke/jp/log-full-component-config-on-startup

    Log full startup config in all CLIs in `Before` hook

commit aa15924
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Sat Oct 11 21:09:36 2025 +0000

    tests: confirm startup config logged on lvl 0

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit e2ea590
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Mon Sep 29 13:24:00 2025 +0000

    Introduce LogStartupConfig(), use in all CLIs in Before() hook

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 4e5cdf2
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Oct 13 08:53:56 2025 +0000

    build(deps): bump google.golang.org/grpc from 1.75.1 to 1.76.0

    Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.75.1 to 1.76.0.
    - [Release notes](https://github.com/grpc/grpc-go/releases)
    - [Commits](grpc/grpc-go@v1.75.1...v1.76.0)

    ---
    updated-dependencies:
    - dependency-name: google.golang.org/grpc
      dependency-version: 1.76.0
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

commit 7b5e2cd
Merge: a1d2fd7 11f6c02
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Mon Oct 13 10:37:08 2025 +0200

    Merge pull request kubernetes-sigs#670 from NVIDIA/dependabot/go_modules/main/golang.org/x/time-0.14.0

    build(deps): bump golang.org/x/time from 0.9.0 to 0.14.0

commit a1d2fd7
Merge: c614e61 6b2af09
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Mon Oct 13 10:32:50 2025 +0200

    Merge pull request kubernetes-sigs#671 from NVIDIA/dependabot/go_modules/main/github.com/NVIDIA/nvidia-container-toolkit-1.18.0-rc.6

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit from 1.18.0-rc.5 to 1.18.0-rc.6

commit 6b2af09
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Oct 12 17:02:23 2025 +0000

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit

    Bumps [github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) from 1.18.0-rc.5 to 1.18.0-rc.6.
    - [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases)
    - [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/CHANGELOG.md)
    - [Commits](NVIDIA/nvidia-container-toolkit@v1.18.0-rc.5...v1.18.0-rc.6)

    ---
    updated-dependencies:
    - dependency-name: github.com/NVIDIA/nvidia-container-toolkit
      dependency-version: 1.18.0-rc.6
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

commit 11f6c02
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Oct 12 17:02:18 2025 +0000

    build(deps): bump golang.org/x/time from 0.9.0 to 0.14.0

    Bumps [golang.org/x/time](https://github.com/golang/time) from 0.9.0 to 0.14.0.
    - [Commits](golang/time@v0.9.0...v0.14.0)

    ---
    updated-dependencies:
    - dependency-name: golang.org/x/time
      dependency-version: 0.14.0
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

commit c614e61
Merge: a79a9fd 4ced422
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Sat Oct 11 16:54:20 2025 +0200

    Merge pull request kubernetes-sigs#633 from jgehrcke/jp/verbosity-vs-debuggability-improvements

    Add `logVerbosity` Helm chart parameter, reduce default log verbosity

commit 4ced422
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Sat Oct 11 14:45:52 2025 +0000

    Remove newline, document env-based log verb flip

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 4cf3d9b
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Fri Oct 10 17:57:00 2025 +0000

    Fix a typo in an error message

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 2c943f7
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Sat Oct 11 13:48:35 2025 +0000

    tests: remove sinful duplicate env strategy

    This also had a side effect on subsequent tests, with the
    controller starting with _no_ LOG_VERBOSITY environment variable
    set. I don't understand that, but that must be a funky Helm-ism.

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 3d5c51f
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Sat Oct 11 13:01:21 2025 +0000

    tests: fix: wait for controller flip

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit b172342
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Sat Oct 11 12:21:07 2025 +0000

    tests: replace hard-coded sleep with dynamic wait

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 9748095
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Fri Oct 10 19:36:24 2025 +0000

    tests: cover CD daemon log levels

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 4767092
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Thu Oct 9 11:38:40 2025 +0000

    Helm logVerbosity param: add docs, start building tests

    Helm values.yaml: defaultLogVerbosity incl. docs

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    values.yaml: tweak, based on log level insights

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    improve helm chart artifact commentary

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    squash: tweak docs

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    Rename chart var, start building tests

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    tests: cover log verbosity set per-component via env

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    helm: rename defaultLogVerbosity to logVerbosity

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 3828da9
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Fri Oct 10 18:28:50 2025 +0000

    CD daemon: change verbosity of "wait for nodes update" message

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 6d35ac1
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Fri Oct 10 17:56:31 2025 +0000

    CD controller: make CD daemon verbosity a required arg

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 84530ab
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Thu Oct 9 14:07:33 2025 +0000

    CD controller: log manager config on startup

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit bb16c33
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Thu Oct 9 11:31:36 2025 +0000

    CD controller/plugins/daemon: introduce LOG_VERBOSITY

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 7e89b22
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Thu Oct 9 11:29:46 2025 +0000

    CD controller: introduce LOG_VERBOSITY_CD_DAEMON

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit c5b147b
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Thu Oct 9 14:10:33 2025 +0000

    tests: add note about instability around chart flip

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 4cc705a
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Thu Oct 9 11:46:10 2025 +0000

    Helm: expose kubelet plugin env via chart variables

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 5f143b2
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Thu Oct 9 15:53:07 2025 +0000

    Upper-case log msg, no explicit verb 0

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 8321983
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Sep 30 12:16:17 2025 +0000

    Change log message levels according to new system

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit a36e214
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Sep 30 12:12:38 2025 +0000

    Add logVerbosity Helm chart parameter

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit a79a9fd
Merge: 3903df7 6e56823
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Sat Oct 11 13:43:17 2025 +0200

    Merge pull request kubernetes-sigs#646 from jgehrcke/jp/no-clique-update-cd-node-status

    Release workload on a non-MNNVL node in a CD

commit 6e56823
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Fri Oct 10 19:47:48 2025 +0000

    CD plugin: move CDI edit gen into computeDomainDaemonSettings

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    make diff smaller, rename func

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit f7e4a45
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Fri Oct 10 16:30:27 2025 +0000

    CD daemon: always mount in IMEX daemon config files

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    CD plugin: always prepare IMEX config on the host and mount it in

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit c040429
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Fri Oct 10 15:59:07 2025 +0000

    Fix typos in comments and log message

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit deccb4d
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 11:39:33 2025 +0000

    CD plugin: always inject CD details via CDI

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    Rename 'domain' to 'domainID'

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    squash: review feedback

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    shorten comment

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 023e7f9
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 11:38:23 2025 +0000

    Enrich error message with CD detail when CD not found

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 32180ad
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 11:37:43 2025 +0000

    CD daemon: unconditionally write IMEX daemon config

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

    Break out of select/case, MkdirAll() before writing file

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 13df4da
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 09:58:50 2025 +0000

    CD daemon: init node status as NotReady, misc log msg & comment tweaks

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 3cbd5a4
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 09:55:47 2025 +0000

    CD daemon: keep business logic in no-IMEX-daemon noop mode

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit fffcea2
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 09:50:30 2025 +0000

    Introduce maxNodesPerIMEXDomain special case for empty cliqueID

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit e0b8990
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 09:49:07 2025 +0000

    Update code comments

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 3903df7
Merge: 14dc9fe 72e39e9
Author: Kevin Klues <kklues@nvidia.com>
Date:   Fri Oct 10 13:22:31 2025 +0200

    Merge pull request kubernetes-sigs#661 from jgehrcke/jp/flush-logs-on-shutdown

    Flush logs in CLI app `After` hook

commit 14dc9fe
Merge: 8788dd1 d34a12f
Author: Kevin Klues <kklues@nvidia.com>
Date:   Fri Oct 10 13:16:53 2025 +0200

    Merge pull request kubernetes-sigs#656 from jgehrcke/jp/custom-rate-limiting

    Introduce DefaultPrepUnprepRateLimiter (less aggressive)

commit 8788dd1
Merge: 23d205f 0770c0a
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Fri Oct 10 12:43:33 2025 +0200

    Merge pull request kubernetes-sigs#666 from klueska/rbac-update

    Separate controller and kubeletplugin into separate RBAC permissions

commit 0770c0a
Author: Kevin Klues <kklues@nvidia.com>
Date:   Thu Oct 9 13:41:03 2025 +0000

    Separate controller and kubeletplugin into separate RBAC permissions

    Signed-off-by: Kevin Klues <kklues@nvidia.com>

commit 23d205f
Merge: fca1c08 816c7a1
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Thu Oct 9 10:01:28 2025 +0200

    Merge pull request kubernetes-sigs#664 from NVIDIA/dependabot/docker/deployments/container/main/nvidia/distroless/cc-v3.1.13-dev

    build(deps): bump nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev in /deployments/container

commit fca1c08
Merge: e089759 b15d633
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Thu Oct 9 09:56:21 2025 +0200

    Merge pull request kubernetes-sigs#665 from NVIDIA/dependabot/docker/deployments/devel/main/golang-1.25.2

    build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel

commit b15d633
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Oct 8 17:15:07 2025 +0000

    build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel

    Bumps golang from 1.25.1 to 1.25.2.

    ---
    updated-dependencies:
    - dependency-name: golang
      dependency-version: 1.25.2
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

commit 816c7a1
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Oct 8 17:15:03 2025 +0000

    build(deps): bump nvidia/distroless/cc in /deployments/container

    Bumps nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev.

    ---
    updated-dependencies:
    - dependency-name: nvidia/distroless/cc
      dependency-version: v3.1.13-dev
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

commit 72e39e9
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Sep 30 09:42:57 2025 +0000

    Flush logs in CLI app `After` hook

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit d34a12f
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 8 15:18:53 2025 +0200

    Adjust go.mod to recent changes

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 7e18c33
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Sep 30 12:18:41 2025 +0000

    Introduce DefaultPrepUnprepRateLimiter (less aggressive)

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit e089759
Merge: 765892d e9f647e
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Wed Oct 8 13:09:32 2025 +0200

    Merge pull request kubernetes-sigs#651 from jgehrcke/jp/issue-694

    CD daemon: coordinate CD updates on shutdown via mutation cache

commit e9f647e
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 17:55:26 2025 +0000

    tests: cover CD daemon cleanup-on-shutdown

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 980a6a1
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 17:06:42 2025 +0000

    CD daemon: pod mngr: store UpdateStatus return value in mutation cache

    This makes sure that fast incremental mutations on
    the same CD object performed during shutdown are done
    conflict-free (i.e., in actual, incremental fashion
    using intermediate state returned by the API server).

    Without this patch:

    I1007 16:49:01.678050       1 podmanager.go:196] Successfully updated node gb-nvl-043-compute06 status to NotReady
    E1007 16:49:01.681345       1 computedomain.go:161] Failed to remove node from ComputeDomain during shutdown: [...] \
    				"the object has been modified" [...]

    With this patch:

    I1007 16:59:55.350436       1 podmanager.go:200] Successfully updated node gb-nvl-043-compute07 status to NotReady
    I1007 16:59:55.353551       1 computedomain.go:402] Successfully removed node with IP 192.168.34.153 from ComputeDomain default/imex-channel-injection

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
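
The conflict shown in the log excerpt above follows from optimistic concurrency: a second update based on an outdated copy of the object is rejected. A minimal in-memory sketch of the effect, assuming a toy API server that compares resource versions; `fakeAPIServer`, `computeDomain`, and `shutdown` are hypothetical illustrations, not the daemon's types or the mutation cache it uses.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("the object has been modified")

// computeDomain is a hypothetical, heavily reduced stand-in for the CD object.
type computeDomain struct {
	ResourceVersion int
	NodeStatus      string
	NodeRegistered  bool
}

// fakeAPIServer rejects updates that are not based on the stored version.
type fakeAPIServer struct {
	stored computeDomain
}

// UpdateStatus stores cd if it is based on the current version and returns
// the stored object with its new resource version.
func (s *fakeAPIServer) UpdateStatus(cd computeDomain) (computeDomain, error) {
	if cd.ResourceVersion != s.stored.ResourceVersion {
		return computeDomain{}, errConflict
	}
	cd.ResourceVersion++
	s.stored = cd
	return cd, nil
}

func newServer() *fakeAPIServer {
	return &fakeAPIServer{stored: computeDomain{
		ResourceVersion: 1, NodeStatus: "Ready", NodeRegistered: true,
	}}
}

// shutdown performs two quick mutations in a row: mark the node NotReady,
// then remove it. With reuseReturned set, the second mutation builds on the
// object returned by the first update (the role the mutation cache plays);
// otherwise it builds on the stale local copy.
func shutdown(s *fakeAPIServer, reuseReturned bool) error {
	local := s.stored // stands in for a local cache that lags behind
	local.NodeStatus = "NotReady"
	updated, err := s.UpdateStatus(local)
	if err != nil {
		return err
	}
	if reuseReturned {
		local = updated
	}
	local.NodeRegistered = false
	_, err = s.UpdateStatus(local)
	return err
}

func main() {
	fmt.Println(shutdown(newServer(), false)) // the object has been modified
	fmt.Println(shutdown(newServer(), true))  // <nil>
}
```

In the sketch, reusing the returned object removes the need to wait for a watch event before issuing the second mutation.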

commit 4b91fce
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date:   Tue Oct 7 15:50:06 2025 +0000

    CD daemon: coordinate CD updates on shutdown via mutationcache

    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 765892d
Merge: 2b7e899 754a758
Author: Kevin Klues <kklues@nvidia.com>
Date:   Wed Oct 8 09:52:51 2025 +0200

    Merge pull request kubernetes-sigs#650 from NVIDIA/dependabot/github_actions/github/codeql-action-4

    build(deps): bump github/codeql-action from 3 to 4

commit 754a758
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Oct 7 17:08:56 2025 +0000

    build(deps): bump github/codeql-action from 3 to 4

    Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3 to 4.
    - [Release notes](https://github.com/github/codeql-action/releases)
    - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
    - [Commits](github/codeql-action@v3...v4)

    ---
    updated-dependencies:
    - dependency-name: github/codeql-action
      dependency-version: '4'
      dependency-type: direct:production
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

Labels

kind/bug: Categorizes issue or PR as related to a bug.
robustness: issue/pr: edge cases & fault tolerance

Development

Successfully merging this pull request may close these issues.

Release workload on a non-MNNVL node in a CD

2 participants