Release workload on a non-MNNVL node in a CD #646
jgehrcke merged 10 commits into kubernetes-sigs:main from
| } | ||
| // Update node info in ComputeDomain | ||
| // Why are we doing this (only) upon receiving another update? |
I think a question I had here really is:
If everyone only sends their own update after receiving an update -- who performs the initial update to start the cascade?
onAddOrUpdate() gets called on any addition / update of the CD. When the informer first comes online it triggers an "add" on all of the existing objects that it receives from the API server. This is the initial callback that allows this component to insert its NodeInfo in the Status.
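The startup behavior described here can be illustrated with a toy model. This is a minimal sketch and not client-go: `toyInformer`, `computeDomain`, and `registerNode` are hypothetical stand-ins that only mimic the one property under discussion, namely that pre-existing objects are delivered as "add" events when the informer starts.

```go
package main

import "fmt"

// computeDomain is a hypothetical, stripped-down stand-in for the CD object.
type computeDomain struct {
	Name  string
	Nodes []string // names of nodes that inserted themselves into the status
}

// toyInformer mimics one property of an informer: on start, every object
// that already exists is passed to the handler as if it had just been added.
type toyInformer struct {
	store   map[string]*computeDomain
	handler func(cd *computeDomain)
}

// Start replays all pre-existing objects through the handler.
func (i *toyInformer) Start() {
	for _, cd := range i.store {
		i.handler(cd)
	}
}

// registerNode returns a handler that inserts nodeName into the CD's node
// list unless it is already present, so repeated events are harmless.
func registerNode(nodeName string) func(cd *computeDomain) {
	return func(cd *computeDomain) {
		for _, n := range cd.Nodes {
			if n == nodeName {
				return
			}
		}
		cd.Nodes = append(cd.Nodes, nodeName)
	}
}

func main() {
	inf := &toyInformer{
		store:   map[string]*computeDomain{"cd-a": {Name: "cd-a"}},
		handler: registerNode("node-1"),
	}
	// No update from another node is needed: the initial "add" is enough
	// for this node to insert itself.
	inf.Start()
	fmt.Println(inf.store["cd-a"].Nodes)
}
```

In this model the cascade needs no external trigger, because the first callback comes from the informer itself.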
> When the informer first comes online it triggers an "add" on all of the existing objects that it receives from the API server.
Thank you! (I now remember that you've described this before)
Added a commit and removed the comment (though I clarified this in the comment for onAddOrUpdate, so that I won't forget again so quickly).
| // Try to write update. May be in conflict with updates to the same list | ||
| // coming in from other nodes (in which case this update will be retried | ||
| // later). TODO: review conflict resolution methodology for the | ||
| // 10000-nodes-CD use case. |
How many update attempts overall does it take along the happy path for say N=1000 nodes each trying to insert themselves at about the same time? The individual node has to expect a large number of conflict errors E before its own update succeeds. If E=100 (just making something up, this could be larger) then we make N*E=100000 update attempts in total. Is that a correct consideration? If it is, we should think about finding a smarter strategy that scales better with N.
After each attempt there is an exponential backoff with some random jitter to stagger attempts from different nodes. We also (re)read the CD object just before attempting to write, in order to make the window as small as possible where a conflict might slip in. We would need to measure it, but I don't think the number of average conflicts would be nearly as large as you are suggesting.
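The retry strategy described in this comment (re-read just before writing, then exponential backoff with random jitter after a conflict) can be sketched as follows. This is an illustration under stated assumptions, not the driver's code: `store` is a hypothetical in-memory stand-in for the API server with version-based optimistic concurrency, and `insertSelf` and `runNodes` are made-up names.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

var errConflict = errors.New("conflict: object was modified")

// store is a hypothetical stand-in for the API server. A write is rejected
// when it is based on a stale version (optimistic concurrency).
type store struct {
	mu      sync.Mutex
	version int
	nodes   []string
}

func (s *store) read() (int, []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.version, append([]string(nil), s.nodes...)
}

func (s *store) write(basedOn int, nodes []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if basedOn != s.version {
		return errConflict
	}
	s.nodes = nodes
	s.version++
	return nil
}

// insertSelf re-reads right before each write attempt and, after a conflict,
// sleeps for a random duration within an exponentially growing window.
// It returns the number of write attempts that were needed.
func insertSelf(s *store, node string, base, maxDelay time.Duration, rng *rand.Rand) int {
	delay := base
	for attempt := 1; ; attempt++ {
		version, nodes := s.read()
		if err := s.write(version, append(nodes, node)); err == nil {
			return attempt
		}
		time.Sleep(time.Duration(rng.Int63n(int64(delay) + 1)))
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}

// runNodes lets n goroutines insert themselves concurrently and reports how
// many nodes ended up registered and how many write attempts were made.
func runNodes(n int) (registered, totalAttempts int) {
	s := &store{}
	attempts := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			rng := rand.New(rand.NewSource(int64(i) + 1)) // one RNG per goroutine
			attempts[i] = insertSelf(s, fmt.Sprintf("node-%d", i), time.Millisecond, 20*time.Millisecond, rng)
		}(i)
	}
	wg.Wait()
	for _, a := range attempts {
		totalAttempts += a
	}
	_, nodes := s.read()
	return len(nodes), totalAttempts
}

func main() {
	registered, total := runNodes(50)
	fmt.Printf("registered=%d total write attempts=%d\n", registered, total)
}
```

The number of attempts varies between runs. The sketch only shows the mechanism and says nothing about conflict rates against a real API server.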
> We would need to measure it
Yes, that's the only productive approach!
> We also (re)read the CD object just before attempting to write
Yes! But we understand this doesn't necessarily yield the latest -- we read it from the informer stream/cache (which might be subject to lag, right?).
> After each attempt there is an exponential backoff with some random jitter to stagger attempts from different nodes.
I understand :)
I also do think we agree, and I am not planning to change anything here or criticize what we have.
I am merely trying to rationally reason through current scaling behavior.
In terms of that scaling behavior, we want to understand not only how the number of global write attempts scales with the number of nodes N (that might be a measure for API server load generated).
But (maybe even more importantly) we also want to understand how the overall convergence duration D scales with N. And we probably eventually need to set ourselves a goal (as in: an upper bound that we find tolerable), and assess on that basis if we need to consider a more sophisticated approach.
> nearly as large as you are suggesting.
I want to say: I did not suggest E=100, I was mainly suggesting somewhat linear scaling behavior with N. That might even be best case; the convergence duration might after all scale non-linearly (say, with O(N^2) or so) -- precisely because of the applied back-off during retrying.
We will see.
Supporting "Uber-sized Compute Domains" will be fun in that regard!
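To reason about how total write attempts and convergence duration D might scale with N, a crude tick-based model can help before real measurements exist. The sketch below rests on strong assumptions that are not from the driver: time advances in discrete ticks, at most one write succeeds per tick, every other node attempting in that tick sees a conflict, and backoff windows double up to a cap. `simulate` and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// simulate models n nodes that all try to insert themselves at tick 0.
// Per tick, at most one write succeeds (here: the lowest-indexed node that
// attempts); every other node attempting in the same tick gets a conflict
// and retries after a jittered, doubling backoff capped at maxBackoff ticks.
// maxBackoff must be >= 1. It returns the total number of write attempts
// and the tick at which the last node succeeded.
func simulate(n, maxBackoff int, seed int64) (attempts, lastTick int) {
	rng := rand.New(rand.NewSource(seed))
	next := make([]int, n)    // tick of each node's next attempt, -1 when done
	backoff := make([]int, n) // current backoff window per node, in ticks
	for i := range backoff {
		backoff[i] = 1
	}
	pending := n
	for tick := 0; pending > 0; tick++ {
		winnerChosen := false
		for i := 0; i < n; i++ {
			if next[i] != tick {
				continue
			}
			attempts++
			if !winnerChosen {
				winnerChosen = true
				next[i] = -1
				pending--
				lastTick = tick
				continue
			}
			// Conflict: retry at a random tick within the backoff window.
			next[i] = tick + 1 + rng.Intn(backoff[i])
			if backoff[i] *= 2; backoff[i] > maxBackoff {
				backoff[i] = maxBackoff
			}
		}
	}
	return attempts, lastTick
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		attempts, lastTick := simulate(n, 64, 1)
		fmt.Printf("N=%4d attempts=%6d last success at tick %d\n", n, attempts, lastTick)
	}
}
```

Because this model serializes successful writes to one per tick, D cannot drop below N ticks here. Whether a real API server behaves anywhere near that is exactly what would need to be measured.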
removed the comment as part of rebasing on top of head of main -- thanks for discussion
@klueska thanks for the review. I responded to comments and added commits. Can you have another look? Before merge, I would squash commits.
e6b364c to 1a259f4
I have rebased this branch on the current head of main, resolved conflicts, and also squashed commits. To be sure the merge-to-main wouldn't be bad, I just tested this again manually with
Works (CD daemon log):
Workload got released:
Is of course half-broken because we didn't perform CDI-based injection:
Let's land this now (unless there's a rough problem in here). Let's do all non-functional changes in a follow-up. Specifically, let us sort out finding a better encapsulation for the non-imex-daemon-mode later.
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
1a259f4 to c5636d9
| if err := writeIMEXConfig(flags.podIP); err != nil { | ||
| return fmt.Errorf("writeIMEXConfig failed: %w", err) | ||
| } |
Add this to the function to avoid this else statement (and just always do this unconditionally):
// writeIMEXConfig renders the config template with the pod IP and writes it to the final config file.
func writeIMEXConfig(podIP string) error {
	// Ensure the directory exists
	dir := filepath.Dir(nodesConfigPath)
	if err := os.MkdirAll(dir, 0755); err != nil {
		return fmt.Errorf("failed to create directory %s: %w", dir, err)
	}
	...
}
Done, now calling os.MkdirAll() in writeIMEXConfig().
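For illustration, a self-contained sketch of a function with this shape follows. It is not the PR's `writeIMEXConfig()`: `renderConfig`, the explicit path parameters, the `{{.PodIP}}` placeholder, and the `ip=` line in the demo template are all assumptions made for this example. It shows that `os.MkdirAll()` takes care of a missing output directory, while a missing template file still surfaces as a parse error.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"text/template"
)

// renderConfig parses the template at tmplPath, fills in podIP, and writes
// the result to outPath, creating outPath's parent directory first.
// Paths and the {{.PodIP}} placeholder are assumptions for this sketch.
func renderConfig(tmplPath, outPath, podIP string) error {
	tmpl, err := template.ParseFiles(tmplPath)
	if err != nil {
		return fmt.Errorf("error parsing template file: %w", err)
	}
	// Ensure the directory exists.
	dir := filepath.Dir(outPath)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return fmt.Errorf("failed to create directory %s: %w", dir, err)
	}
	f, err := os.Create(outPath)
	if err != nil {
		return fmt.Errorf("failed to create %s: %w", outPath, err)
	}
	defer f.Close()
	return tmpl.Execute(f, struct{ PodIP string }{podIP})
}

// demo renders a hypothetical one-line template into a directory that does
// not exist yet and returns the rendered content.
func demo() (string, error) {
	tmp, err := os.MkdirTemp("", "imex-sketch-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(tmp)
	tmplPath := filepath.Join(tmp, "config.tmpl.cfg")
	if err := os.WriteFile(tmplPath, []byte("ip={{.PodIP}}\n"), 0o644); err != nil {
		return "", err
	}
	out := filepath.Join(tmp, "not", "yet", "there", "config.cfg")
	if err := renderConfig(tmplPath, out, "10.0.0.7"); err != nil {
		return "", err
	}
	data, err := os.ReadFile(out)
	return string(data), err
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```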
After these changes, I once again needed to provoke and test that scenario (due diligence), by mocking the cliqueID acquisition function so that it returns an empty string.
Glad I did, because the change above resulted in:
Error: writeIMEXConfig failed: error parsing template file: open /etc/nvidia-imex/config.tmpl.cfg: no such file or directory
The problem with not mounting in /etc/nvidia-imex is not just "oh this dir doesn't exist".
The problem is: not mounting in the template to render the config file from.
But mounting that in requires writing it to the host filesystem in the first place. Otherwise:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/var/lib/kubelet/plugins/compute-domain.nvidia.com/domains/885e0f1b-f319-40dd-81c6-15ae6c690130" to rootfs at "/etc/nvidia-imex": create mount destination for /etc/nvidia-imex mount: bind mount source stat: stat /var/lib/kubelet/plugins/compute-domain.nvidia.com/domains/885e0f1b-f319-40dd-81c6-15ae6c690130: no such file or directory: unknown
I have now added more changes to this PR to always write this IMEX daemon configuration directory tree to the host, and to always mount it in (to the CD daemon). I think it makes sense!
Now, the only difference in non-IMEX mode is that we do not inject any IMEX-related dev nodes.
So, with this change, we've created even more symmetry between non-IMEX and IMEX mode.
Specifically, we now always create
/var/lib/kubelet/plugins/compute-domain.nvidia.com/domains/<domainID>
on the host (that's the mount source).
One might argue that this host filesystem interaction is superfluous and only introduces more failure paths -- because if this fails, we can't even launch the CD daemon in 'pretend mode'. But it's very unlikely to ever hurt in practice.
I do believe that for now we should cherish the simplicity of "let's do everything except for actually spawning the IMEX daemon process".
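The host-side part of this symmetry can be sketched roughly as below. This is a hedged illustration, not the plugin's implementation: `prepareDomainDir`, the `mount` type, and `demoPrepare` are hypothetical names, and the demo uses a temporary directory in place of the real plugin directory. The directory layout (`domains/<domainID>`), the template file name, and the `/etc/nvidia-imex` mount target are taken from the messages quoted in this thread.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// mount describes a bind mount from the host into the CD daemon container.
type mount struct{ HostPath, ContainerPath string }

// prepareDomainDir creates <pluginRoot>/domains/<domainID> on the host and
// places a config template in it. It is meant to run regardless of whether
// the node has a clique ID, so that the bind mount source always exists.
func prepareDomainDir(pluginRoot, domainID string, tmpl []byte) (mount, error) {
	src := filepath.Join(pluginRoot, "domains", domainID)
	if err := os.MkdirAll(src, 0o755); err != nil {
		return mount{}, fmt.Errorf("failed to create %s: %w", src, err)
	}
	if err := os.WriteFile(filepath.Join(src, "config.tmpl.cfg"), tmpl, 0o644); err != nil {
		return mount{}, fmt.Errorf("failed to write template: %w", err)
	}
	return mount{HostPath: src, ContainerPath: "/etc/nvidia-imex"}, nil
}

// demoPrepare runs prepareDomainDir against a temporary directory and
// reports whether the template file exists in the mount source afterwards.
func demoPrepare() (mount, bool) {
	root, err := os.MkdirTemp("", "cd-plugin-")
	if err != nil {
		return mount{}, false
	}
	defer os.RemoveAll(root)
	m, err := prepareDomainDir(root, "example-domain-id", []byte("# placeholder template\n"))
	if err != nil {
		return mount{}, false
	}
	_, statErr := os.Stat(filepath.Join(m.HostPath, "config.tmpl.cfg"))
	return m, statErr == nil
}

func main() {
	m, ok := demoPrepare()
	fmt.Printf("mount %s -> %s (template present: %v)\n", m.HostPath, m.ContainerPath, ok)
}
```

If this step fails, the CD daemon cannot be launched even in 'pretend mode', which is the extra failure path weighed above.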
klueska
left a comment
Looking good, just a few small changes
Of course. Time to get a spell check into my IDE, too.
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Break out of select/case, MkdirAll() before writing file Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Rename 'domain' to 'domainID' Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> squash: review feedback Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> shorten comment Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
e534ed3 to 02ff3c0
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> CD plugin: always prepare IMEX config on the host and mount it in Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
02ff3c0 to f7e4a45
I've updated the PR with the changes discussed, and made more adjustments described here to further increase symmetry between "non-IMEX" and "actual-IMEX" modes (for lack of better words). I've also worked on cleaning up the commit history a bit (squashed what made sense). I performed another manual test for
| fmt.Sprintf("CLIQUE_ID=%s", s.manager.cliqueID), | ||
| fmt.Sprintf("COMPUTE_DOMAIN_UUID=%s", cd.UID), | ||
| fmt.Sprintf("COMPUTE_DOMAIN_NAME=%s", cd.Name), | ||
| fmt.Sprintf("COMPUTE_DOMAIN_NAMESPACE=%s", cd.Namespace), |
note to self: we could have made the diff smaller by moving Env up again -- but that's okay.
| } | ||
| // Always inject CD config details into the CD daemon (regardless of clique | ||
| // ID being empty or not). | ||
| edits, err := computeDomainDaemonSettings.GetCommonCDIContainerEdits(ctx) |
nit for symmetry, this could be called GetCDIContainerEditsCommon instead of GetCommonCDIContainerEdits so that we have:
GetCDIContainerEditsForImex
GetCDIContainerEditsCommon
Personally, I like grouping by prefix.
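A simplified sketch of how the two prefix-grouped functions might compose: the common edits are always applied, and the IMEX-specific ones only when a clique ID is present. Everything here is an assumption for illustration, including the `containerEdits` and `daemonSettings` types, the method signatures, the `edits` helper, and the placeholder device path. Only the two function names and the four environment variables come from the diff context quoted in this review.

```go
package main

import "fmt"

// containerEdits is a hypothetical, simplified stand-in for CDI container edits.
type containerEdits struct {
	Env         []string
	DeviceNodes []string
}

// daemonSettings is a hypothetical stand-in for computeDomainDaemonSettings.
type daemonSettings struct {
	cliqueID, cdUID, cdName, cdNamespace string
}

// GetCDIContainerEditsCommon returns edits injected regardless of the
// clique ID being empty or not.
func (s daemonSettings) GetCDIContainerEditsCommon() containerEdits {
	return containerEdits{Env: []string{
		fmt.Sprintf("CLIQUE_ID=%s", s.cliqueID),
		fmt.Sprintf("COMPUTE_DOMAIN_UUID=%s", s.cdUID),
		fmt.Sprintf("COMPUTE_DOMAIN_NAME=%s", s.cdName),
		fmt.Sprintf("COMPUTE_DOMAIN_NAMESPACE=%s", s.cdNamespace),
	}}
}

// GetCDIContainerEditsForImex returns edits that are only needed when an
// IMEX daemon actually runs. The device paths are passed in by the caller.
func (s daemonSettings) GetCDIContainerEditsForImex(devs []string) containerEdits {
	return containerEdits{DeviceNodes: append([]string(nil), devs...)}
}

// edits combines both groups: common edits always, IMEX edits only when the
// clique ID is non-empty.
func (s daemonSettings) edits(imexDevs []string) containerEdits {
	e := s.GetCDIContainerEditsCommon()
	if s.cliqueID != "" {
		imex := s.GetCDIContainerEditsForImex(imexDevs)
		e.DeviceNodes = append(e.DeviceNodes, imex.DeviceNodes...)
	}
	return e
}

func main() {
	devs := []string{"/dev/placeholder-imex-device"} // placeholder path
	noClique := daemonSettings{cdUID: "uid-1", cdName: "cd-a", cdNamespace: "default"}
	withClique := noClique
	withClique.cliqueID = "clique-1"
	fmt.Println(len(noClique.edits(devs).DeviceNodes), len(withClique.edits(devs).DeviceNodes))
}
```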
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> make diff smaller, rename func Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
de09926 to 6e56823
Great discussion and feedback. Post-approval, I took the liberty to add this cosmetic cleanup patch (squashed into the last commit on this branch):
commit a79a9fd
Merge: 3903df7 6e56823
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Sat Oct 11 13:43:17 2025 +0200
    Merge pull request kubernetes-sigs#646 from jgehrcke/jp/no-clique-update-cd-node-status
    Release workload on a non-MNNVL node in a CD

commit 6e56823
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Fri Oct 10 19:47:48 2025 +0000
    CD plugin: move CDI edit gen into computeDomainDaemonSettings
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
    make diff smaller, rename func
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit f7e4a45
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Fri Oct 10 16:30:27 2025 +0000
    CD daemon: always mount in IMEX daemon config files
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
    CD plugin: always prepare IMEX config on the host and mount it in
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit c040429
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Fri Oct 10 15:59:07 2025 +0000
    Fix typos in comments and log message
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit deccb4d
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Tue Oct 7 11:39:33 2025 +0000
    CD plugin: always inject CD details via CDI
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
    Rename 'domain' to 'domainID'
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
    squash: review feedback
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
    shorten comment
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 023e7f9
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Tue Oct 7 11:38:23 2025 +0000
    Enrich error message with CD detail when CD not found
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 32180ad
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Tue Oct 7 11:37:43 2025 +0000
    CD daemon: unconditionally write IMEX daemon config
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
    Break out of select/case, MkdirAll() before writing file
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 13df4da
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Tue Oct 7 09:58:50 2025 +0000
    CD daemon: init node status as NotReady, misc log msg & comment tweaks
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit 3cbd5a4
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Tue Oct 7 09:55:47 2025 +0000
    CD daemon: keep business logic in no-IMEX-daemon noop mode
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit fffcea2
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Tue Oct 7 09:50:30 2025 +0000
    Introduce maxNodesPerIMEXDomain special case for empty cliqueID
    Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

commit e0b8990
Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Date: Tue Oct 7 09:49:07 2025 +0000
    Update code comments
    Signed-off-by: Dr.
Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 3903df7 Merge: 14dc9fe 72e39e9 Author: Kevin Klues <kklues@nvidia.com> Date: Fri Oct 10 13:22:31 2025 +0200 Merge pull request kubernetes-sigs#661 from jgehrcke/jp/flush-logs-on-shutdown Flush logs in CLI app `After` hook commit 14dc9fe Merge: 8788dd1 d34a12f Author: Kevin Klues <kklues@nvidia.com> Date: Fri Oct 10 13:16:53 2025 +0200 Merge pull request kubernetes-sigs#656 from jgehrcke/jp/custom-rate-limiting Introduce DefaultPrepUnprepRateLimiter (less aggressive) commit 8788dd1 Merge: 23d205f 0770c0a Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Fri Oct 10 12:43:33 2025 +0200 Merge pull request kubernetes-sigs#666 from klueska/rbac-update Separate controller and kubeletplugin into separate RBAC permissions commit 0770c0a Author: Kevin Klues <kklues@nvidia.com> Date: Thu Oct 9 13:41:03 2025 +0000 Separate controller and kubeletplugin into separate RBAC permissions Signed-off-by: Kevin Klues <kklues@nvidia.com> commit 23d205f Merge: fca1c08 816c7a1 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 10:01:28 2025 +0200 Merge pull request kubernetes-sigs#664 from NVIDIA/dependabot/docker/deployments/container/main/nvidia/distroless/cc-v3.1.13-dev build(deps): bump nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev in /deployments/container commit fca1c08 Merge: e089759 b15d633 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 09:56:21 2025 +0200 Merge pull request kubernetes-sigs#665 from NVIDIA/dependabot/docker/deployments/devel/main/golang-1.25.2 build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel commit b15d633 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Oct 8 17:15:07 2025 +0000 build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel Bumps golang from 1.25.1 to 1.25.2. 
--- updated-dependencies: - dependency-name: golang dependency-version: 1.25.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> commit 816c7a1 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Oct 8 17:15:03 2025 +0000 build(deps): bump nvidia/distroless/cc in /deployments/container Bumps nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev. --- updated-dependencies: - dependency-name: nvidia/distroless/cc dependency-version: v3.1.13-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> commit 72e39e9 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Sep 30 09:42:57 2025 +0000 Flush logs in CLI app `After` hook Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit d34a12f Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Wed Oct 8 15:18:53 2025 +0200 Adjust go.mod to recent changes Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 7e18c33 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Sep 30 12:18:41 2025 +0000 Introduce DefaultPrepUnprepRateLimiter (less aggressive) Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit e089759 Merge: 765892d e9f647e Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Wed Oct 8 13:09:32 2025 +0200 Merge pull request kubernetes-sigs#651 from jgehrcke/jp/issue-694 CD daemon: coordinate CD updates on shutdown via mutation cache commit e9f647e Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 17:55:26 2025 +0000 tests: cover CD daemon cleanup-on-shutdown Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 980a6a1 Author: Dr. 
Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 17:06:42 2025 +0000 CD daemon: pod mngr: store UpdateStatus return value in mutation cache This makes sure that fast incremental mutations on the same CD object performed during shutdown are done conflict-free (i.e., in actual, incremental fashion using intermediate state returned by the API server). Without this patch: I1007 16:49:01.678050 1 podmanager.go:196] Successfully updated node gb-nvl-043-compute06 status to NotReady E1007 16:49:01.681345 1 computedomain.go:161] Failed to remove node from ComputeDomain during shutdown: [...] \ "the object has been modified" [...] With this patch: I1007 16:59:55.350436 1 podmanager.go:200] Successfully updated node gb-nvl-043-compute07 status to NotReady I1007 16:59:55.353551 1 computedomain.go:402] Successfully removed node with IP 192.168.34.153 from ComputeDomain default/imex-channel-injection Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 4b91fce Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 15:50:06 2025 +0000 CD daemon: coordinate CD updates on shutdown via mutationcache Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 765892d Merge: 2b7e899 754a758 Author: Kevin Klues <kklues@nvidia.com> Date: Wed Oct 8 09:52:51 2025 +0200 Merge pull request kubernetes-sigs#650 from NVIDIA/dependabot/github_actions/github/codeql-action-4 build(deps): bump github/codeql-action from 3 to 4 commit 754a758 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue Oct 7 17:08:56 2025 +0000 build(deps): bump github/codeql-action from 3 to 4 Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3 to 4. 
- [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@v3...v4) --- updated-dependencies: - dependency-name: github/codeql-action dependency-version: '4' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
Resolves #625.
Slightly less trivial than I thought.
maxNodesPerIMEXDomain for the group of non-MNNVL nodes (can become large)

The group of nodes that does not perform MNNVL may be large. With this change, all of these nodes will contribute to the update-with-retry-on-conflict race, which is important to consider. I added a few code comments along the way.
I tested this manually, with this patch to simulate the scenario:
CD daemon log excerpt, from start-to-shutdown:
With the getCliqueID() patch removed, I saw a passing test suite: