Add `logVerbosity` Helm chart parameter, reduce default log verbosity #633

jgehrcke · 2025-09-30T12:41:20Z

Resolves #609.

This PR has a set of related changes that we can also discuss in separate PRs if preferred:

1. For better debuggability: log the full component config (flags object) upon component startup (with go-render to quickly get to a stringified version of a more or less complex, nested struct -- open to using a different strategy).
   Edit: now here: Log full startup config in all CLIs in Before hook #658
2. For control: introduction of a component-global `logVerbosity` Helm chart parameter, including documentation laying out the starting point for a verbosity system (comments very welcome).
3. For less noise in default config:
   - Not as much chatter anymore during runtime (with level 1 being the default, and some proposals made for what level 1 should contain -- see `logVerbosity` level documentation).
   - Re-parametrization of the work queue rate limiter for prepare/unprepare retries to be slightly slower -- we do not need to retry ~10 times per second initially when expected time-to-completion is O(1 s) or slower anyway. I made this change with log verbosity being a motivation, but I believe architecturally this change also makes sense and might even be important (discussed elsewhere). I chose the current parameters almost flying blind. If in the future we ever find the work queue rate limiter to introduce unnecessary latency (as part of latency/performance tuning): let's of course adjust this again. Edit: now here: Introduce DefaultPrepUnprepRateLimiter (less aggressive) #656
4. For robustness: explicit log flushing as part of component shutdown (I think we missed this so far).
   Edit: now here: Flush logs in CLI app After hook #661

One interesting change that I propose here is to not have those "updated/added object callback" confirmations logged on the default log level in the CD controller -- I think we should try to not scale log volume with number of objects created (at least in default config).

copy-pr-bot · 2025-09-30T12:41:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jgehrcke · 2025-10-07T07:43:47Z

I will chunk this PR into individual patches.

klueska · 2025-10-07T10:57:30Z

deployments/helm/nvidia-dra-driver-gpu/values.yaml

+# - Checkpoint file updates
+# - Kubelet plugin GRPC request/response detail
+#
+logVerbosity: "1"


I'm hesitant to add this as a top-level helm variable. Can we start with just exposing an envvar on the relevant components (without any helm value at all), with the default set to 1 in the code if no envvar is passed?

> start with just exposing an envvar on the relevant components

We can do that. But let's work towards a specific goal:

I'd love to have a simple way to set the log verbosity across all components (either upon Helm install, or right thereafter).

With that env var: are those ergonomics good enough for us, for actually using that?

How will we (you, I) use this e.g. in a test suite, and in debugging?

Where/how and when do we set that env var?

> hesitant to add this as a top-level helm variable

and a (smiling!) side note is that we thought of this task as "allow verbosity of kubelet plugins and controller to be set via helm" (#609)

You still set the envvars via helm. There is just no explicit helm value added to the helm API.

Maybe we take a hybrid approach. Rename this to defaultLogVerbosity and then allow it to be overridden by an envvar on a component by component basis.

> You still set the envvars via helm

Sorry for digging in -- but how exactly would we imagine a user to do that?

Are you thinking about e.g.

k8s-dra-driver-gpu/deployments/helm/nvidia-dra-driver-gpu/values.yaml, line 136 in e089759:

    env: []

and then doing something like --set controller.containers.computeDomain.env.LOG_VERBOSITY=5?

> Rename this to defaultLogVerbosity

We can do that.

> allow it to be overridden by an envvar on a component by component basis.

We can also do that! But we can maybe also figure that one out in another patch.

I looked into doing this (set verbosity via environment) when starting to work on this patch. What I found rather disappointing: for regular klog/v2, setting the -v CLI flag is the only way to set the verbosity, and there is no API for changing it during runtime.

Of course, there is a way to mutate the parsed CLI flags (say, based on an env var) before having klog interpret them. But that kind of approach may almost be appropriate for an obfuscation contest.

Edit: maybe something like this works (using an env var in the command specification)?

    command:
    - foo
    - --v=$(LOG_VERBOSITY)
    env:
    - name: LOG_VERBOSITY
      value: "4"

Because I just found:

> Kubernetes uses round parentheses for environment variable substitution in Pod command and args fields

Related topic: dynamically changing verbosity at runtime: I have seen a few pointers here and maybe nowadays there may be a better (even if not yet properly documented) way to dynamically set the verbosity (such as based on env var).

I want to explore other ways for changing the verbosity during runtime, and for that I think a starting point would be kubernetes/klog#368 and related resources -- klog is really limited in many ways compared to other logging libraries.

Other pointers (related, but not of much help IMO):

Need a way to update the log level dynamically kubernetes/klog#255

https://stackoverflow.com/a/79161517/145400

> Rename this to defaultLogVerbosity

I have done this now.

But are we sure about that variable name? I think the "default" prefix doesn't really help.

In configuration management, often there is an order of precedence -- the same parameter can be specified on various layers. Those layers then have a well-known hierarchical order.

I want to say that it's not uncommon to have an outer "logVerbosity" setting that can be overridden via some more "inner" mechanism, using the same parameter name.

(I mean, of course you know all that -- I just want to spell it out explicitly here, that I think this is one such case)
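To spell out the precedence idea in code, here is a minimal sketch, assuming three layers: a per-component override, a chart-wide value, and a built-in default. The function name `resolve` and the concrete values are hypothetical and only illustrate "the innermost layer that is set wins".

```go
package main

import "fmt"

// resolve returns the first non-empty value, walking from the innermost
// (most specific) layer to the outermost one.
func resolve(layers ...string) string {
	for _, l := range layers {
		if l != "" {
			return l
		}
	}
	return ""
}

func main() {
	const builtinDefault = "1" // hypothetical default compiled into the component
	chartValue := "2"          // hypothetical chart-wide logVerbosity
	componentEnv := ""         // hypothetical per-component LOG_VERBOSITY, unset here

	// The component override is unset, so the chart-wide value takes effect.
	fmt.Println(resolve(componentEnv, chartValue, builtinDefault)) // prints 2
}
```

With this shape, the same parameter name can be used on every layer, which is the point made above about not needing a "default" prefix.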

Another point of view, terminology: when not specifying --set defaultLogVerbosity=X -- what takes effect then? The default defaultLogVerbosity 🤔.

> doing something like --set controller.containers.computeDomain.env.LOG_VERBOSITY=5?

Update: well, that doesn't quite work.

I have now adjusted the patch so that one can do a per-component override.

The clumsy way to do that (tested):

    helm install ... \
      --set controller.containers.computeDomain.env[0].name=LOG_VERBOSITY \
      --set-string controller.containers.computeDomain.env[0].value=6 \
      ...

Well, --set-string or

--set 'controller.containers.computeDomain.env[0].value="6"'

Note to self. Also viable:

kubectl set env pod/<pname> name=value

Triggers a pod restart.

klueska · 2025-10-07T10:59:44Z

pkg/workqueue/workqueue.go

+// history), checks the global rate against the token bucket, and picks the
+// longest delay from either strategy, ensuring that both per-item and overall
+// queue health are respected.
+func DefaultPrepUnprepRateLimiter() workqueue.TypedRateLimiter[any] {


Can we call this DefaultKubeletPluginRateLimiter to be symmetrical with DefaultControllerRateLimiter?
Also, can you explain how this change is relevant to a PR focused on logging?

> can you explain how this change is relevant to a PR focused on logging?

As many times, I kindly refer to my PR description :-)

#633 (comment), point (3).

As indicated in that PR description and also in another comment: I am happy to distribute the commits across different PRs to keep things separate.

took this part of the patch to Introduce DefaultPrepUnprepRateLimiter (less aggressive) #656

took this part of the discussion to https://github.com/NVIDIA/k8s-dra-driver-gpu/pull/656/files#r2413832932

klueska · 2025-10-07T11:01:09Z

cmd/gpu-kubelet-plugin/main.go

+			// Runs after `Action` (regardless of success/error). In urfave cli
+			// v2, the final error reported will be from either Action, Before,
+			// or After (whichever is non-nil and last executed).
+			klog.Infof("shutdown")


Suggested change

-			klog.Infof("shutdown")
+			klog.Infof("Shutdown")

klueska · 2025-10-07T11:01:29Z

cmd/gpu-kubelet-plugin/main.go

 			return flags.loggingConfig.Apply()
 		},
 		Action: func(c *cli.Context) error {
+			klog.Infof("config: %v", flags)


Suggested change

-			klog.Infof("config: %v", flags)
+			klog.Infof("Config: %v", flags)

klueska · 2025-10-07T11:01:38Z

cmd/compute-domain-kubelet-plugin/main.go

+			// Runs after `Action` (regardless of success/error). In urfave cli
+			// v2, the final error reported will be from either Action, Before,
+			// or After (whichever is non-nil and last executed).
+			klog.Infof("shutdown")


Suggested change

-			klog.Infof("shutdown")
+			klog.Infof("Shutdown")

klueska · 2025-10-07T11:01:49Z

cmd/compute-domain-kubelet-plugin/main.go

 		},
 		Action: func(c *cli.Context) error {
-			ctx := c.Context
+			klog.Infof("config: %v", render.Render(flags))


Suggested change

-			klog.Infof("config: %v", render.Render(flags))
+			klog.Infof("Config: %v", render.Render(flags))

klueska · 2025-10-07T11:02:50Z

cmd/compute-domain-kubelet-plugin/health.go

-// Check implements [grpc_health_v1.HealthServer].
+// Check implements [grpc_health_v1.HealthServer.Check].


No -- the comment explains that this implements the grpc_health_v1.HealthServer interface

But Check does not implement the full HealthServer interface, right? If something claims to implement an interface, I'd expect it to implement more than just a part of that interface. That seems to be a language / convention aspect.

carried to #657

klueska · 2025-10-07T11:04:32Z

cmd/compute-domain-kubelet-plugin/health.go

 }

-func startHealthcheck(ctx context.Context, config *Config) (*healthcheck, error) {
+func setupHealthcheckPrimitives(ctx context.Context, config *Config) (*healthcheck, error) {


Why this change? Does it not start the healthcheck server?

Notably, this does not start a health check, as I found by following the code.

Which is why I do not like the name "startHealthcheck" (I was misled by that, I thought this starts a health check and I wondered why).

carried to #657

klueska · 2025-10-07T11:05:19Z

cmd/compute-domain-kubelet-plugin/driver.go

 				results[claim.UID] = err
 				wg.Done()
+				if err != nil {
+					klog.V(0).Infof("Permanent error unpreparing devices for claim %v: %v", claim.UID, err)


Can the V(0) be dropped?

Also, can the log be done inside nodeUnprepareResource so as not to clutter things here?

Yes, level 0 is implicit when doing klog.Infof().

> Can the V(0) be dropped?

I remember: I did this to make this explicit -- to set the precedent for always using an explicit level, to enhance code readability.

Because this is a question one naturally has when reading code: what level does Info log on by default? One needs to have that additional knowledge.

But I will remove this now again to eradicate a potential point of friction.

> can the log be done inside nodeUnprepareResource

In d.nodeUnprepareResource(ctx, claim) we return isPermanentError(err) directly w/o inspecting its return value. We only look at done here, at the call site.

We can change this of course, but let's not do this here.

klueska · 2025-10-07T11:06:17Z

cmd/compute-domain-kubelet-plugin/device_state.go

 		// Prepare+Checkpoint are done transactionally). Note that
 		// claimRef.String() contains namespace, name, UID.
-		klog.Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String())
+		klog.V(2).Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String())


Suggested change

-		klog.V(2).Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String())
+		klog.V(2).Infof("Unprepare noop: claim not found in checkpoint data: %v", claimRef.String())

klueska · 2025-10-07T11:07:44Z

cmd/compute-domain-controller/main.go

+			// Runs after `Action` (regardless of success/error). In urfave cli
+			// v2, the final error reported will be from either Action, Before,
+			// or After (whichever is non-nil and last executed).
+			klog.Infof("shutdown")


Suggested change

-			klog.Infof("shutdown")
+			klog.Infof("Shutdown")

I like consistency.

We have currently many log messages starting with a lower-case character:

    $ grep -nR 'klog' | grep -E '\(\"[[:lower:]]' | wc -l
    62

That's spread across levels:

    $ grep -nR 'klog' | grep -E '\(\"[[:lower:]]' | grep Info | wc -l
    35

We should do this in a follow-up; and maybe add a lint / check in CI.

klueska · 2025-10-07T11:07:58Z

cmd/compute-domain-controller/main.go

 			return flags.loggingConfig.Apply()
 		},
 		Action: func(c *cli.Context) error {
+			klog.Infof("config: %v", render.Render(flags))


Suggested change

-			klog.Infof("config: %v", render.Render(flags))
+			klog.Infof("Config: %v", render.Render(flags))

jgehrcke · 2025-10-11T13:03:03Z

tests/bats/helpers.sh

+  kubectl get pod \
+    -l nvidia-dra-driver-gpu-component=controller \
+    -n nvidia-dra-driver-gpu \
+      | grep -iv "NAME" \


note: ignore column headers

jgehrcke · 2025-10-11T13:03:27Z

tests/bats/tests.bats

  assert_output --partial "channel2047"
  assert_output --partial "channel222"
  kubectl delete -f demo/specs/imex/channel-injection-all.yaml
+  kubectl wait --for=delete pods imex-channel-injection-all --timeout=10s


flyby: wait for cleanup after test.

jgehrcke · 2025-10-11T13:03:30Z

tests/bats/tests.bats

  # Clean up.
  kubectl delete "${POD}"
  kubectl delete resourceclaim batssuite-rc-bad-opaque-config
+  kubectl wait --for=delete "${POD}" --timeout=10s


flyby: wait for cleanup after test.

jgehrcke · 2025-10-11T14:16:51Z

tests/bats/tests.bats

+  # new LOG_VERBOSITY_CD_DAEMON setting applies), and make sure controller
+  # deployment is still READY before moving on (make sure 1/1 READY).
+  CPOD_OLD="$(get_current_controller_pod_name)"
+  kubectl set env deployment nvidia-dra-driver-gpu-controller -n nvidia-dra-driver-gpu  LOG_VERBOSITY_CD_DAEMON=0


Until we have a flip-verbosity-at-runtime mechanism, this is probably the method we will recommend.

I started to document it here: https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Troubleshooting#controlling-log-verbosity

Squashed commit messages:

- Helm values.yaml: defaultLogVerbosity incl. docs
- values.yaml: tweak, based on in log level insights
- improve helm chart artifact commentary
- squash: tweak docs
- Rename chart var, start building tests
- tests: cover log verbosity set per-component via env
- helm: rename defaultLogVerbosity to logVerbosity

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

This also had a side effect on subsequent tests, with the controller starting with _no_ LOG_VERBOSITY environment variable set. I don't understand that, but that must be a funky Helm-ism.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke · 2025-10-11T14:51:32Z

Since last review:

Squashed commits (was: 27).
Updated tests (mainly: more robust waiting for controller flip).

Test suite passes: 18 tests, 0 failures in 201 seconds.

jgehrcke

Exhaustively discussed with Kevin and agreed that this is good to go. Landing this. Nice. (also need to move on from the topic, this was a lot :)).

Let's learn from these changes by observing how things behave in practice.

Let's anticipate more changes on this front in the near term.

github-project-automation bot added this to Planning Board: k8s-dra-driver-gpu Sep 30, 2025

github-project-automation bot moved this to Backlog in Planning Board: k8s-dra-driver-gpu Sep 30, 2025

jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Sep 30, 2025

jgehrcke added this to the v25.8.0 milestone Sep 30, 2025

jgehrcke added the usability (issue/pr related to UX) and debuggability (issue/pr related to the ability to debug the system) labels Sep 30, 2025

klueska assigned jgehrcke Sep 30, 2025

elezar reviewed Oct 2, 2025

klueska reviewed Oct 7, 2025

This was referenced Oct 7, 2025

log message consistency: start all messages with an upper-case character #645

Open

Use custom rate limiter for kubelet plugin (un)prepare retries #655

Closed

CD health check: adjust naming and log messages for clarity #657

Open

jgehrcke added 2 commits October 11, 2025 11:44

Add logVerbosity Helm chart parameter

a36e214

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

Change log message levels according to new system

8321983

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke force-pushed the jp/verbosity-vs-debuggability-improvements branch from 916b42d to cac5cd9 Compare October 11, 2025 13:01

jgehrcke added 14 commits October 11, 2025 14:28

Upper-case log msg, no explicit verb 0

5f143b2

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

Helm: expose kubelet plugin env via chart variables

4cc705a

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

tests: add note about instability around chart flip

c5b147b

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

CD controller: introduce LOG_VERBOSITY_CD_DAEMON

7e89b22

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

CD controller/plugins/daemon: introduce LOG_VERBOSITY

bb16c33

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

CD controller: log manager config on startup

84530ab

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

CD controller: make CD daemon verbosity a required arg

6d35ac1

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

CD daemon: change verbosity of "wait for nodes update" message

3828da9

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

tests: cover CD daemon log levels

9748095

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

tests: replace hard-coded sleep with dynamic wait

b172342

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

tests: fix: wait for controller flip

3d5c51f

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

Fix a typo in an error message

4cf3d9b

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke force-pushed the jp/verbosity-vs-debuggability-improvements branch from a6ef300 to 4cf3d9b Compare October 11, 2025 14:41

Remove newline, document env-based log verb flip

4ced422

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke force-pushed the jp/verbosity-vs-debuggability-improvements branch from 20f9365 to 4ced422 Compare October 11, 2025 14:48

jgehrcke merged commit c614e61 into NVIDIA:main Oct 11, 2025
7 checks passed

github-project-automation bot moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Oct 11, 2025
