@jgehrcke
Collaborator

@jgehrcke jgehrcke commented Sep 30, 2025

Resolves #609.

This PR has a set of related changes that we can also discuss in separate PRs if preferred:

  1. For better debuggability: log full component config (flags object) upon component startup (with go-render to quickly get to a stringified version of a more or less complex, nested struct -- open to using a different strategy).
    Edit: now here: Log full startup config in all CLIs in Before hook #658

  2. For control: introduction of a component-global logVerbosity Helm chart parameter, including documentation laying out the starting point for a verbosity system (comments very welcome)

  3. For less noise in default config:

    • Not as much chatter anymore during runtime (with level 1 being the default, and some proposals made for what level 1 should contain -- see logVerbosity level documentation).
    • Re-parametrization of the work queue rate limiter for prepare/unprepare retries to be slightly slower -- we do not need to retry ~10 times per second initially when expected time-to-completion is O(1 s) or slower anyway. I made this change with log verbosity being a motivation, but I believe architecturally this change also makes sense and might even be important (discussed elsewhere). I chose the current parameters almost flying blind. If in the future we ever find the work queue rate limiter to introduce unnecessary latency (as part of latency/performance tuning): let's of course adjust this again. Edit: now here: Introduce DefaultPrepUnprepRateLimiter (less aggressive) #656
  4. For robustness, explicit log flushing as part of component shutdown (I think we missed this so far).
    Edit: now here: Flush logs in CLI app After hook #661

One interesting change that I propose here is to not have those "updated/added object callback" confirmations logged on the default log level in the CD controller -- I think we should try to not scale log volume with number of objects created (at least in default config).

@copy-pr-bot

copy-pr-bot bot commented Sep 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Sep 30, 2025
@jgehrcke jgehrcke added this to the v25.8.0 milestone Sep 30, 2025
@jgehrcke jgehrcke added the usability (issue/pr related to UX) and debuggability (issue/pr related to the ability to debug the system) labels Sep 30, 2025
@jgehrcke
Collaborator Author

jgehrcke commented Oct 7, 2025

I will chunk this PR into individual patches.

# - Checkpoint file updates
# - Kubelet plugin GRPC request/response detail
#
logVerbosity: "1"
Collaborator

I'm hesitant to add this as a top-level helm variable. Can we start with just exposing an envvar on the relevant components (without any helm value at all), with the default set to 1 in the code if no envvar is passed?

Collaborator Author

@jgehrcke jgehrcke Oct 7, 2025

start with just exposing an envvar on the relevant components

We can do that. But let's work towards a specific goal:

I'd love to have a simple way to set the log verbosity across all components (either upon Helm install, or right thereafter).

With that env var: are those ergonomics good enough for us, for actually using that?

How will we (you, I) use this e.g. in a test suite, and in debugging?

Where/how and when do we set that env var?

Collaborator Author

hesitant to add this as a top-level helm variable

and a (smiling!) side note is that we thought of this task as "allow verbosity of kubelet plugins and controller to be set via helm" (#609)

Collaborator

You still set the envvars via helm. There is just no explicit helm value added to the helm API.

Collaborator

Maybe we take a hybrid approach. Rename this to defaultLogVerbosity and then allow it to be overridden by an envvar on a component by component basis.

Collaborator Author

You still set the envvars via helm

Sorry for digging in -- but how exactly would we imagine a user to do that?

Are you thinking about e.g.

and then doing something like --set controller.containers.computeDomain.env.LOG_VERBOSITY=5?

Rename this to defaultLogVerbosity

We can do that.

Collaborator Author

@jgehrcke jgehrcke Oct 8, 2025

allow it to be overridden by an envvar on a component by component basis.

We can also do that! But we can maybe also figure that one out in another patch.

I looked into doing this (set verbosity via environment) when starting to work on this patch. What I found rather disappointing: for regular klog/v2, setting the -v CLI flag is the only way to set the verbosity, and there is no API for changing it during runtime.

Of course, there is a way to mutate the parsed CLI flags (say, based on an env var) before having klog interpret them. But that kind of approach may almost be appropriate for an obfuscation contest.
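For illustration, a minimal sketch of what such an env-var-to-flag bridge could look like, using only the standard library `flag` package. The `v` flag registered here is a stand-in for klog's own `-v` flag, and the function name `resolveVerbosity` and the variable name `LOG_VERBOSITY` are assumptions made for this example, not code from the patch.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// resolveVerbosity registers a stand-in "v" flag, seeds it from the
// environment variable envName (if set and non-empty), and then parses args,
// so that an explicit -v on the command line still wins over the environment.
// lookup is injected (os.LookupEnv in real use) to keep the sketch testable.
func resolveVerbosity(args []string, envName string, lookup func(string) (string, bool)) (int, error) {
	fs := flag.NewFlagSet("component", flag.ContinueOnError)
	// Stand-in for klog's -v flag. With klog, klog.InitFlags(fs) would
	// register the real flag on this flag set instead.
	v := fs.Int("v", 1, "log verbosity")

	if val, ok := lookup(envName); ok && val != "" {
		if err := fs.Set("v", val); err != nil {
			return 0, fmt.Errorf("invalid %s=%q: %w", envName, val, err)
		}
	}
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *v, nil
}

func main() {
	// LOG_VERBOSITY is a hypothetical variable name for this sketch.
	v, err := resolveVerbosity(os.Args[1:], "LOG_VERBOSITY", os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("effective verbosity:", v)
}
```

Seeding the flag before parsing makes the environment act as a default rather than as a hard override. Whether that ordering is what we want is a design choice, not something this sketch settles.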

Edit: maybe something like this works (using an env var in the command specification)?

command:
- foo
- --v=$(LOG_VERBOSITY)
env:
- name: LOG_VERBOSITY
  value: "4"

Because I just found:

Kubernetes uses round parentheses for environment variable substitution in Pod command and args fields

Related topic: dynamically changing verbosity at runtime. I have seen a few pointers here, and nowadays there may be a better (even if not yet properly documented) way to set the verbosity dynamically (such as based on an env var).

I want to explore other ways for changing the verbosity during runtime, and for that I think a starting point would be kubernetes/klog#368 and related resources -- klog is really limited in many ways compared to other logging libraries.

Other pointers (related, but not much of help IMO):

Collaborator Author

@jgehrcke jgehrcke Oct 9, 2025

Rename this to defaultLogVerbosity

I have done this now.

But are we sure about that variable name? I think the "default" prefix doesn't really help.

In configuration management, often there is an order of precedence -- the same parameter can be specified on various layers. Those layers then have a well-known hierarchical order.

I want to say that it's not uncommon to have an outer "logVerbosity" setting that can be overridden via some more "inner" mechanism, using the same parameter name.

(I mean, of course you know all that -- I just want to spell it out explicitly here, that I think this is one such case)

Another point of view, terminology: when not specifying --set defaultLogVerbosity=X -- what takes effect then? The default defaultLogVerbosity 🤔.
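To spell out the layered precedence described above, here is a small sketch. The type `setting`, the function `resolve`, and the layer names are hypothetical and only illustrate the idea that an inner layer overrides an outer one, with a built-in default as the last resort.

```go
package main

import "fmt"

// setting is one configuration layer; set reports whether the layer
// specifies a value at all.
type setting struct {
	source string
	value  string
	set    bool
}

// resolve walks the layers from innermost (highest precedence) to outermost
// and returns the first value that is set, falling back to builtin.
func resolve(builtin string, layers ...setting) (value, source string) {
	for _, l := range layers {
		if l.set {
			return l.value, l.source
		}
	}
	return builtin, "built-in default"
}

func main() {
	// Hypothetical layers: a per-component env var (not set here) and a
	// chart-level value (set to 4).
	value, source := resolve("1",
		setting{source: "component env LOG_VERBOSITY"},
		setting{source: "chart value logVerbosity", value: "4", set: true},
	)
	fmt.Printf("verbosity %s (from %s)\n", value, source)
}
```

With this model the same parameter name can appear on every layer, and the question "what takes effect when nothing is specified" has one answer: the built-in default.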

Collaborator Author

@jgehrcke jgehrcke Oct 9, 2025

doing something like --set controller.containers.computeDomain.env.LOG_VERBOSITY=5?

Update: well, that doesn't quite work.

I have now adjusted the patch so that one can do a per-component override.

The clumsy way to do that (tested):

helm install ... \
  --set controller.containers.computeDomain.env[0].name=LOG_VERBOSITY \
  --set-string controller.containers.computeDomain.env[0].value=6 \
  ...

Well, --set-string or

--set 'controller.containers.computeDomain.env[0].value="6"'

Collaborator Author

Note to self. Also viable:

kubectl set env pod/<pname> name=value

Triggers a pod restart.

// history), checks the global rate against the token bucket, and picks the
// longest delay from either strategy, ensuring that both per-item and overall
// queue health are respected.
func DefaultPrepUnprepRateLimiter() workqueue.TypedRateLimiter[any] {
Collaborator

Can we call this DefaultKubeletPluginRateLimiter to be symmetrical with DefaultControllerRateLimiter?
Also, can you explain how this change is relevant to a PR focused on logging?

Collaborator Author

@jgehrcke jgehrcke Oct 7, 2025

can you explain how this change is relevant to a PR focused on logging?

As many times, I kindly refer to my PR description :-)

#633 (comment), point (3).

As indicated in that PR description and also in another comment: I am happy to distribute the commits across different PRs to keep things separate.
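For readers without the diff at hand, the rate limiter comment quoted above (per-item backoff combined with a global token bucket, taking the longer delay) can be sketched roughly as follows. This is a simplified standard-library model, not the client-go workqueue implementation, and all names and parameter values are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// itemBackoff returns the per-item exponential backoff: base doubled once
// per previous failure, capped at max.
func itemBackoff(failures int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < failures && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// tokenBucket is a minimal global limiter: tokens refill at rate per second
// up to burst, and every retry consumes one token.
type tokenBucket struct {
	rate   float64
	burst  float64
	tokens float64
	last   time.Time
}

// reserve takes one token at time now and returns how long the caller has
// to wait until that token is actually available.
func (b *tokenBucket) reserve(now time.Time) time.Duration {
	elapsed := now.Sub(b.last).Seconds()
	b.tokens = math.Min(b.burst, b.tokens+elapsed*b.rate)
	b.last = now
	b.tokens--
	if b.tokens >= 0 {
		return 0
	}
	return time.Duration(-b.tokens / b.rate * float64(time.Second))
}

// retryDelay picks the longer delay of the two strategies, so that both
// per-item history and overall queue rate are respected.
func retryDelay(b *tokenBucket, now time.Time, failures int, base, max time.Duration) time.Duration {
	perItem := itemBackoff(failures, base, max)
	if global := b.reserve(now); global > perItem {
		return global
	}
	return perItem
}

func main() {
	// Hypothetical parameters, not the values chosen in the patch.
	b := &tokenBucket{rate: 5, burst: 10, tokens: 10, last: time.Now()}
	for failures := 0; failures < 5; failures++ {
		d := retryDelay(b, time.Now(), failures, 250*time.Millisecond, 3*time.Second)
		fmt.Printf("failures=%d delay=%v\n", failures, d)
	}
}
```

The first retry delay is governed by `base`, which is the knob that determines how chatty the retry logging is when an operation is expected to take on the order of a second.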

// Runs after `Action` (regardless of success/error). In urfave cli
// v2, the final error reported will be from either Action, Before,
// or After (whichever is non-nil and last executed).
klog.Infof("shutdown")
Collaborator

Suggested change
klog.Infof("shutdown")
klog.Infof("Shutdown")

return flags.loggingConfig.Apply()
},
Action: func(c *cli.Context) error {
klog.Infof("config: %v", flags)
Collaborator

Suggested change
klog.Infof("config: %v", flags)
klog.Infof("Config: %v", flags)

// Runs after `Action` (regardless of success/error). In urfave cli
// v2, the final error reported will be from either Action, Before,
// or After (whichever is non-nil and last executed).
klog.Infof("shutdown")
Collaborator

Suggested change
klog.Infof("shutdown")
klog.Infof("Shutdown")

},
Action: func(c *cli.Context) error {
ctx := c.Context
klog.Infof("config: %v", render.Render(flags))
Collaborator

Suggested change
klog.Infof("config: %v", render.Render(flags))
klog.Infof("Config: %v", render.Render(flags))

Comment on lines 117 to 121
// Check implements [grpc_health_v1.HealthServer].
// Check implements [grpc_health_v1.HealthServer.Check].
Collaborator

No -- the comment explains that this implements the grpc_health_v1.HealthServer interface

Collaborator Author

@jgehrcke jgehrcke Oct 7, 2025

But Check does not implement the full HealthServer interface, right? If something claims to implement an interface, I'd expect it to implement more than just a part of that interface. That seems to be a language / convention aspect.
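A small sketch of the distinction, using a hypothetical two-method `HealthServer` interface as a stand-in for the generated gRPC one: in Go, a type satisfies an interface only once it has all of the interface's methods, while each individual method can be documented as implementing its counterpart.

```go
package main

import "fmt"

// HealthServer is a hypothetical stand-in for a generated gRPC server
// interface with more than one method.
type HealthServer interface {
	Check(service string) string
	Watch(service string) string
}

type healthcheck struct{}

// Check implements [HealthServer.Check].
func (h *healthcheck) Check(service string) string { return "SERVING" }

// Watch implements [HealthServer.Watch].
func (h *healthcheck) Watch(service string) string { return "UNIMPLEMENTED" }

// Compile-time assertion that *healthcheck implements the whole interface.
// Removing either method above turns this line into a compile error.
var _ HealthServer = (*healthcheck)(nil)

func main() {
	var s HealthServer = &healthcheck{}
	fmt.Println(s.Check("demo"), s.Watch("demo"))
}
```

Both doc comment styles appear in Go code bases. The per-method form is more precise, and the type-level claim can be backed by an assertion like the one above.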

Collaborator Author

carried to #657

}

func startHealthcheck(ctx context.Context, config *Config) (*healthcheck, error) {
func setupHealthcheckPrimitives(ctx context.Context, config *Config) (*healthcheck, error) {
Collaborator

Why this change? Does it not start the healthcheck server?

Collaborator Author

@jgehrcke jgehrcke Oct 7, 2025

Notably, this does not start a health check, as I found by following the code.

Which is why I do not like the name "startHealthcheck" (I was misled by that, I thought this starts a health check and I wondered why).

Collaborator Author

carried to #657

results[claim.UID] = err
wg.Done()
if err != nil {
klog.V(0).Infof("Permanent error unpreparing devices for claim %v: %v", claim.UID, err)
Collaborator

Can the V(0) be dropped?

Collaborator

Also, can the log be done inside nodeUnprepareResource so as not to clutter things here?

Collaborator Author

Yes, level 0 is implicit when doing klog.Infof().

Collaborator Author

@jgehrcke jgehrcke Oct 9, 2025

Can the V(0) be dropped?

I remember: I did this to make this explicit -- to set the precedent for always using an explicit level, to enhance code readability.

Because this is a question one naturally has when reading code: at what level does Info log by default? One needs to have that additional knowledge.

But I will remove this now again to eradicate a potential point of friction.

Collaborator Author

@jgehrcke jgehrcke Oct 9, 2025

can the log be done inside nodeUnprepareResource

In d.nodeUnprepareResource(ctx, claim) we return isPermanentError(err) directly w/o inspecting its return value. We only look at done here, at the call site.

We can change this of course, but let's not do this here.

// Prepare+Checkpoint are done transactionally). Note that
// claimRef.String() contains namespace, name, UID.
klog.Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String())
klog.V(2).Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String())
Collaborator

Suggested change
klog.V(2).Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String())
klog.V(2).Infof("Unprepare noop: claim not found in checkpoint data: %v", claimRef.String())

// Runs after `Action` (regardless of success/error). In urfave cli
// v2, the final error reported will be from either Action, Before,
// or After (whichever is non-nil and last executed).
klog.Infof("shutdown")
Collaborator

Suggested change
klog.Infof("shutdown")
klog.Infof("Shutdown")

Collaborator Author

@jgehrcke jgehrcke Oct 7, 2025

I like consistency.

We currently have many log messages starting with a lower-case character:

$ grep -nR 'klog' | grep -E '\(\"[[:lower:]]' | wc -l
62

That's spread across levels:

$ grep -nR 'klog' | grep -E '\(\"[[:lower:]]' | grep Info | wc -l
35

We should do this in a follow-up; and maybe add a lint / check in CI.
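As a starting point for such a check, here is a minimal sketch that applies the same pattern as the grep pipeline above to a piece of source text. The function name `findLowercaseLogs` is hypothetical, and a real lint would presumably walk the repository files (or use the Go AST) rather than scan a string.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// lowerMsg matches an opening parenthesis followed by a string literal that
// starts with a lower-case letter, mirroring the grep pattern \(\"[[:lower:]].
var lowerMsg = regexp.MustCompile(`\("[[:lower:]]`)

// findLowercaseLogs returns the 1-based numbers of all lines that mention
// klog and pass a lower-case message literal as the first argument.
func findLowercaseLogs(src string) []int {
	var hits []int
	for i, line := range strings.Split(src, "\n") {
		if strings.Contains(line, "klog") && lowerMsg.MatchString(line) {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	src := strings.Join([]string{
		`klog.Infof("shutdown")`,
		`klog.Infof("Shutdown")`,
		`klog.V(2).Infof("unprepare noop: %v", ref)`,
	}, "\n")
	fmt.Println("lines with lower-case log messages:", findLowercaseLogs(src))
}
```

A CI job could fail when the returned slice is non-empty, which would keep the capitalization convention from drifting again after a one-time cleanup.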

return flags.loggingConfig.Apply()
},
Action: func(c *cli.Context) error {
klog.Infof("config: %v", render.Render(flags))
Collaborator

Suggested change
klog.Infof("config: %v", render.Render(flags))
klog.Infof("Config: %v", render.Render(flags))

@jgehrcke jgehrcke force-pushed the jp/verbosity-vs-debuggability-improvements branch from 916b42d to cac5cd9 Compare October 11, 2025 13:01
kubectl get pod \
-l nvidia-dra-driver-gpu-component=controller \
-n nvidia-dra-driver-gpu \
| grep -iv "NAME" \
Collaborator Author

note: ignore column headers

assert_output --partial "channel2047"
assert_output --partial "channel222"
kubectl delete -f demo/specs/imex/channel-injection-all.yaml
kubectl wait --for=delete pods imex-channel-injection-all --timeout=10s
Collaborator Author

flyby: wait for cleanup after test.

# Clean up.
kubectl delete "${POD}"
kubectl delete resourceclaim batssuite-rc-bad-opaque-config
kubectl wait --for=delete "${POD}" --timeout=10s
Collaborator Author

flyby: wait for cleanup after test.

# new LOG_VERBOSITY_CD_DAEMON setting applies), and make sure controller
# deployment is still READY before moving on (make sure 1/1 READY).
CPOD_OLD="$(get_current_controller_pod_name)"
kubectl set env deployment nvidia-dra-driver-gpu-controller -n nvidia-dra-driver-gpu LOG_VERBOSITY_CD_DAEMON=0
Collaborator Author

Until we have a flip-verbosity-at-runtime mechanism, this is probably the method we will recommend.

I started to document it here: https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Troubleshooting#controlling-log-verbosity

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Helm values.yaml: defaultLogVerbosity incl. docs

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

values.yaml: tweak, based on in log level insights

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

improve helm chart artifact commentary

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

squash: tweak docs

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

Rename chart var, start building tests

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

tests: cover log verbosity set per-component via env

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

helm: rename defaultLogVerbosity to logVerbosity

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This also had a side effect on subsequent tests, with the
controller starting with _no_ LOG_VERBOSITY environment variable
set. I don't understand that, but that must be a funky Helm-ism.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
@jgehrcke jgehrcke force-pushed the jp/verbosity-vs-debuggability-improvements branch from a6ef300 to 4cf3d9b Compare October 11, 2025 14:41
@jgehrcke jgehrcke force-pushed the jp/verbosity-vs-debuggability-improvements branch from 20f9365 to 4ced422 Compare October 11, 2025 14:48
@jgehrcke
Collaborator Author

Since last review:

  • Squashed commits (was: 27).
  • Updated tests (mainly: more robust waiting for controller flip).

Test suite passes: 18 tests, 0 failures in 201 seconds.

Collaborator Author

@jgehrcke jgehrcke left a comment

Exhaustively discussed with Kevin and agreed that this is good to go. Landing this. Nice. (also need to move on from the topic, this was a lot :)).

Let's learn from these changes by observing how things behave in practice.

Let's anticipate more changes on this front in the near term.

@jgehrcke jgehrcke merged commit c614e61 into NVIDIA:main Oct 11, 2025
7 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Oct 11, 2025

Labels

debuggability issue/pr related to the ability to debug the system usability issue/pr related to UX

Projects

Development

Successfully merging this pull request may close these issues.

Allow verbosity of kubelet plugins and controller to be set via helm

3 participants