-
Notifications
You must be signed in to change notification settings - Fork 94
Add logVerbosity Helm chart parameter, reduce default log verbosity
#633
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add logVerbosity Helm chart parameter, reduce default log verbosity
#633
Conversation
|
I will chunk this PR into individual patches. |
| # - Checkpoint file updates | ||
| # - Kubelet plugin GRPC request/response detail | ||
| # | ||
| logVerbosity: "1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hesitant to add this as a top-level helm variable. Can we start with just exposing an envvar on the relevant components (without any helm value at all), with the default set to (1) in the code if no envvar is passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start with just exposing an envvar on the relevant components
We can do that. But let's work towards a specific goal:
I'd love to have a simple way to set the log verbosity across all components (either upon Helm install, or right thereafter).
With that env var: are those ergonomics good enough for us, for actually using that?
How will we (you, I) use this e.g. in a test suite, and in debugging?
Where/how and when do we set that env var?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hesitant to add this as a top-level helm variable
and a (smiling!) side note is that we thought of this task as "allow verbosity of kubelet plugins and controller to be set via helm" (#609)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still set the envvars via helm. There is just no explicit helm value added to the helm API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we take a hybrid approach. Rename this to defaultLogVerbosity and then allow it to be overridden by an envvar on a component by component basis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still set the envvars via helm
Sorry for digging in -- but how exactly would we imagine a user to do that?
Are you thinking about e.g.
| env: [] |
and then doing something like --set controller.containers.computeDomain.env.LOG_VERBOSITY=5?
Rename this to defaultLogVerbosity
We can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow it to be overridden by an envvar on a component by component basis.
We can also do that! But we can maybe also figure that one out in another patch.
I looked into doing this (set verbosity via environment) when starting to work on this patch. What I found rather disappointing: for regular klog/v2, setting the -v CLI flag is the only way to set the verbosity, and there is no API for changing it during runtime.
Of course, there is a way to mutate the parsed CLI flags say (based on an env var) before having klog interpret them. But, that kind of approach may almost be appropriate for an obfuscation contest.
Edit: maybe something like this works (using an env var in the command specification)?
command:
- foo
- --v=$(LOG_VERBOSITY)
env:
- name: LOG_VERBOSITY
value: "4"Because I just found:
Kubernetes uses round parentheses for environment variable substitution in Pod command and args fields
Related topic: dynamically changing verbosity at runtime: I have seen a few pointers here and maybe nowadays there may be a better (even if not yet properly documented) way to dynamically set the verbosity (such as based on env var).
I want to explore other ways for changing the verbosity during runtime, and for that I think a starting point would be kubernetes/klog#368 and related resources -- klog is really limited in many ways compared to other logging libraries.
Other pointers (related, but not much of help IMO):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename this to defaultLogVerbosity
I have done this now.
But are we sure about that variable name? I think the "default" prefix doesn't really help.
In configuration management, often there is an order of precedence -- the same parameter can be specified on various layers. Those layers then have a well-known hierarchical order.
I want to say that it's not uncommon to have an outer "logVerbosity" setting that can be overridden via some more "inner" mechanism, using the same parameter name.
(I mean, of course you know all that -- I just want to spell it out explicitly here, that I think this is one such case)
Another point of view, terminology: when not specifying --set defaultLogVerbosity=X -- what takes effect then? The default defaultLogVerbosity 🤔.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doing something like
--set controller.containers.computeDomain.env.LOG_VERBOSITY=5?
Update: well, that doesn't quite work.
I have now adjusted the patch that one can do a per-component override.
The clumsy way to do that (tested):
helm install ... \
--set controller.containers.computeDomain.env[0].name=LOG_VERBOSITY \
--set-string controller.containers.computeDomain.env[0].value=6 \
...Well, --set-string or
--set 'controller.containers.computeDomain.env[0].value="6"'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self. Also viable:
kubectl set env pod/<pname> name=value
Triggers a pod restart.
| // history), checks the global rate against the token bucket, and picks the | ||
| // longest delay from either strategy, ensuring that both per-item and overall | ||
| // queue health are respected. | ||
| func DefaultPrepUnprepRateLimiter() workqueue.TypedRateLimiter[any] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call this DefaultKubeletPluginRateLimiter to be symmetrical with DefaultControllerRateLimiter.
Also, can you explain how this change is relevant to a PR focused on logging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain how this change is relevant to a PR focused on logging?
As many times, I kindly refer to my PR description :-)
#633 (comment), point (3).
As indicated in that PR description and also in another comment: I am happy to distribute the commits across different PRs to keep things separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- took this part of the patch to Introduce DefaultPrepUnprepRateLimiter (less aggressive) #656
- took this part of the discussion to https://github.com/NVIDIA/k8s-dra-driver-gpu/pull/656/files#r2413832932
| // Runs after `Action` (regardless of success/error). In urfave cli | ||
| // v2, the final error reported will be from either Action, Before, | ||
| // or After (whichever is non-nil and last executed). | ||
| klog.Infof("shutdown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| klog.Infof("shutdown") | |
| klog.Infof("Shutdown") |
cmd/gpu-kubelet-plugin/main.go
Outdated
| return flags.loggingConfig.Apply() | ||
| }, | ||
| Action: func(c *cli.Context) error { | ||
| klog.Infof("config: %v", flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| klog.Infof("config: %v", flags) | |
| klog.Infof("Config: %v", flags) |
| // Runs after `Action` (regardless of success/error). In urfave cli | ||
| // v2, the final error reported will be from either Action, Before, | ||
| // or After (whichever is non-nil and last executed). | ||
| klog.Infof("shutdown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| klog.Infof("shutdown") | |
| klog.Infof("Shutdown") |
| }, | ||
| Action: func(c *cli.Context) error { | ||
| ctx := c.Context | ||
| klog.Infof("config: %v", render.Render(flags)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| klog.Infof("config: %v", render.Render(flags)) | |
| klog.Infof("Config: %v", render.Render(flags)) |
| // Check implements [grpc_health_v1.HealthServer]. | ||
| // Check implements [grpc_health_v1.HealthServer.Check]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No -- the comment explains that this implements the grpc_health_v1.HealthServer interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But Check does not implement the full HealthServer interface, right? If something claims to implement an interface, I'd expect it not only implement a part of that interface. That seems to be a language / convention aspect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
carried to #657
| } | ||
|
|
||
| func startHealthcheck(ctx context.Context, config *Config) (*healthcheck, error) { | ||
| func setupHealthcheckPrimitives(ctx context.Context, config *Config) (*healthcheck, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? Does it not start the healthcheck server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notably, this does not start a health check, as I found by following the code.
Which is why I do not like the name "startHealthcheck" (I was misled by that, I thought this starts a health check and I wondered why).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
carried to #657
| results[claim.UID] = err | ||
| wg.Done() | ||
| if err != nil { | ||
| klog.V(0).Infof("Permanent error unpreparing devices for claim %v: %v", claim.UID, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the V(0) be dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can the log be done inside nodeUnprepareResource so as not to clutter things here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, level 0 is implicit when doing klog.Infof().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the V(0) be dropped?
I remember: I did this to make this explicit -- to set the precedent for always using an explicit level, to enhance code readability.
Because this is a question one naturally has when reading code: what level does Info on log by default? One needs to have that additional knowledge.
But I will remove this now again to eradicate a potential point of friction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the log be done inside nodeUnprepareResource
In d.nodeUnprepareResource(ctx, claim) we return isPermanentError(err) directly w/o inspecting its return value. We only look at done here, at the call site.
We can change this if course, but let's not do this here.
| // Prepare+Checkpoint are done transactionally). Note that | ||
| // claimRef.String() contains namespace, name, UID. | ||
| klog.Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String()) | ||
| klog.V(2).Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| klog.V(2).Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String()) | |
| klog.V(2).Infof("Unprepare noop: claim not found in checkpoint data: %v", claimRef.String()) |
| // Runs after `Action` (regardless of success/error). In urfave cli | ||
| // v2, the final error reported will be from either Action, Before, | ||
| // or After (whichever is non-nil and last executed). | ||
| klog.Infof("shutdown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| klog.Infof("shutdown") | |
| klog.Infof("Shutdown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like consistency.
We have currently many log messages starting with a lower-case character:
$ grep -nR 'klog' | grep -E '\(\"[[:lower:]]' | wc -l
62
That's spread across levels:
$ grep -nR 'klog' | grep -E '\(\"[[:lower:]]' | grep Info | wc -l
35
We should do this in a follow-up; and maybe add a lint / check in CI.
| return flags.loggingConfig.Apply() | ||
| }, | ||
| Action: func(c *cli.Context) error { | ||
| klog.Infof("config: %v", render.Render(flags)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| klog.Infof("config: %v", render.Render(flags)) | |
| klog.Infof("Config: %v", render.Render(flags)) |
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
916b42d to
cac5cd9
Compare
| kubectl get pod \ | ||
| -l nvidia-dra-driver-gpu-component=controller \ | ||
| -n nvidia-dra-driver-gpu \ | ||
| | grep -iv "NAME" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: ignore column headers
| assert_output --partial "channel2047" | ||
| assert_output --partial "channel222" | ||
| kubectl delete -f demo/specs/imex/channel-injection-all.yaml | ||
| kubectl wait --for=delete pods imex-channel-injection-all --timeout=10s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flyby: wait for cleanup after test.
| # Clean up. | ||
| kubectl delete "${POD}" | ||
| kubectl delete resourceclaim batssuite-rc-bad-opaque-config | ||
| kubectl wait --for=delete "${POD}" --timeout=10s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flyby: wait for cleanup after test.
| # new LOG_VERBOSITY_CD_DAEMON setting applies), and make sure controller | ||
| # deployment is still READY before moving on (make sure 1/1 READY). | ||
| CPOD_OLD="$(get_current_controller_pod_name)" | ||
| kubectl set env deployment nvidia-dra-driver-gpu-controller -n nvidia-dra-driver-gpu LOG_VERBOSITY_CD_DAEMON=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Until we have a flip-verbosity-at-runtime mechanism, this is probably the method we will recommend.
I started to document it here: https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Troubleshooting#controlling-log-verbosity
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Helm values.yaml: defaultLogVerbosity incl. docs Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]> values.yaml: tweak, based on in log level insights Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]> improve helm chart artifact commentary Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]> squash: tweak docs Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]> Rename chart var, start building tests Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]> tests: cover log verbosity set per-component via env Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]> helm: rename defaultLogVerbosity to logVerbosity Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This also had a side effect on subsequent tests, with the controller starting with _no_ LOG_VERBOSITY environment variable set. I don't understand that, but that must be a funky Helm-ism. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
a6ef300 to
4cf3d9b
Compare
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
20f9365 to
4ced422
Compare
|
Since last review:
Test suite passes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exhaustively discussed with Kevin and agreed that this is good to go. Landing this. Nice. (also need to move on from the topic, this was a lot :)).
Let's learn from these changes by observing how things behave in practice.
Let's anticipate more changes on this front in the near term.
Resolves #609.
This PR has a set of related changes that we can also discuss in separate PRs if preferred:
For better debuggability: log full component config (
flagsobject) upon component startup (withgo-renderto quickly get to a stringified version of a more or less complex, nested struct -- open do using a different strategy).Edit: now here: Log full startup config in all CLIs in
Beforehook #658For control: introduction of a component-global
logVerbosityHelm chart parameter, including documentation laying out the starting point for a verbosity system (comments very welcome)For less noise in default config:
logVerbositylevel documentation).For robustness, explicit log flushing as part of component shutdown (I think we missed this so far).
Edit: now here: Flush logs in CLI app
Afterhook #661One interesting change that I propose here is to not have those "updated/added object callback" confirmations logged on the default log level in the CD controller -- I think we should try to not scale log volume with number of objects created (at least in default config).