remove shell dependency from the driver image#972
remove shell dependency from the driver image#972leiyiz wants to merge 2 commits intokubernetes-sigs:mainfrom
Conversation
cmd/gpu-kubelet-plugin/prestart.go
Outdated
| // once the driver becomes mounted (e.g., once GPU operator provides the driver | ||
| // on the host at /run/nvidia/driver). | ||
| fmt.Printf("create symlink: /driver-root -> %s\n", driverRootPath) | ||
| _ = os.Remove("/driver-root") |
There was a problem hiding this comment.
If a syscall fails we should at least log the failure instead of silently swallowing it.
There was a problem hiding this comment.
yeah, actually thinking about this a bit more, removing this line would be better match to the ln -s behavior in shell script. wdyt
| // Main intent: help users to self-troubleshoot when the GPU driver is not set up | ||
| // properly before installing this DRA driver. In that case, the log of the init | ||
| // container running this script is meant to yield an actionable error message. | ||
| // For now, rely on k8s to implement a high-level retry with back-off. |
There was a problem hiding this comment.
the comment might be out of date
There was a problem hiding this comment.
rely on k8s to implement a high-level retry with back-off
this bit does seem to be out of date with the forever loop. That's true for the shell script too, i can modify both to reflect that.
do you have other modifications in mind? other bits seems fine to me
cmd/gpu-kubelet-plugin/prestart.go
Outdated
| } | ||
|
|
||
| driverRootParent := "/driver-root-parent" | ||
| // Remove trailing slash (if existing) and get last path element. |
There was a problem hiding this comment.
Is
$(basename "${NVIDIA_DRIVER_ROOT%/}")
really equivalent to
filepath.Base(nvidiaDriverRoot)
?
Can you update the code comment to make it more helpful in this current go implementation? I am not sure if discussing trailing slashes is even still relevant here.
There was a problem hiding this comment.
Yeah I did some simple shell testing before and they seem to be identical. filepath.Base removes trailing slashes by default: https://pkg.go.dev/path/filepath#Base.
discussing trailing slashes is even still relevant here
I don't think they are, but this comment is copied from the original shell script and I wanted to keep as much of things the same as possible. I do agree it's not as helpful here.. perhaps "filepath.Base removes trailing slashes" would be better.
There was a problem hiding this comment.
in a grant scheme of things I'll do a pass over comments and make sure they are updated to the golang impl
| // MaskNvidiaDriverParams conditionally masks the params file to prevent this container from | ||
| // recreating any missing GPU device nodes. This is necessary, for example, when running | ||
| // under nvkind to limit the set GPUs governed by the plugin even though it has cgroup | ||
| // access to all of them. |
There was a problem hiding this comment.
Is this still relevant?
If this is relevant, I think this should be tested at least manually before we can confidently land the patch.
There was a problem hiding this comment.
It wouldn't be outlandish to have a unit test around this with mock source/dst and real(istic) contents in it.
|
@leiyiz thanks for proposing something! I hope you don't take offense when I ask how much you've used AI for doing the translation, and if you have used AI I wonder how much you've double-checked every part of this diff yourself. Knowing exactly how this patch came to be and how much testing work and self-review you've applied is important for me/us to understand how much attention we have to pay to details. I don't want to be a PITA here and I trust that you're doing the right thing. However, I think it's particularly important that we are extra careful here in terms of review. We absolutely don't want to break an environment that currently works. None of the logic is covered by automated tests, and manual tests are a lot of work. That is, if we break something here, it's very likely that we won't catch that during QA. As part of a careful re-write, I think it's important to also amend code comments where required! Thank you. |
|
Of course! I've used AI to generate this PR and read each line of code it generated side by side with the original shell script myself and rewrote parts where mistake were found. But I'll have to admit I'm not very familiar with why/how/what the scripts were doing and the details like expected filesystem folder structure etc. So i've mostly tried to stick closely to how the original shell is written today.
Apologies! I kinda assumed that there'll be e2e testing on real k8s clusters with GPUs for different code paths so I didn't think deeply about automated tests around this. With golang version of the code I think unit testing would be a good add that I can work on, but in the mean time, what kinds of tests were covering the shell scripts? I'd love some pointers so I can understand what coverage are lost. |
|
I also realized that I had some helm testing image tags sneak through.. will address that |
rewrite bash scripts with golang Signed-off-by: Léiyì Zhang <leiyiz@google.com>
Signed-off-by: Léiyì Zhang <leiyiz@google.com>
Manual tests (I can only speak for the init container script though). Because in absence of automated tests, we still need to verify that things are working, right? :-) You can have a look at #389 to understand more about context/history. |
No worries! The absence of exhaustive test coverage of course makes this task special. I have been wondering about this change in terms of impact per effort especially because there are no automated tests. We try to gain something here that I think is hard to quantify; and there is a lot we can lose (we may break users, we may lose debuggability). In my review, I want to help keeping the "what we can lose" dimension under control. By putting in some additional testing and reasoning work. After we've reasoned through the potential of breaking something, we still need to reason through debuggability. We had a brief discussion about debuggability around here. My point is that someone really should try/do a bit of debugging with the new model. So that the first time doing it will not be in prod. |
|
Unknown CLA label state. Rechecking for CLA labels. Send feedback to sig-contributor-experience at kubernetes/community. /check-cla |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
needs a rebase please! |
Feature proposal: #912
manually tested:
ResourceSlicenvidia-smiinside its container and exited.This commit aims to faithfully rewrite the shell scripts and bash in-line pre-start scripts in golang while keeping the error handling and debuggability the same.