Skip to content

remove shell dependency from the driver image#972

Open
leiyiz wants to merge 2 commits intokubernetes-sigs:mainfrom
leiyiz:sh-to-go
Open

remove shell dependency from the driver image#972
leiyiz wants to merge 2 commits intokubernetes-sigs:mainfrom
leiyiz:sh-to-go

Conversation

@leiyiz
Copy link
Copy Markdown
Contributor

@leiyiz leiyiz commented Mar 30, 2026

Feature proposal: #912

manually tested:

  1. built and deployed to a k8s cluster, devices show up ok in ResourceSlice
  2. created a pod using such gpu resource and it is successfully scheduled , ran nvidia-smi inside its container and exited.

This commit aims to faithfully rewrite the shell scripts and bash in-line pre-start scripts in golang while keeping the error handling and debuggability the same.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

// once the driver becomes mounted (e.g., once GPU operator provides the driver
// on the host at /run/nvidia/driver).
fmt.Printf("create symlink: /driver-root -> %s\n", driverRootPath)
_ = os.Remove("/driver-root")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a syscall fails we should at least log the failure instead of silently swallowing it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, actually thinking about this a bit more, removing this line would be better match to the ln -s behavior in shell script. wdyt

// Main intent: help users to self-troubleshoot when the GPU driver is not set up
// properly before installing this DRA driver. In that case, the log of the init
// container running this script is meant to yield an actionable error message.
// For now, rely on k8s to implement a high-level retry with back-off.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment might be out of date

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rely on k8s to implement a high-level retry with back-off

this bit does seem to be out of date with the forever loop. That's true for the shell script too, i can modify both to reflect that.

do you have other modifications in mind? other bits seems fine to me

}

driverRootParent := "/driver-root-parent"
// Remove trailing slash (if existing) and get last path element.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is

$(basename "${NVIDIA_DRIVER_ROOT%/}")

really equivalent to

filepath.Base(nvidiaDriverRoot)

?

Can you update the code comment to make it more helpful in this current go implementation? I am not sure if discussing trailing slashes is even still relevant here.

Copy link
Copy Markdown
Contributor Author

@leiyiz leiyiz Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I did some simple shell testing before and they seem to be identical. filepath.Base removes trailing slashes by default: https://pkg.go.dev/path/filepath#Base.

discussing trailing slashes is even still relevant here

I don't think they are, but this comment is copied from the original shell script and I wanted to keep as much of things the same as possible. I do agree it's not as helpful here.. perhaps "filepath.Base removes trailing slashes" would be better.

Copy link
Copy Markdown
Contributor Author

@leiyiz leiyiz Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a grant scheme of things I'll do a pass over comments and make sure they are updated to the golang impl

// MaskNvidiaDriverParams conditionally masks the params file to prevent this container from
// recreating any missing GPU device nodes. This is necessary, for example, when running
// under nvkind to limit the set GPUs governed by the plugin even though it has cgroup
// access to all of them.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still relevant?

If this is relevant, I think this should be tested at least manually before we can confidently land the patch.

Copy link
Copy Markdown
Contributor Author

@leiyiz leiyiz Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked if this is relevant and just trusted the bash would do the right thing and matched the golang to it.

@klueska I see you wrote this part in #259, can you give me some contexts or pointers to why is this needed / what did you do to test it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connecting some dots: #366

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't be outlandish to have a unit test around this with mock source/dst and real(istic) contents in it.

@jgehrcke
Copy link
Copy Markdown
Contributor

@leiyiz thanks for proposing something! I hope you don't take offense when I ask how much you've used AI for doing the translation, and if you have used AI I wonder how much you've double-checked every part of this diff yourself.

Knowing exactly how this patch came to be and how much testing work and self-review you've applied is important for me/us to understand how much attention we have to pay to details.

I don't want to be a PITA here and I trust that you're doing the right thing. However, I think it's particularly important that we are extra careful here in terms of review. We absolutely don't want to break an environment that currently works. None of the logic is covered by automated tests, and manual tests are a lot of work. That is, if we break something here, it's very likely that we won't catch that during QA.

As part of a careful re-write, I think it's important to also amend code comments where required!

Thank you.

@leiyiz
Copy link
Copy Markdown
Contributor Author

leiyiz commented Mar 31, 2026

Of course! I've used AI to generate this PR and read each line of code it generated side by side with the original shell script myself and rewrote parts where mistake were found. But I'll have to admit I'm not very familiar with why/how/what the scripts were doing and the details like expected filesystem folder structure etc. So i've mostly tried to stick closely to how the original shell is written today.

None of the logic is covered by automated tests, and manual tests are a lot of work.

Apologies! I kinda assumed that there'll be e2e testing on real k8s clusters with GPUs for different code paths so I didn't think deeply about automated tests around this.

With golang version of the code I think unit testing would be a good add that I can work on, but in the mean time, what kinds of tests were covering the shell scripts? I'd love some pointers so I can understand what coverage are lost.

@leiyiz
Copy link
Copy Markdown
Contributor Author

leiyiz commented Mar 31, 2026

I also realized that I had some helm testing image tags sneak through.. will address that

leiyiz added 2 commits March 31, 2026 16:49
rewrite bash scripts with golang

Signed-off-by: Léiyì Zhang <leiyiz@google.com>
Signed-off-by: Léiyì Zhang <leiyiz@google.com>
@jgehrcke
Copy link
Copy Markdown
Contributor

jgehrcke commented Apr 1, 2026

what kinds of tests were covering the shell scripts?

Manual tests (I can only speak for the init container script though). Because in absence of automated tests, we still need to verify that things are working, right? :-)

You can have a look at #389 to understand more about context/history.

@jgehrcke
Copy link
Copy Markdown
Contributor

jgehrcke commented Apr 1, 2026

Apologies! I kinda assumed that there'll be e2e testing

No worries! The absence of exhaustive test coverage of course makes this task special.

I have been wondering about this change in terms of impact per effort especially because there are no automated tests.

We try to gain something here that I think is hard to quantify; and there is a lot we can lose (we may break users, we may lose debuggability). In my review, I want to help keeping the "what we can lose" dimension under control. By putting in some additional testing and reasoning work. After we've reasoned through the potential of breaking something, we still need to reason through debuggability.

We had a brief discussion about debuggability around here. My point is that someone really should try/do a bit of debugging with the new model. So that the first time doing it will not be in prod.

@k8s-triage-robot
Copy link
Copy Markdown

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 2, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2026
@dims
Copy link
Copy Markdown
Member

dims commented Apr 5, 2026

needs a rebase please!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants