fix: allow multiple CIFS krb5 mounts with the same CRUID#1052
fix: allow multiple CIFS krb5 mounts with the same CRUID#1052kaidohTips wants to merge 14 commits intokubernetes-csi:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kaidohTips The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @kaidohTips! |
|
Hi @kaidohTips. Thanks for your PR. I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Pull request overview
This PR updates the SMB CSI node server’s Kerberos cache/symlink setup to avoid race conditions when kubelet mounts multiple sec=krb5 volumes in parallel using the same cruid.
Changes:
- Avoids deleting an existing
krb5cc_<cruid>path when it already exists. - Treats
os.Symlink()“already exists” errors as non-fatal to tolerate concurrent creators.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@kaidohTips can you sign the easycla first? (select individual contributor) |
Hello @andyzhangx, Done. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
os.Stat follows the symlink and reports ENOENT when the target is gone, causing ensureKerberosCache to skip the Remove path and then fail with "file exists" in os.Symlink. Switch to os.Lstat so the symlink itself is inspected regardless of target validity. Add TestEnsureKerberosCacheConcurrent: runs parallel goroutines against ensureKerberosCache with a shared CRUID and asserts the final symlink points at a valid volume-specific cache file, locking in the per-CRUID serialization contract.
ensureKerberosCache relies on Linux-specific CIFS/Kerberos semantics (os.Chown with cruid as gid, symlink handling). Restrict the test to Linux only to avoid failures on macOS and Windows CI runners.
- Use os.Lstat + os.Readlink to properly validate symlinks instead of os.Stat - Accept any existing symlink pointing to a valid cache file (concurrent mount safe) - Use atomic temp-symlink + os.Rename instead of remove-then-create (no gap) - Remove TryAcquire lock that caused concurrent mounts to fail with Aborted - Include rich context (link name, target, credUID) in all error messages - Ensure consistent tab indentation (gofmt) - Update tests: valid symlink from previous volume is now kept, concurrent test no longer expects codes.Aborted
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Update function comment to accurately describe symlink reuse behavior - Only retry rename on os.IsExist; fail fast on permission/IO errors - Update test comment to reflect filesystem-based concurrency (no CRUID lock)
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Remove pre-removal of krb5CacheFileName; rely on os.Rename atomic replace - Skip ensureKerberosCache tests when non-root and uid != gid (Chown EPERM)
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
If krb5CacheFileName is unexpectedly a directory, os.Rename cannot atomically replace it with a symlink. Explicitly remove directories before attempting the temp-symlink + rename pattern.
|
@kaidohTips: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a critical bug in the
ensureKerberosCachefunction that occurs when a single pod attempt to mount multiple -kerberos volumes simultaneously using the samecruidProblem:
When kubelet triggers parallel mounts for multiple volumes sharing the same cruid, multiple threads enter
ensureKerberosCachesimultaneously:os.Remove()thekrb5cc_<cruid>symlink if it exists.This means one thread can destroy a valid symlink just before another thread executes the Linux
mountcommand, resulting inmount error(126): Required key not available.os.Symlink()at the exact same millisecond, causing subsequent threads to crash with afile existserror, which aborts the mount process.Solution implemented:
os.Statcheck to verify if the symlink already exists and points to a valid file. If it does, the thread leaves it intact instead of removing it, preventing threads from sabotaging each other.os.IsExist(err)error duringos.Symlink(). If the error is raised, it simply means a concurrent thread just created the symlink a microsecond earlier, allowing the mount process to continue successfully.Which issue(s) this PR fixes:
Fixes # None
Requirements:
Special notes for your reviewer:
This fix was extensively tested on a production cluster
I scheduled a pod in a(JupyterHub environment) configured with 4 simultaneous volumes
PersistentVolumesusingsec=krb5and the samecruid.file existsorRequired key not availableerrorsRelease note: