fix: allow multiple CIFS krb5 mounts with the same CRUID#1052

Open
kaidohTips wants to merge 14 commits into kubernetes-csi:master from kaidohTips:master

Conversation

@kaidohTips kaidohTips commented Apr 2, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a critical bug in the ensureKerberosCache function that occurs when a single pod attempts to mount multiple Kerberos (sec=krb5) volumes simultaneously using the same cruid.

Problem:

When kubelet triggers parallel mounts for multiple volumes sharing the same cruid, multiple threads enter ensureKerberosCache simultaneously:

  1. The original code calls os.Remove() on the krb5cc_<cruid> symlink if it exists.
    One thread can therefore delete a valid symlink just before another thread runs the Linux mount command, resulting in mount error(126): Required key not available.
  2. Multiple threads call os.Symlink() at nearly the same moment. Every thread after the first fails with a file exists error, which aborts the mount process.

Solution implemented:

  • Added an os.Stat check to verify if the symlink already exists and points to a valid file. If it does, the thread leaves it intact instead of removing it, preventing threads from sabotaging each other.
  • Added handling of the os.IsExist(err) error during os.Symlink(). If the error is raised, it simply means a concurrent thread just created the symlink a microsecond earlier, allowing the mount process to continue successfully.
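
The two bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/smb/nodeserver.go: ensureLink, demo, the file names, and cruid 1000 are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureLink is a hypothetical stand-in for the symlink step, not the
// driver's ensureKerberosCache. It keeps a link whose target still exists
// and treats "file exists" from os.Symlink as success.
func ensureLink(target, link string) error {
	// os.Stat follows the link, so success means the target is present.
	if _, err := os.Stat(link); err == nil {
		return nil
	}
	if err := os.Symlink(target, link); err != nil && !os.IsExist(err) {
		return fmt.Errorf("symlink %s -> %s: %w", link, target, err)
	}
	return nil
}

// demo calls ensureLink twice for the same link, as two mounts sharing a
// cruid would, and returns the base name of the file the link points at.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "krb5-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "krb5cc_volume-a") // hypothetical cache file
	if err := os.WriteFile(target, []byte("ticket"), 0o600); err != nil {
		return "", err
	}
	link := filepath.Join(dir, "krb5cc_1000") // hypothetical cruid 1000
	for i := 0; i < 2; i++ {
		if err := ensureLink(target, link); err != nil {
			return "", err
		}
	}
	resolved, err := os.Readlink(link)
	if err != nil {
		return "", err
	}
	return filepath.Base(resolved), nil
}

func main() {
	name, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("link points at", name)
}
```

Note that this sketch leaves a dangling link (one whose target has been deleted) in place, because os.Stat fails on it and the following os.Symlink error is swallowed.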

Which issue(s) this PR fixes:
Fixes # None

Requirements:

Special notes for your reviewer:
This fix was extensively tested on a production cluster.
I scheduled a pod (in a JupyterHub environment) configured with 4 PersistentVolumes mounted simultaneously, all using sec=krb5 and the same cruid.

  • Before this patch: The parallel mounts consistently failed with file exists or Required key not available errors
  • After this patch: All 4 volumes mount instantly and reliably without any collision

Release note:

Fix Kerberos cache symlink creation causing multiple SMB mounts to fail

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 2, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kaidohTips
Once this PR has been reviewed and has the lgtm label, please assign msau42 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@linux-foundation-easycla

linux-foundation-easycla Bot commented Apr 2, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @kaidohTips!

It looks like this is your first PR to kubernetes-csi/csi-driver-smb 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-csi/csi-driver-smb has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 2, 2026
@k8s-ci-robot
Contributor

Hi @kaidohTips. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 2, 2026
@andyzhangx andyzhangx requested a review from Copilot April 3, 2026 03:29
Copilot AI left a comment

Pull request overview

This PR updates the SMB CSI node server’s Kerberos cache/symlink setup to avoid race conditions when kubelet mounts multiple sec=krb5 volumes in parallel using the same cruid.

Changes:

  • Avoids deleting an existing krb5cc_<cruid> path when it already exists.
  • Treats os.Symlink() “already exists” errors as non-fatal to tolerate concurrent creators.

Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver.go Outdated
@andyzhangx
Member

@kaidohTips can you sign the easycla first? (select individual contributor)

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 3, 2026
@kaidohTips
Author

> @kaidohTips can you sign the easycla first? (select individual contributor)

Hello @andyzhangx,

Done.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 3, 2026
@andyzhangx andyzhangx requested a review from Copilot April 3, 2026 09:28
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


Comment thread pkg/smb/nodeserver.go Outdated
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 3, 2026
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver.go
@andyzhangx andyzhangx changed the title Allow multiple CIFS krb5 mounts with the same CRUID fix: allow multiple CIFS krb5 mounts with the same CRUID Apr 11, 2026
os.Stat follows the symlink and reports ENOENT when the target is gone,
causing ensureKerberosCache to skip the Remove path and then fail with
"file exists" in os.Symlink. Switch to os.Lstat so the symlink itself is
inspected regardless of target validity.

Add TestEnsureKerberosCacheConcurrent: runs parallel goroutines against
ensureKerberosCache with a shared CRUID and asserts the final symlink
points at a valid volume-specific cache file, locking in the per-CRUID
serialization contract.
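
The os.Stat versus os.Lstat difference described in this commit message can be reproduced in isolation. The sketch below is illustrative only and assumes a Linux host; danglingDemo and the file names are made up for the example and do not come from the driver.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// danglingDemo creates a symlink whose target does not exist and reports
// whether os.Stat and os.Lstat succeed on it, plus whether Lstat
// identifies the path as a symlink.
func danglingDemo() (bool, bool, bool, error) {
	dir, err := os.MkdirTemp("", "lstat-demo-")
	if err != nil {
		return false, false, false, err
	}
	defer os.RemoveAll(dir)

	link := filepath.Join(dir, "krb5cc_1000") // hypothetical cruid 1000
	missing := filepath.Join(dir, "missing-cache")
	if err := os.Symlink(missing, link); err != nil {
		return false, false, false, err
	}

	// os.Stat follows the link and fails because the target is gone.
	_, statErr := os.Stat(link)
	// os.Lstat inspects the link itself, so it still succeeds.
	fi, lstatErr := os.Lstat(link)
	if lstatErr != nil {
		return statErr == nil, false, false, nil
	}
	isSymlink := fi.Mode()&os.ModeSymlink != 0
	return statErr == nil, true, isSymlink, nil
}

func main() {
	statOK, lstatOK, isSymlink, err := danglingDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Stat ok: %v, Lstat ok: %v, is symlink: %v\n", statOK, lstatOK, isSymlink)
}
```

A check based on os.Stat alone treats the dangling link as absent, which is why a following os.Symlink on the same path reports "file exists".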
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 15, 2026
ensureKerberosCache relies on Linux-specific CIFS/Kerberos semantics
(os.Chown with cruid as gid, symlink handling). Restrict the test to
Linux only to avoid failures on macOS and Windows CI runners.
Member

@andyzhangx andyzhangx left a comment

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 18, 2026
@andyzhangx andyzhangx removed the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Apr 20, 2026
- Use os.Lstat + os.Readlink to properly validate symlinks instead of os.Stat
- Accept any existing symlink pointing to a valid cache file (concurrent mount safe)
- Use atomic temp-symlink + os.Rename instead of remove-then-create (no gap)
- Remove TryAcquire lock that caused concurrent mounts to fail with Aborted
- Include rich context (link name, target, credUID) in all error messages
- Ensure consistent tab indentation (gofmt)
- Update tests: valid symlink from previous volume is now kept, concurrent test
  no longer expects codes.Aborted
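
The temp-symlink plus os.Rename pattern named in this commit message could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: replaceSymlink, concurrentDemo, and the file names are hypothetical, and creating the temporary link inside a directory from os.MkdirTemp is simply one way to get a unique name.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// replaceSymlink is a hypothetical helper. It creates the new link under a
// unique temporary path on the same filesystem, then renames it over the
// final path, so there is no window in which the final path is missing.
func replaceSymlink(target, link string) error {
	tmpDir, err := os.MkdirTemp(filepath.Dir(link), ".tmp-link-")
	if err != nil {
		return fmt.Errorf("temp dir for %s: %w", link, err)
	}
	defer os.RemoveAll(tmpDir)

	tmp := filepath.Join(tmpDir, "link")
	if err := os.Symlink(target, tmp); err != nil {
		return fmt.Errorf("symlink %s -> %s: %w", tmp, target, err)
	}
	// Rename acts on the link itself and replaces an existing link.
	if err := os.Rename(tmp, link); err != nil {
		return fmt.Errorf("rename %s -> %s: %w", tmp, link, err)
	}
	return nil
}

// concurrentDemo runs n goroutines that all replace the same link and
// reports whether the final link resolves to an existing file.
func concurrentDemo(n int) (bool, error) {
	dir, err := os.MkdirTemp("", "rename-demo-")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	link := filepath.Join(dir, "krb5cc_1000") // hypothetical cruid 1000
	errs := make([]error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		target := filepath.Join(dir, fmt.Sprintf("krb5cc_volume-%d", i))
		if err := os.WriteFile(target, nil, 0o600); err != nil {
			return false, err
		}
		wg.Add(1)
		go func(i int, target string) {
			defer wg.Done()
			errs[i] = replaceSymlink(target, link)
		}(i, target)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return false, err
		}
	}
	// os.Stat follows the link, so success means the target exists.
	if _, err := os.Stat(link); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	ok, err := concurrentDemo(8)
	fmt.Println("final link valid:", ok, "error:", err)
}
```

Which goroutine's target wins is not deterministic. The sketch only checks that no caller fails and that the final link resolves to an existing file.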
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver.go
Comment thread pkg/smb/nodeserver_test.go Outdated
- Update function comment to accurately describe symlink reuse behavior
- Only retry rename on os.IsExist; fail fast on permission/IO errors
- Update test comment to reflect filesystem-based concurrency (no CRUID lock)
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


Comment thread pkg/smb/nodeserver.go Outdated
Comment thread pkg/smb/nodeserver_test.go
- Remove pre-removal of krb5CacheFileName; rely on os.Rename atomic replace
- Skip ensureKerberosCache tests when non-root and uid != gid (Chown EPERM)
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


Comment thread pkg/smb/nodeserver_test.go
@andyzhangx andyzhangx requested a review from Copilot April 20, 2026 10:45
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


Comment thread pkg/smb/nodeserver.go Outdated
If krb5CacheFileName is unexpectedly a directory, os.Rename cannot
atomically replace it with a symlink. Explicitly remove directories
before attempting the temp-symlink + rename pattern.
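
The directory case in this commit message can be illustrated with a small sketch, assuming a Linux host. removeIfDir and dirDemo are hypothetical names invented for the example; the driver's actual handling may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeIfDir deletes path only when it is a real directory. os.Lstat does
// not follow symlinks, so a symlink that points at a directory is kept.
func removeIfDir(path string) error {
	fi, err := os.Lstat(path)
	if os.IsNotExist(err) {
		return nil
	}
	if err != nil {
		return err
	}
	if fi.IsDir() {
		return os.RemoveAll(path)
	}
	return nil
}

// dirDemo puts a directory where the link should be, shows that renaming a
// symlink over it fails, then clears the directory and retries. It returns
// whether the first rename failed and whether the path ends up a symlink.
func dirDemo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "dir-demo-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	link := filepath.Join(dir, "krb5cc_1000") // hypothetical cruid 1000
	if err := os.Mkdir(link, 0o700); err != nil {
		return false, false, err
	}
	tmp := filepath.Join(dir, "tmp-link")
	if err := os.Symlink(filepath.Join(dir, "cache"), tmp); err != nil {
		return false, false, err
	}

	firstFailed := os.Rename(tmp, link) != nil
	if err := removeIfDir(link); err != nil {
		return firstFailed, false, err
	}
	if err := os.Rename(tmp, link); err != nil {
		return firstFailed, false, err
	}
	fi, err := os.Lstat(link)
	if err != nil {
		return firstFailed, false, err
	}
	return firstFailed, fi.Mode()&os.ModeSymlink != 0, nil
}

func main() {
	failed, isLink, err := dirDemo()
	fmt.Println("first rename failed:", failed, "final is symlink:", isLink, "error:", err)
}
```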
@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 20, 2026
@k8s-ci-robot
Contributor

@kaidohTips: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-csi-driver-smb-e2e-windows-2022-hostprocess | 4142d2d | link | true | /test pull-csi-driver-smb-e2e-windows-2022-hostprocess |
| pull-csi-driver-smb-e2e-windows-2022 | 4142d2d | link | false | /test pull-csi-driver-smb-e2e-windows-2022 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants