nvmeof: Add GroupLock to coordinate stage and unstage operations and e2e tests #6210

gadididi wants to merge 3 commits into ceph:devel
Conversation
Pull request overview
Adds a cross-volume “group mutual exclusion” lock to the NVMe-oF node server to prevent NodeStageVolume and NodeUnstageVolume from running concurrently, avoiding races that can lead to premature NVMe controller disconnects.
Changes:
- Introduces a `stageUnstageLock` (`lock.GroupLock`) field on the NVMe-oF `NodeServer`.
- Wraps `NodeStageVolume()` with Group A acquire/release and `NodeUnstageVolume()` with Group B acquire/release.
- Adds the `internal/util/lock` import to support the new locking behavior.
Added a group mutual-exclusion lock to `NodeStageVolume()` and `NodeUnStageVolume()`, because these two operations cannot run together, although several calls of the same type can run concurrently. Signed-off-by: gadi-didi <gadi.didi@ibm.com>
/test ci/centos/mini-e2e/k8s-1.35/nvmeof
```go
// Acquire GroupLock - wrap the mounting + connection logic in a GroupLock
// to prevent staging and unstaging from happening at the same time,
// as they can interfere with each other.
// This allows multiple staging operations to run concurrently,
// and multiple unstaging operations to run concurrently,
// but prevents staging and unstaging from running at the same time.
ns.stageUnstageLock.AcquireGroupA()
defer ns.stageUnstageLock.ReleaseGroupA()
```
lock.GroupLock is explicitly documented as not guaranteeing fairness (potential starvation). Using it to guard long-running CSI RPC handlers means a steady stream of NodeStage calls could indefinitely delay NodeUnstage (or vice-versa), potentially stalling pod teardown. Consider switching to a fair group mutual-exclusion implementation (e.g., track waiting counts and alternate preference when the active group drains) and/or making acquisition context-aware so cancellations/timeouts don’t leave requests stuck waiting forever.
That's why I am testing with parallel goroutines doing delete/create: to verify there is no starvation.
It is not expected that this results in a problem. Starvation will only happen when a huge number of volumes are staged at the same time while other volumes are unstaged. Staging that causes unstaging to be blocked (or the other way around) is extremely unlikely to cause problematic delays.
I ran the "mixed test" with batches of 5 pods (deletion/creation). I could run it with a huge number, but of course that would increase the run time. Do you want me to do that, or leave it as is?
e2e/nvmeof_helper.go (outdated)
Test on Tentacle, Ceph v20. Signed-off-by: gadi-didi <gadi.didi@ibm.com>
Add e2e tests to validate the nvmeof NodeServer GroupLock implementation under concurrent NodeStage (Group A) and NodeUnstage (Group B) operations. The tests ensure no deadlock occurs when multiple PVCs and Pods are created and deleted simultaneously. A new helper file (nvmeof_helper.go) provides reusable functions for concurrent PVC/Pod operations with proper error tracking. Two test cases cover: 1) sequential concurrent batches (create all, then delete all); 2) mixed operations with a pre-created batch to guarantee continuous Group A/B switching. Signed-off-by: gadi-didi <gadi.didi@ibm.com>
```diff
 ROOK_VERSION=v1.18.4
 # Provide ceph image path
-ROOK_CEPH_CLUSTER_IMAGE=quay.io/ceph/ceph:v19.2.2
+ROOK_CEPH_CLUSTER_IMAGE=quay.io/ceph/ceph:v20
```
ROOK_CEPH_CLUSTER_IMAGE is now set to quay.io/ceph/ceph:v20 (major-only tag). This makes CI/e2e less reproducible because the image contents can change over time as new v20.x releases are pushed. Consider pinning to a specific v20.x.y tag (or documenting why floating within the major is required).
```diff
-ROOK_CEPH_CLUSTER_IMAGE=quay.io/ceph/ceph:v20
+ROOK_CEPH_CLUSTER_IMAGE=quay.io/ceph/ceph:v20.2.0
```
This is just for running the Jenkins e2e test with nvmeof; after review, this change will be reverted.
Describe what this PR does
This PR adds a GroupLock to prevent race conditions between stage and unstage operations that could lead to premature NVMe controller disconnects.
Added a group mutual-exclusion lock to `NodeStageVolume()` and `NodeUnStageVolume()`, because these two operations cannot run together, although several calls of the same type can run concurrently.

The Problem
Without coordination, a stage operation can connect to NVMe controllers while an unstage operation is simultaneously checking if it's safe to disconnect them. This creates a race:
Result: Stage fails with "device not found" errors.
The Solution
Add a GroupLock that allows:
- multiple NodeStageVolume calls to run concurrently,
- multiple NodeUnstageVolume calls to run concurrently,
- but never a stage and an unstage at the same time.
Three Levels of Locking
(after the PR: nvmeof: Add mount cache and locking for safe nvme disconnect will be merged)
The code will have three levels of locks working together:
- Level 1: volumeLocks (per-volume mutex). This already exists in the code.
- Level 2: stageUnstageLock (GroupLock). The current PR introduces it.
- Level 3: mountCache.mu (cache mutex). Added by the PR "nvmeof: Add mount cache and locking for safe nvme disconnect".
Lock Acquisition Order
Both NodeStageVolume and NodeUnstageVolume follow the same order:
There are unit tests for group locking here: https://github.com/ceph/ceph-csi/blob/devel/internal/util/lock/group_lock_test.go
Also, e2e tests were added.
Checklist:
- Commit titles and messages follow guidelines in the developer guide.
- Reviewed the developer guide on Submitting a Pull Request.
- Pending release notes updated with breaking and/or notable changes for the next major release.
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
/retest ci/centos/<job-name>: retest the <job-name> after an unrelated failure (please report the failure too!)