Skip to content

nvmeof: Add GroupLock to coordinate stage and unstage operations and e2e tests#6210

Open
gadididi wants to merge 3 commits intoceph:develfrom
gadididi:nvmeof/add_locking_nodeserver
Open

nvmeof: Add GroupLock to coordinate stage and unstage operations and e2e tests#6210
gadididi wants to merge 3 commits intoceph:develfrom
gadididi:nvmeof/add_locking_nodeserver

Conversation

@gadididi
Copy link
Copy Markdown
Contributor

@gadididi gadididi commented Mar 30, 2026

Describe what this PR does

This PR adds a GroupLock to prevent race conditions between stage and unstage operations that could lead to premature NVMe controller disconnects.
Added group mutual lock into NodeStageVolume() and NodeUnStageVolume() , because these 2 operations cannot run together. but few calls of same type can run together.

The Problem

Without coordination, a stage operation can connect to NVMe controllers while an unstage operation is simultaneously checking if it's safe to disconnect them. This creates a race:

  1. Stage operation connects to controllers
  2. Unstage operation checks cache, sees no devices
  3. Unstage operation disconnects controllers
  4. Stage operation tries to mount, but device is already disconnected

Result: Stage fails with "device not found" errors.

The Solution

Add a GroupLock that allows:

  • Multiple stage operations to run concurrently (Group A)
  • Multiple unstage operations to run concurrently (Group B)
  • But prevents stage and unstage from running at the same time

Three Levels of Locking

(after the PR: nvmeof: Add mount cache and locking for safe nvme disconnect will be merged)

The code will have three levels of locks working together:

Level 1: volumeLocks (per-volume mutex)
this already exists in the code.

  • Scope: Single volume
  • Purpose: Prevents concurrent operations on the same volume
  • Example: Two NodeStageVolume calls for vol-123 cannot run at the same time

Level 2: stageUnstageLock (GroupLock)
the current PR introduces it.

  • Scope: All volumes, operation type
  • Purpose: Prevents stage and unstage from interfering with each other
  • Allows: Multiple stages together, multiple unstages together
  • Prevents: Stage and unstage at the same time

Level 3: mountCache.mu (cache mutex)
this PR nvmeof: Add mount cache and locking for safe nvme disconnect

  • Scope: Cache data structure
  • Purpose: Protects cache reads and writes
  • Duration: Nanoseconds (just for map operations)

Lock Acquisition Order

Both NodeStageVolume and NodeUnstageVolume follow the same order:

  1. volumeLocks.Lock(volumeID) // Lock this specific volume
  2. stageUnstageLock.AcquireGroup() // Lock operation type (A or B)
  3. mountCache operations // Cache mutex acquired internally as needed
  4. stageUnstageLock.ReleaseGroup() // Release operation type
  5. volumeLocks.Release(volumeID) // Release volume lock

There are unit tests for group locking here: https://github.com/ceph/ceph-csi/blob/devel/internal/util/lock/group_lock_test.go

Also, e2e tests were added.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow
    guidelines in the developer
    guide
    .
  • Reviewed the developer guide on Submitting a Pull
    Request
  • Pending release
    notes

    updated with breaking and/or notable changes for the next major release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

@gadididi gadididi self-assigned this Mar 30, 2026
@gadididi gadididi added the component/nvme-of Issues and PRs related to NVMe-oF. label Mar 30, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a cross-volume “group mutual exclusion” lock to the NVMe-oF node server to prevent NodeStageVolume and NodeUnstageVolume from running concurrently, avoiding races that can lead to premature NVMe controller disconnects.

Changes:

  • Introduces a stageUnstageLock (lock.GroupLock) field on the NVMe-oF NodeServer.
  • Wraps NodeStageVolume() with Group A acquire/release and NodeUnstageVolume() with Group B acquire/release.
  • Adds the internal/util/lock import to support the new locking behavior.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

added group mutual lock into `NodeStageVolume()` and
`NodeUnStageVolume()` , because these 2 operations cannot
run together. but few calls of same type can run together.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
@gadididi
Copy link
Copy Markdown
Contributor Author

/test ci/centos/mini-e2e/k8s-1.35/nvmeof

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +173 to +180
// Acquire GroupLock - wrap the mounting + connection logic in a GroupLock
// to prevent staging and unstaging from happening at the same time,
// as they can interfere with each other.
// This allows multiple staging operations to run concurrently,
// and multiple unstaging operations to run concurrently,
// but prevents staging and unstaging from running at the same time.
ns.stageUnstageLock.AcquireGroupA()
defer ns.stageUnstageLock.ReleaseGroupA()
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lock.GroupLock is explicitly documented as not guaranteeing fairness (potential starvation). Using it to guard long-running CSI RPC handlers means a steady stream of NodeStage calls could indefinitely delay NodeUnstage (or vice-versa), potentially stalling pod teardown. Consider switching to a fair group mutual-exclusion implementation (e.g., track waiting counts and alternate preference when the active group drains) and/or making acquisition context-aware so cancellations/timeouts don’t leave requests stuck waiting forever.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's why I am testing parallel Go routines with delete\create , to verify there is starvation .

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not expected that this results in a problem. starvation will only happen when there are a huge number of volumes staged at the same time, while other volumes are unstaged. The staging that cause unstaging to be blocked (or the other way around), is extremely unlikely to cause problematic delays.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I run the "mixed test" with batches of 5 pods (deletion\creation) .
I could make it with huge number, but of course it will increase the time.. do you want me to do that, or leave it as is?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@gadididi
Copy link
Copy Markdown
Contributor Author

/test ci/centos/mini-e2e/k8s-1.35/nvmeof

Copy link
Copy Markdown
Member

@nixpanic nixpanic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

Comment on lines +173 to +180
// Acquire GroupLock - wrap the mounting + connection logic in a GroupLock
// to prevent staging and unstaging from happening at the same time,
// as they can interfere with each other.
// This allows multiple staging operations to run concurrently,
// and multiple unstaging operations to run concurrently,
// but prevents staging and unstaging from running at the same time.
ns.stageUnstageLock.AcquireGroupA()
defer ns.stageUnstageLock.ReleaseGroupA()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not expected that this results in a problem. starvation will only happen when there are a huge number of volumes staged at the same time, while other volumes are unstaged. The staging that cause unstaging to be blocked (or the other way around), is extremely unlikely to cause problematic delays.

@@ -0,0 +1,328 @@
/*
Copyright 2025 The Ceph-CSI Authors.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're in 2026 already!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every time.. 🤭

test on tentacle ceph v20

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
@gadididi gadididi force-pushed the nvmeof/add_locking_nodeserver branch from a214c1b to 6519c04 Compare April 1, 2026 09:44
@gadididi gadididi requested a review from Copilot April 1, 2026 09:44
@gadididi
Copy link
Copy Markdown
Contributor Author

gadididi commented Apr 1, 2026

/test ci/centos/mini-e2e/k8s-1.35/nvmeof

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

@gadididi gadididi force-pushed the nvmeof/add_locking_nodeserver branch from 6519c04 to 9094db7 Compare April 1, 2026 09:57
@gadididi gadididi requested a review from Copilot April 1, 2026 09:58
@gadididi
Copy link
Copy Markdown
Contributor Author

gadididi commented Apr 1, 2026

/test ci/centos/mini-e2e/k8s-1.35/nvmeof

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

@gadididi gadididi force-pushed the nvmeof/add_locking_nodeserver branch from 9094db7 to f2300bb Compare April 1, 2026 11:05
@gadididi gadididi requested a review from Copilot April 1, 2026 11:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

@gadididi
Copy link
Copy Markdown
Contributor Author

gadididi commented Apr 5, 2026

/test ci/centos/mini-e2e/k8s-1.35/nvmeof

Add e2e tests to validate nvmeof NodeServer GroupLock implementation
under concurrent NodeStage (Group A) and NodeUnstage (Group B)
operations. The tests ensure no deadlock occurs when multiple PVCs and
Pods are created and deleted simultaneously.

New helper file (nvmeof_helper.go) provides reusable functions for
concurrent PVC/Pod operations with proper error tracking.
Two test cases cover:
1) sequential concurrent batches (create all, then delete
all)

2) mixed operations with pre-created batch to guarantee
continuous Group A/B switching..

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
@gadididi gadididi force-pushed the nvmeof/add_locking_nodeserver branch from f2300bb to df0f0ec Compare April 5, 2026 06:23
@gadididi
Copy link
Copy Markdown
Contributor Author

gadididi commented Apr 5, 2026

/test ci/centos/mini-e2e/k8s-1.35/nvmeof

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

ROOK_VERSION=v1.18.4
# Provide ceph image path
ROOK_CEPH_CLUSTER_IMAGE=quay.io/ceph/ceph:v19.2.2
ROOK_CEPH_CLUSTER_IMAGE=quay.io/ceph/ceph:v20
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ROOK_CEPH_CLUSTER_IMAGE is now set to quay.io/ceph/ceph:v20 (major-only tag). This makes CI/e2e less reproducible because the image contents can change over time as new v20.x releases are pushed. Consider pinning to a specific v20.x.y tag (or documenting why floating within the major is required).

Suggested change
ROOK_CEPH_CLUSTER_IMAGE=quay.io/ceph/ceph:v20
ROOK_CEPH_CLUSTER_IMAGE=quay.io/ceph/ceph:v20.2.0

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is just for running the Jenkins e2e test with nvmeof .. after reviewing , this change will be reverted

@gadididi gadididi requested a review from nixpanic April 5, 2026 06:29
@gadididi gadididi marked this pull request as ready for review April 5, 2026 06:29
@gadididi gadididi requested a review from a team April 5, 2026 06:33
@gadididi gadididi changed the title nvmeof: Add GroupLock to coordinate stage and unstage operations nvmeof: Add GroupLock to coordinate stage and unstage operations and e2e tests Apr 5, 2026
@mergify mergify bot added the component/testing Additional test cases or CI work label Apr 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/nvme-of Issues and PRs related to NVMe-oF. component/testing Additional test cases or CI work

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants