Skip to content

fix(spdk): make engine frontend recovery async to prevent pod crash loop#539

Merged
derekbit merged 2 commits into
longhorn:masterfrom
derekbit:issue-13185
May 25, 2026
Merged

fix(spdk): make engine frontend recovery async to prevent pod crash loop#539
derekbit merged 2 commits into
longhorn:masterfrom
derekbit:issue-13185

Conversation

@derekbit
Copy link
Copy Markdown
Member

Which issue(s) this PR fixes:

Issue longhorn/longhorn#13185

What this PR does / why we need it:

When an instance-manager pod restarts with persisted engine frontend records referencing unreachable NVMe-TCP targets, the synchronous recoverEngineFrontends() blocks gRPC startup, exceeding the liveness probe threshold and causing an infinite pod create/delete loop.

The fix runs recovery as a goroutine so gRPC serves immediately. A TCP dial pre-check (5s timeout) in the recovery path detects unreachable targets early, avoiding the full 60s nvme-connect retry loop. Since recovery now runs concurrently with incoming gRPC requests, EngineFrontendCreate may race with the recovery goroutine for the same volume. To handle this safely, Create checks whether a conflicting map entry is still in Pending state (i.e. mid-recovery) and if so marks it Terminating and removes it from the map — letting the recovery goroutine perform the actual resource teardown sequentially after RecoverFromHost returns, which avoids concurrent access to the NVMe initiator. The recovery loop itself verifies map ownership (pointer equality) before deleting entries in both the failure and success paths, so it never accidentally removes a new frontend that was registered by a concurrent Create under the same name or volume.

Special notes for your reviewer:

Additional documentation or context

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses instance-manager pod crash loops caused by synchronous engine-frontend recovery blocking gRPC startup when persisted NVMe-TCP targets are unreachable. It makes recovery asynchronous and adds additional safeguards to reduce long connect retry delays and handle races between recovery and new EngineFrontendCreate calls.

Changes:

  • Run recoverEngineFrontends() asynchronously during server startup to allow gRPC to start serving immediately.
  • Add race-safety checks in recovery and EngineFrontendCreate to avoid deleting/replacing newer frontends when names/volumes collide during concurrent recovery.
  • Add a TCP reachability pre-check (5s dial timeout) in the recovery reconnect path to avoid lengthy NVMe reconnect retry loops for unreachable targets.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
pkg/spdk/server.go Starts engine frontend recovery asynchronously during server initialization.
pkg/spdk/server_enginefrontend.go Adjusts create logic to evict Pending (recovering) frontends safely on name/volume conflicts.
pkg/spdk/enginefrontend.go Adds recovery reachability pre-check and prevents deferred recovery state updates from clobbering concurrent state changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/spdk/server.go Outdated
Comment thread pkg/spdk/server_enginefrontend.go
Comment thread pkg/spdk/enginefrontend.go Outdated
Comment thread pkg/spdk/enginefrontend.go
@derekbit derekbit force-pushed the issue-13185 branch 2 times, most recently from f48d136 to deeb6e9 Compare May 24, 2026 08:28
@derekbit derekbit requested a review from Copilot May 24, 2026 08:30
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Comment thread pkg/spdk/server_enginefrontend.go Outdated
Comment thread pkg/spdk/server.go Outdated
Comment thread pkg/spdk/server.go Outdated
Comment thread pkg/spdk/server.go Outdated
Comment thread pkg/spdk/server.go Outdated
Comment thread pkg/spdk/enginefrontend.go
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment thread pkg/spdk/server_enginefrontend.go
Comment thread pkg/spdk/server.go Outdated
Comment thread pkg/spdk/server_verify_test.go
Comment thread pkg/spdk/server.go Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread pkg/spdk/server_enginefrontend.go Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

derekbit added 2 commits May 24, 2026 23:09
…ation support

Make recoverEngineFrontends() run in a separate goroutine so that gRPC
servers can start immediately. This prevents the liveness probe from
killing the pod when persisted targets are unreachable.

Add per-volume host locks (volumeHostLocks) to serialize host-level
NVMe/dm operations for the same volume. Recovery and all frontend
lifecycle RPCs that mutate host NVMe controllers or dm devices (create,
delete, suspend, resume, expand, switchover) acquire the per-volume lock
so that these operations cannot overlap on one volume.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
When EngineFrontendCreate encounters an existing frontend in Pending
state (still being recovered asynchronously), evict it instead of
returning AlreadyExists. This allows the new Create to proceed
immediately without waiting for the potentially slow recovery to
complete or fail.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
@derekbit
Copy link
Copy Markdown
Member Author

cc @davidcheng0922 for review.

@derekbit
Copy link
Copy Markdown
Member Author

@mergify backport v1.12.x

@mergify
Copy link
Copy Markdown

mergify Bot commented May 25, 2026

backport v1.12.x

✅ Backports have been created

Details

Copy link
Copy Markdown
Contributor

@davidcheng0922 davidcheng0922 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekbit derekbit merged commit dfac075 into longhorn:master May 25, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants