fix(spdk): make engine frontend recovery async to prevent pod crash loop by derekbit · Pull Request #539 · longhorn/longhorn-spdk-engine

derekbit · 2026-05-24T07:13:45Z

Which issue(s) this PR fixes:

What this PR does / why we need it:

When an instance-manager pod restarts with persisted engine frontend records referencing unreachable NVMe-TCP targets, the synchronous recoverEngineFrontends() blocks gRPC startup, exceeding the liveness probe threshold and causing an infinite pod create/delete loop.

The fix runs recovery in a goroutine so gRPC serves immediately. A TCP dial pre-check (5s timeout) in the recovery path detects unreachable targets early, avoiding the full 60s nvme-connect retry loop.

Since recovery now runs concurrently with incoming gRPC requests, EngineFrontendCreate may race with the recovery goroutine for the same volume. To handle this safely, Create checks whether a conflicting map entry is still in Pending state (i.e. mid-recovery); if so, it marks the entry Terminating and removes it from the map. The recovery goroutine then performs the actual resource teardown sequentially after RecoverFromHost returns, which avoids concurrent access to the NVMe initiator.

The recovery loop itself verifies map ownership (pointer equality) before deleting entries in both the failure and success paths, so it never accidentally removes a new frontend that a concurrent Create registered under the same name or volume.
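The ownership check described above can be illustrated with a minimal sketch. It is not the PR's code: the `Server` and `EngineFrontend` types, their fields, and the `removeIfOwned` helper are hypothetical stand-ins, and the only idea taken from the description is comparing the map's current pointer with the one the recovery loop holds before deleting.

```go
package main

import (
	"fmt"
	"sync"
)

// EngineFrontend is a hypothetical stand-in for the real frontend record.
type EngineFrontend struct {
	Name       string
	VolumeName string
}

// Server holds the frontend map; the field names are assumptions.
type Server struct {
	mu        sync.Mutex
	frontends map[string]*EngineFrontend
}

// removeIfOwned deletes the entry under name only when the map still points
// at the exact object the caller holds. It reports whether it deleted it.
func (s *Server) removeIfOwned(name string, ef *EngineFrontend) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.frontends[name]; ok && cur == ef {
		delete(s.frontends, name)
		return true
	}
	return false
}

func main() {
	recovering := &EngineFrontend{Name: "vol-1-ef", VolumeName: "vol-1"}
	s := &Server{frontends: map[string]*EngineFrontend{"vol-1-ef": recovering}}

	// A concurrent Create replaces the entry under the same name.
	created := &EngineFrontend{Name: "vol-1-ef", VolumeName: "vol-1"}
	s.mu.Lock()
	s.frontends["vol-1-ef"] = created
	s.mu.Unlock()

	// Recovery finishes later and tries to clean up its own record.
	fmt.Println(s.removeIfOwned("vol-1-ef", recovering)) // false: the entry now belongs to Create
	fmt.Println(s.removeIfOwned("vol-1-ef", created))    // true: the owner removes its own entry
}
```

Comparing pointers rather than names is what makes the check safe here: two records can share a name or volume, but only one of them is the object currently stored in the map.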

Special notes for your reviewer:

Additional documentation or context

Copilot

Pull request overview

This PR addresses instance-manager pod crash loops caused by synchronous engine-frontend recovery blocking gRPC startup when persisted NVMe-TCP targets are unreachable. It makes recovery asynchronous and adds additional safeguards to reduce long connect retry delays and handle races between recovery and new EngineFrontendCreate calls.
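As a rough illustration of the asynchronous startup pattern, the sketch below launches a recovery function in a goroutine and returns at once, so the caller can go on to serve requests. The `startRecoveryAsync` helper and the `done` channel are hypothetical and do not come from the PR; the real `recoverEngineFrontends()` is represented by an arbitrary function value.

```go
package main

import "fmt"

// startRecoveryAsync runs recoverFn in the background and returns a channel
// that is closed once recovery has finished. Hypothetical helper.
func startRecoveryAsync(recoverFn func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		recoverFn()
	}()
	return done
}

func main() {
	// release simulates a recovery that is stuck on an unreachable target.
	release := make(chan struct{})
	done := startRecoveryAsync(func() { <-release })

	// Startup continues without waiting for recovery.
	select {
	case <-done:
		fmt.Println("recovery already finished")
	default:
		fmt.Println("serving while recovery is still running")
	}

	close(release) // the target becomes reachable or the attempt gives up
	<-done
	fmt.Println("recovery finished")
}
```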

Changes:

- Run recoverEngineFrontends() asynchronously during server startup to allow gRPC to start serving immediately.
- Add race-safety checks in recovery and EngineFrontendCreate to avoid deleting/replacing newer frontends when names/volumes collide during concurrent recovery.
- Add a TCP reachability pre-check (5s dial timeout) in the recovery reconnect path to avoid lengthy NVMe reconnect retry loops for unreachable targets.
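A TCP reachability pre-check of the kind listed above could look roughly like the following sketch. It uses the standard library's `net.DialTimeout`; the `checkTargetReachable` and `listenLocal` helpers and the `dialTimeout` constant are hypothetical names, and the loopback listener only stands in for an NVMe-TCP target.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"time"
)

// dialTimeout mirrors the 5s pre-check timeout mentioned in the PR.
const dialTimeout = 5 * time.Second

// checkTargetReachable opens a plain TCP connection to the target with a
// bounded timeout and closes it right away. Hypothetical helper.
func checkTargetReachable(ip string, port int, timeout time.Duration) error {
	addr := net.JoinHostPort(ip, strconv.Itoa(port))
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return fmt.Errorf("target %s is unreachable: %w", addr, err)
	}
	return conn.Close()
}

// listenLocal opens a loopback listener that stands in for a reachable
// target and returns its port plus a function that shuts it down.
func listenLocal() (port int, stop func()) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	return ln.Addr().(*net.TCPAddr).Port, func() { _ = ln.Close() }
}

func main() {
	port, stop := listenLocal()
	fmt.Println("while listening:", checkTargetReachable("127.0.0.1", port, dialTimeout))

	stop()
	fmt.Println("after close:", checkTargetReachable("127.0.0.1", port, dialTimeout))
}
```

A caller in the recovery path would skip the expensive reconnect attempt when this check returns an error, which bounds the wait to the dial timeout instead of the full retry loop.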

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pkg/spdk/server.go | Starts engine frontend recovery asynchronously during server initialization. |
| pkg/spdk/server_enginefrontend.go | Adjusts create logic to evict Pending (recovering) frontends safely on name/volume conflicts. |
| pkg/spdk/enginefrontend.go | Adds recovery reachability pre-check and prevents deferred recovery state updates from clobbering concurrent state changes. |


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

…ation support

Make recoverEngineFrontends() run in a separate goroutine so that gRPC servers can start immediately. This prevents the liveness probe from killing the pod when persisted targets are unreachable.

Add per-volume host locks (volumeHostLocks) to serialize host-level NVMe/dm operations for the same volume. Recovery and all frontend lifecycle RPCs that mutate host NVMe controllers or dm devices (create, delete, suspend, resume, expand, switchover) acquire the per-volume lock so that these operations cannot overlap on one volume.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
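The per-volume locking idea from this commit message can be sketched as a map of mutexes keyed by volume name. Only the name volumeHostLocks appears in the source; the `volumeLocks` type, its methods, and the `maxOverlap` helper below are hypothetical and exist only to show that work on one volume is serialized.

```go
package main

import (
	"fmt"
	"sync"
)

// volumeLocks hands out one mutex per volume name. Hypothetical type;
// entries are never removed in this sketch.
type volumeLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{locks: map[string]*sync.Mutex{}}
}

// lock blocks until the caller holds the mutex for volume and returns the
// function that releases it.
func (v *volumeLocks) lock(volume string) (unlock func()) {
	v.mu.Lock()
	l, ok := v.locks[volume]
	if !ok {
		l = &sync.Mutex{}
		v.locks[volume] = l
	}
	v.mu.Unlock()

	l.Lock()
	return l.Unlock
}

// maxOverlap runs n workers that each take the lock for volume and returns
// the largest number of workers observed inside the critical section at once.
func maxOverlap(v *volumeLocks, volume string, n int) int {
	var wg sync.WaitGroup
	var statMu sync.Mutex
	inside, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := v.lock(volume)
			defer unlock()

			statMu.Lock()
			inside++
			if inside > peak {
				peak = inside
			}
			statMu.Unlock()

			// Host-level NVMe/dm work for this volume would run here.

			statMu.Lock()
			inside--
			statMu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	v := newVolumeLocks()
	fmt.Println(maxOverlap(v, "vol-1", 8)) // 1: operations on one volume never overlap
}
```

Keying the lock by volume keeps unrelated volumes independent: a slow recovery on one volume does not block lifecycle RPCs on another.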

When EngineFrontendCreate encounters an existing frontend in Pending state (still being recovered asynchronously), evict it instead of returning AlreadyExists. This allows the new Create to proceed immediately without waiting for the potentially slow recovery to complete or fail.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
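The eviction rule in this commit message might be sketched as follows. The types, state names, `ErrAlreadyExists` value, and `create` method are hypothetical simplifications, not the PR's code; in particular, the new frontend is placed directly in a running state here, which is an assumption made to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type State string

const (
	StatePending     State = "pending"
	StateRunning     State = "running"
	StateTerminating State = "terminating"
)

// EngineFrontend is a hypothetical stand-in for the real frontend record.
type EngineFrontend struct {
	mu    sync.Mutex
	Name  string
	State State
}

var ErrAlreadyExists = errors.New("engine frontend already exists")

type Server struct {
	mu        sync.Mutex
	frontends map[string]*EngineFrontend
}

// create registers a new frontend. An existing entry that is still Pending
// (mid-recovery) is marked Terminating and dropped from the map; any other
// existing entry yields ErrAlreadyExists.
func (s *Server) create(name string) (*EngineFrontend, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if old, ok := s.frontends[name]; ok {
		old.mu.Lock()
		pending := old.State == StatePending
		if pending {
			// Only flag the record; teardown is left to the recovery goroutine.
			old.State = StateTerminating
		}
		old.mu.Unlock()
		if !pending {
			return nil, ErrAlreadyExists
		}
		delete(s.frontends, name)
	}

	ef := &EngineFrontend{Name: name, State: StateRunning}
	s.frontends[name] = ef
	return ef, nil
}

func main() {
	recovering := &EngineFrontend{Name: "vol-1-ef", State: StatePending}
	s := &Server{frontends: map[string]*EngineFrontend{"vol-1-ef": recovering}}

	ef, err := s.create("vol-1-ef")
	fmt.Println(err, ef != recovering, recovering.State) // <nil> true terminating

	_, err = s.create("vol-1-ef")
	fmt.Println(err) // engine frontend already exists
}
```

Marking the evicted record instead of tearing it down inside Create leaves a single goroutine responsible for releasing its host resources.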

derekbit · 2026-05-25T00:54:50Z

https://ci.longhorn.io/job/private/job/longhorn-e2e-test/7964/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/7965/

https://ci.longhorn.io/job/private/job/longhorn-e2e-test/7966/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/7967

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/10894/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/10895/

derekbit · 2026-05-25T03:24:39Z

cc @davidcheng0922 for review.

derekbit · 2026-05-25T03:24:48Z

@mergify backport v1.12.x

mergify · 2026-05-25T03:24:53Z

backport v1.12.x

✅ Backports have been created


#543 fix(spdk): make engine frontend recovery async to prevent pod crash loop (backport #539) has been created for branch v1.12.x

davidcheng0922

LGTM

derekbit self-assigned this May 24, 2026

derekbit requested review from Copilot and davidcheng0922 May 24, 2026 07:18

Copilot started reviewing on behalf of derekbit May 24, 2026 07:18