fix: prevent resource leaks in backup, restore, and backing image cleanup paths (backport #531)#535
Merged
Merged
Conversation
…anup paths Introduce the nvmeInitiator interface and testable I/O hook variables (openFile, newNVMeTCPInitiator, etc.) to enable unit testing of error paths that interact with NVMe devices and SPDK subsystems. Backup: - Store portAllocator in Backup and release the allocated port on NewExecutor failure to prevent port leaks. - Capture replicaAddress at construction time so ReplicaBackupStatus does not dereference b.replica after releaseHeavyResourcesLocked nils it. - Convert CloseSnapshot to best-effort cleanup with nil guards so all resources (device file, NVMe initiator, exposed bdev, port) are released even when an earlier step fails. - Add releaseHeavyResourcesLocked to nil out heavy references (fragmap, initiator, executor, replica, port), making cleanup idempotent. - Add OpenSnapshot error-path cleanup via deferred rollback that closes the device, stops the initiator, and unexposes the bdev. - Guard UpdateBackupStatus against repeated terminal transitions and fire an onTerminal callback exactly once, recording a terminalAt timestamp synchronously under the backup lock. - Add timestamp-based backup retention pruning (pruneRetainedBackupsLocked) that sorts terminal entries by terminalAt and evicts the oldest beyond configurable retain limits (Complete: 100, Error: 100), releasing heavy resources before deletion. - Wire the onTerminal callback in ReplicaBackupCreate to trigger pruning, and release resources on synchronous BackupCreate failure. Restore: - Convert CloseVolumeDev to best-effort cleanup and fix the inverted IsExposed guard that attempted to unexpose an already-unexposed replica; only clear IsExposed on successful unexpose. - Add OpenVolumeDev error-path cleanup via deferred rollback that closes the device, stops the initiator, and unexposes the bdev. - Move the completeBackupRestore goroutine launch after successful restore initiation to avoid goroutine leak on error. Backing image: - Extract direct calls to initiator.NewInitiator, os.OpenFile, commonnet.GetIPForPod, and SPDK client methods into package-level hook variables for testability. - Reorder defer blocks in prepareBackingImageSnapshot and prepareFromSync so the NVMe initiator is stopped before (after in defer order) the file handle is closed, matching the correct teardown sequence. Longhorn 13114 Signed-off-by: Derek Su <derek.su@suse.com> (cherry picked from commit 9880d7b)
Fix backup cleanup state tracking so partially cleaned snapshot resources do not get treated as fully released. Keep backup resource fields only when cleanup steps succeed, delay terminal marking until cleanup fully completes, and prevent retained-backup pruning from evicting entries that still hold active snapshot resources or an unreleased backup port. For BackupCreate failures, keep the backup in backupMap when rollback leaves active resources behind, mark it as a terminal error for observability, and remove it only when cleanup is complete. Longhorn 13114 Signed-off-by: Derek Su <derek.su@suse.com> (cherry picked from commit 3d16ac9)
Longhorn 13114 Signed-off-by: Derek Su <derek.su@suse.com> (cherry picked from commit 63f3502)
Signed-off-by: Derek Su <derek.su@suse.com> (cherry picked from commit 6c37fa2)
derekbit
approved these changes
May 18, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue(s) this PR fixes:
Issue longhorn/longhorn#13114
What this PR does / why we need it:
Harden the backup and restore open paths so partial setup does not leave NVMe initiators or exposed bdevs behind.
Special notes for your reviewer:
Additional documentation or context
This is an automatic backport of pull request #531 done by [Mergify](https://mergify.com).