Skip to content

fix: prevent resource leaks in backup, restore, and backing image cleanup paths (backport #531)#535

Merged
derekbit merged 4 commits into
v1.12.xfrom
mergify/bp/v1.12.x/pr-531
May 18, 2026
Merged

fix: prevent resource leaks in backup, restore, and backing image cleanup paths (backport #531)#535
derekbit merged 4 commits into
v1.12.xfrom
mergify/bp/v1.12.x/pr-531

Conversation

@mergify
Copy link
Copy Markdown

@mergify mergify Bot commented May 18, 2026

Which issue(s) this PR fixes:

Issue longhorn/longhorn#13114

What this PR does / why we need it:

Harden the backup and restore open paths so partial setup does not leave NVMe initiators or exposed bdevs behind.

Special notes for your reviewer:

Additional documentation or context


This is an automatic backport of pull request #531 done by [Mergify](https://mergify.com).

derekbit added 4 commits May 18, 2026 10:46
…anup paths

Introduce the nvmeInitiator interface and testable I/O hook variables
(openFile, newNVMeTCPInitiator, etc.) to enable unit testing of error
paths that interact with NVMe devices and SPDK subsystems.

Backup:
- Store portAllocator in Backup and release the allocated port on
  NewExecutor failure to prevent port leaks.
- Capture replicaAddress at construction time so ReplicaBackupStatus
  does not dereference b.replica after releaseHeavyResourcesLocked
  nils it.
- Convert CloseSnapshot to best-effort cleanup with nil guards so all
  resources (device file, NVMe initiator, exposed bdev, port) are
  released even when an earlier step fails.
- Add releaseHeavyResourcesLocked to nil out heavy references
  (fragmap, initiator, executor, replica, port), making cleanup
  idempotent.
- Add OpenSnapshot error-path cleanup via deferred rollback that
  closes the device, stops the initiator, and unexposes the bdev.
- Guard UpdateBackupStatus against repeated terminal transitions and
  fire an onTerminal callback exactly once, recording a terminalAt
  timestamp synchronously under the backup lock.
- Add timestamp-based backup retention pruning
  (pruneRetainedBackupsLocked) that sorts terminal entries by
  terminalAt and evicts the oldest beyond configurable retain limits
  (Complete: 100, Error: 100), releasing heavy resources before deletion.
- Wire the onTerminal callback in ReplicaBackupCreate to trigger
  pruning, and release resources on synchronous BackupCreate failure.

Restore:
- Convert CloseVolumeDev to best-effort cleanup and fix the inverted
  IsExposed guard that attempted to unexpose an already-unexposed
  replica; only clear IsExposed on successful unexpose.
- Add OpenVolumeDev error-path cleanup via deferred rollback that
  closes the device, stops the initiator, and unexposes the bdev.
- Move the completeBackupRestore goroutine launch after successful
  restore initiation to avoid goroutine leak on error.

Backing image:
- Extract direct calls to initiator.NewInitiator, os.OpenFile,
  commonnet.GetIPForPod, and SPDK client methods into package-level
  hook variables for testability.
- Reorder defer blocks in prepareBackingImageSnapshot and
  prepareFromSync so the NVMe initiator is stopped before (after in
  defer order) the file handle is closed, matching the correct
  teardown sequence.

Longhorn 13114

Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit 9880d7b)
Fix backup cleanup state tracking so partially cleaned snapshot resources
do not get treated as fully released.

Keep backup resource fields only when cleanup steps succeed, delay terminal
marking until cleanup fully completes, and prevent retained-backup pruning
from evicting entries that still hold active snapshot resources or an
unreleased backup port.

For BackupCreate failures, keep the backup in backupMap when rollback
leaves active resources behind, mark it as a terminal error for
observability, and remove it only when cleanup is complete.

Longhorn 13114

Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit 3d16ac9)
Longhorn 13114

Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit 63f3502)
Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit 6c37fa2)
@derekbit derekbit merged commit eecd886 into v1.12.x May 18, 2026
7 checks passed
@derekbit derekbit deleted the mergify/bp/v1.12.x/pr-531 branch May 18, 2026 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant