
feat(core): offline rebuild for unpublished degraded volumes #1108

Open
yugchaudhari wants to merge 1 commit into openebs:develop from yugchaudhari:feat/offline-rebuild-iter2

@yugchaudhari
Contributor

Iteration 2 of offline volume rebuild (openebs/openebs#4208). Builds on the detection-only reconciler from #1103.

When an unpublished volume goes degraded, the reconciler now creates a temporary unshared nexus (target_config with protocol: None) after a grace period. That makes the volume look published to the existing HotSpareReconciler, which rebuilds the faulted replicas. Once the volume is Online again, the reconciler tears the nexus down and the volume returns to unpublished.

Key point: no rebuild logic is reimplemented; we lean on the HotSpareReconciler. This reconciler only stands up and tears down the nexus.
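
The create/teardown decision described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Health`, `Target`, `VolumeView`, `Action` and `next_action` are hypothetical names, and the choice to do nothing at all when the feature is disabled is an assumption.

```rust
/// Hypothetical volume health, reduced to the two states the PR mentions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Online,
    Degraded,
}

/// Hypothetical view of the volume target.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    /// Unpublished: no nexus.
    None,
    /// Published by a user over some protocol.
    Published,
    /// Temporary unshared nexus (protocol: None) created for offline rebuild.
    TempUnshared,
}

/// Hypothetical snapshot the reconciler looks at on each pass.
#[derive(Debug, Clone, Copy)]
pub struct VolumeView {
    pub health: Health,
    pub target: Target,
    /// Assumption: models the "never-published" precondition.
    pub ever_published: bool,
    /// Whether the grace period has already run out.
    pub grace_elapsed: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Nothing,
    WaitForGrace,
    CreateTempNexus,
    DestroyTempNexus,
}

/// Decide what to do next. The rebuild itself is never triggered here;
/// it is left to the existing hot-spare logic once a nexus exists.
pub fn next_action(enabled: bool, v: &VolumeView) -> Action {
    if !enabled {
        // Assumption: a disabled feature means the reconciler is a no-op.
        return Action::Nothing;
    }
    match (v.health, v.target) {
        // Healed while on the temporary nexus: tear it down.
        (Health::Online, Target::TempUnshared) => Action::DestroyTempNexus,
        // Never published: precondition not met, leave the volume alone.
        (Health::Degraded, Target::None) if !v.ever_published => Action::Nothing,
        // Degraded and unpublished, grace period over: stand up the nexus.
        (Health::Degraded, Target::None) if v.grace_elapsed => Action::CreateTempNexus,
        // Degraded and unpublished, still inside the grace period.
        (Health::Degraded, Target::None) => Action::WaitForGrace,
        // Everything else (published volumes, rebuild in progress, healthy).
        _ => Action::Nothing,
    }
}
```

Keeping the decision a pure function of a snapshot makes the three BDD scenarios easy to mirror in unit tests; the real reconciler would presumably derive the same inputs from the volume spec and state.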

New config: --offline-rebuild-grace-period (default 10m), behind the existing --offline-rebuild-enabled flag.
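
To illustrate how the two settings could interact, here is a small sketch of a grace-period check. The struct, enum and function names are hypothetical and not taken from the PR; only the 10-minute default comes from the description above.

```rust
use std::time::Duration;

/// Hypothetical holder for the two settings; field names are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct OfflineRebuildConfig {
    /// Mirrors --offline-rebuild-enabled.
    pub enabled: bool,
    /// Mirrors --offline-rebuild-grace-period.
    pub grace_period: Duration,
}

impl OfflineRebuildConfig {
    /// Default grace period stated in the PR description: 10 minutes.
    pub const DEFAULT_GRACE: Duration = Duration::from_secs(10 * 60);
}

/// Outcome of checking how long a volume has been degraded.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GraceCheck {
    /// Feature flag is off; the grace period is irrelevant.
    Disabled,
    /// Still waiting; carries the time remaining.
    Waiting(Duration),
    /// Grace period is over; an offline rebuild may start.
    Elapsed,
}

/// Compare the time spent degraded against the configured grace period.
pub fn check_grace(cfg: &OfflineRebuildConfig, degraded_for: Duration) -> GraceCheck {
    if !cfg.enabled {
        return GraceCheck::Disabled;
    }
    // checked_sub returns None when degraded_for exceeds the grace period.
    match cfg.grace_period.checked_sub(degraded_for) {
        Some(remaining) if !remaining.is_zero() => GraceCheck::Waiting(remaining),
        _ => GraceCheck::Elapsed,
    }
}
```

With the default, a volume that has been degraded for one minute would yield `Waiting` with nine minutes remaining, and anything at or past ten minutes yields `Elapsed`.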

BDD tests cover the happy path, the feature-disabled case, and the never-published precondition.

Create a temporary unshared nexus for degraded unpublished volumes so
the existing HotSpareReconciler can rebuild faulted replicas. Tear down
the nexus once the volume returns to Online.

Add configurable grace period (--offline-rebuild-grace-period, default
10m) and BDD tests covering happy path, feature-disabled, and
never-published precondition.

Signed-off-by: yugchaudhari <[email protected]>
@yugchaudhari force-pushed the feat/offline-rebuild-iter2 branch from a4da3e4 to d8cc119 on May 28, 2026 16:25
@yugchaudhari
Contributor Author

Tested this on a 3-node cluster, both the BDD suite and a manual run.

BDD tests pass (cargo test -p agents --test core offline_rebuild): happy path, feature-disabled, and the never-published precondition.

For the manual run I created a 2-replica volume, published and then unpublished it (to establish the health info), and killed the io-engine node hosting one of the replicas. The volume state over time:

Degraded  target=None                ← node killed, volume degraded
Degraded  target=io-engine-3/none    ← offline rebuild created the temp unshared nexus (protocol=none)
Online    target=None                ← rebuild done, nexus torn down, back to unpublished
Online    target=None                ← stays healed

And the reconciler's own logs line up with that:

16:39:39  DEBUG  Offline rebuild waiting for grace period, remaining: 7.97s
16:39:46  DEBUG  Offline rebuild waiting for grace period, remaining: 956ms
16:39:47  INFO   Initiating offline rebuild: creating non-shared nexus
16:39:47  INFO   Offline rebuild nexus created; HotSpareReconciler will handle the rebuild
16:39:50  INFO   Offline rebuild complete; tearing down temporary nexus
16:39:50  INFO   Temporary nexus destroyed; volume returned to unpublished state

So the full lifecycle works end to end: degraded unpublished volume → temp nexus → HotSpare rebuild → teardown → back to Online/unpublished.
