feat(core): offline rebuild for unpublished degraded volumes #1108
Open
yugchaudhari wants to merge 1 commit into
Create a temporary unshared nexus for degraded unpublished volumes so the existing HotSpareReconciler can rebuild faulted replicas. Tear down the nexus once the volume returns to Online.

Add configurable grace period (--offline-rebuild-grace-period, default 10m) and BDD tests covering happy path, feature-disabled, and never-published precondition.

Signed-off-by: yugchaudhari <[email protected]>
a4da3e4 to d8cc119
Tested this on a 3-node cluster, with both the BDD suite and a manual run. The BDD tests pass.

For the manual run I created a 2-replica volume, published then unpublished it (to establish the health info), and killed the io-engine node hosting one of the replicas.

The volume state over time:

And the reconciler's own logs line up with that:

So the full lifecycle works end to end: degraded unpublished volume → temp nexus → HotSpare rebuild → teardown → back to Online/unpublished.
This is iteration 2 of offline volume rebuild (openebs/openebs#4208). It builds on the detection-only reconciler from #1103.
When an unpublished volume goes degraded, the reconciler now creates a temporary unshared nexus (`target_config` with `protocol: None`) after a grace period. That makes the volume look published to the existing HotSpareReconciler, which rebuilds the faulted replicas. Once the volume is Online again, the reconciler tears the nexus down and the volume returns to unpublished.

Key point: no rebuild logic is reimplemented; we lean on HotSpare and only stand up and tear down the nexus.
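
The lifecycle described above can be sketched as a single decision function. This is only an illustration, not the code in this PR: `VolumeView`, `Action`, and `decide` are hypothetical names, and the sketch assumes that never-published volumes are skipped and that the grace period is measured from the moment the volume was first seen degraded.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified volume status (not the control-plane type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VolumeStatus {
    Online,
    Degraded,
}

/// Hypothetical view of a volume as the offline-rebuild reconciler might see it.
#[derive(Debug, Clone, Copy)]
struct VolumeView {
    status: VolumeStatus,
    /// True when the user has published the volume (a real target exists).
    published: bool,
    /// True when the volume has been published at least once (assumed precondition).
    ever_published: bool,
    /// True when the reconciler itself created a temporary unshared nexus.
    has_temp_nexus: bool,
    /// When the volume was first observed degraded, if it is being tracked.
    degraded_since: Option<Instant>,
}

/// What the reconciler should do on this pass.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    None,
    Wait,
    CreateTempNexus,
    TeardownTempNexus,
}

fn decide(enabled: bool, grace: Duration, now: Instant, v: &VolumeView) -> Action {
    // Feature off, or the user published the volume: HotSpare already covers it.
    if !enabled || v.published {
        return Action::None;
    }
    if v.has_temp_nexus {
        // The temporary nexus exists only for the rebuild; remove it once Online.
        return if v.status == VolumeStatus::Online {
            Action::TeardownTempNexus
        } else {
            Action::None // rebuild is left to the existing hot-spare logic
        };
    }
    if v.status != VolumeStatus::Degraded || !v.ever_published {
        return Action::None;
    }
    match v.degraded_since {
        // Grace period elapsed: stand up the temporary nexus.
        Some(t) if now.saturating_duration_since(t) >= grace => Action::CreateTempNexus,
        // Still inside the grace period, or not yet tracked.
        _ => Action::Wait,
    }
}
```

Keeping the decision free of side effects makes the three BDD scenarios (happy path, feature-disabled, never-published) easy to mirror in unit tests.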
New config: `--offline-rebuild-grace-period` (default 10m), behind the existing `--offline-rebuild-enabled` flag.

BDD tests cover the happy path, feature-disabled, and never-published precondition.
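
For reference, a value such as `10m` maps to a duration roughly as in the sketch below. This is an assumption-laden illustration: `parse_grace_period` is a hypothetical helper, it accepts only whole numbers with an `s`, `m`, or `h` suffix, and the actual flag is likely parsed by an existing duration parser in the control plane rather than by hand.

```rust
use std::time::Duration;

/// Hypothetical helper: parse values like "30s", "10m", or "1h" into a Duration.
/// Only whole numbers with a single-letter unit suffix are accepted.
fn parse_grace_period(s: &str) -> Result<Duration, String> {
    let s = s.trim();
    // Split at the first non-digit character: "10m" -> ("10", "m").
    let split = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (num, unit) = s.split_at(split);
    let n: u64 = num
        .parse()
        .map_err(|_| format!("invalid number in '{s}'"))?;
    let secs_per_unit = match unit {
        "s" => 1,
        "m" => 60,
        "h" => 3600,
        _ => return Err(format!("unknown unit '{unit}' in '{s}'")),
    };
    // Guard against overflow on very large inputs.
    n.checked_mul(secs_per_unit)
        .map(Duration::from_secs)
        .ok_or_else(|| format!("duration '{s}' overflows"))
}
```

With this sketch, the default `10m` corresponds to `Duration::from_secs(600)`, and a bare number without a unit is rejected rather than guessed at.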