Describe the bug
We hit this in production: a volume resize from 150 GiB to 250 GiB got stuck indefinitely because one of the three replicas failed to expand (likely a transient gRPC issue) while the other two succeeded. Since then, the reconciler has retried resize_nexus every 5 minutes, and every attempt fails because the one undersized replica is never re-expanded. The volume stayed in this broken state for 7+ hours until we intervened manually.
The core issue: fixup_nexus_size in the nexus reconciler only retries the nexus resize — it never checks whether all child replicas are actually at the required size before doing so.
To Reproduce
- Have a 3-replica volume
- Trigger a volume resize (e.g., 150 GiB → 250 GiB)
- Induce a failure on one replica's resize (transient gRPC timeout, io-engine briefly unreachable, etc.) while the other two succeed
- Volume spec gets committed at 250 GiB
- Reconciler picks up the nexus size mismatch and retries resize_nexus — fails with:
Child nvmf://10.x.x.x:8420/nqn.2019-05.io.openebs: of nexus is too small: size = 314572800 x 512, required = 524288000 x 512
- This repeats every 5 minutes forever. No resize_replica is ever issued for the stuck replica.
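For anyone reading the error message: the sizes are block counts at a 512-byte block size, and they map exactly onto the old and new volume sizes. A minimal sketch of the arithmetic (the helper name `blocks_to_gib` is made up for illustration and is not part of Mayastor):

```rust
/// Bytes per GiB.
const GIB: u64 = 1 << 30;

/// Converts a block count at the given block size into whole GiB.
/// Returns None if the multiplication overflows or the byte count
/// is not an exact multiple of one GiB.
/// NOTE: hypothetical helper, used only to decode the error message.
fn blocks_to_gib(blocks: u64, block_size: u64) -> Option<u64> {
    let bytes = blocks.checked_mul(block_size)?;
    (bytes % GIB == 0).then(|| bytes / GIB)
}
```

`blocks_to_gib(314_572_800, 512)` gives `Some(150)` and `blocks_to_gib(524_288_000, 512)` gives `Some(250)`, so the failing child is still at the pre-resize 150 GiB while the nexus needs 250 GiB.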
Expected behavior
The reconciler should notice the undersized replica and expand it before retrying the nexus resize. The volume should self-heal.
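A rough sketch of the ordering we have in mind, not the actual control-plane code. The `Replica` struct, the `ResizeOps` trait and `reconcile_resize` are all hypothetical stand-ins for whatever the reconciler really uses; only the names resize_replica and resize_nexus come from the behaviour described above.

```rust
/// Hypothetical view of a replica as the reconciler might see it.
#[derive(Debug, Clone, PartialEq)]
struct Replica {
    id: String,
    size_bytes: u64,
}

/// Hypothetical abstraction over the two resize calls; the real
/// control plane has its own (async, gRPC-backed) interfaces.
trait ResizeOps {
    fn resize_replica(&mut self, id: &str, size_bytes: u64) -> Result<(), String>;
    fn resize_nexus(&mut self, size_bytes: u64) -> Result<(), String>;
}

/// Expands every undersized replica first, then resizes the nexus.
/// Returns the ids of the replicas that were expanded in this pass.
/// Any failure aborts the pass so the next reconcile can retry it.
fn reconcile_resize<O: ResizeOps>(
    ops: &mut O,
    replicas: &[Replica],
    required_bytes: u64,
) -> Result<Vec<String>, String> {
    let mut expanded = Vec::new();
    for replica in replicas.iter().filter(|r| r.size_bytes < required_bytes) {
        ops.resize_replica(&replica.id, required_bytes)?;
        expanded.push(replica.id.clone());
    }
    // Only reached once no child is smaller than the required size.
    ops.resize_nexus(required_bytes)?;
    Ok(expanded)
}
```

With this ordering, a replica that missed the original resize is picked up on the next reconcile pass instead of blocking the nexus resize indefinitely, and a replica that fails again simply leaves the pass to be retried.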
OS info
- Ubuntu 22.04
- Mayastor 2.10.0
- 6-node io-engine cluster, 3-replica volumes
- Kubernetes 1.28+
Additional context
One thing that made debugging harder: there's no REST endpoint to manually resize a replica (/v0/replicas/{id}/resize returns 404). The only workaround we found was deleting the replica and letting the rebuild happen at the correct size. Exposing a replica resize endpoint would be a nice operational escape hatch too.