Volume resize reconciler retries only nexus resize, never re-attempts failed replica resize

**Describe the bug**
We hit this in production: a volume resize from 150 GiB to 250 GiB got stuck permanently because one of the three replicas failed to expand (likely a transient gRPC issue) while the other two succeeded. Since then, the reconciler has retried `resize_nexus` every 5 minutes, and every attempt fails because the one undersized replica is never re-expanded. The volume sat in this broken state for 7+ hours until we intervened manually.

The core issue: `fixup_nexus_size` in the nexus reconciler only retries the nexus resize. It never checks whether all child replicas are actually at the required size before doing so.

**To Reproduce**

1. Have a 3-replica volume
2. Trigger a volume resize (e.g., 150 GiB → 250 GiB)
3. Induce a failure on one replica's resize (transient gRPC timeout, io-engine briefly unreachable, etc.) while the other two succeed
4. Volume spec gets committed at 250 GiB
5. Reconciler picks up the nexus size mismatch and retries `resize_nexus`, which fails with:
   `Child nvmf://10.x.x.x:8420/nqn.2019-05.io.openebs:<replica-uuid> of nexus <nexus-uuid> is too small: size = 314572800 x 512, required = 524288000 x 512`
6. This repeats every 5 minutes forever. No `resize_replica` is ever issued for the stuck replica.
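
For anyone reading the error in step 5: the sizes are block counts, and the `x 512` suffix is the block size in bytes. The small sketch below (the helper name `blocks_to_gib` is ours, purely for illustration) converts them back to GiB and confirms that the stuck child is still at the old 150 GiB while the nexus requires 250 GiB.

```rust
/// Block size as printed in the error message ("x 512").
const BLOCK_SIZE: u64 = 512;
/// Bytes per GiB.
const GIB: u64 = 1024 * 1024 * 1024;

/// Convert a size expressed in 512-byte blocks to whole GiB.
/// Illustrative helper only; not part of Mayastor.
fn blocks_to_gib(blocks: u64) -> u64 {
    blocks * BLOCK_SIZE / GIB
}
```

With the values from the log, `blocks_to_gib(314572800)` gives 150 and `blocks_to_gib(524288000)` gives 250, matching the original and target volume sizes.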

**Expected behavior**
The reconciler should notice the undersized replica and expand it before retrying the nexus resize. The volume should self-heal.
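
A rough sketch of the ordering we have in mind is below. Everything in it is hypothetical: the `Replica` struct, the `Resizer` trait, the `FakeIo` test double, and the `fixup_volume_size` function are simplified stand-ins we made up to illustrate the idea, not the control-plane's actual types or signatures.

```rust
/// Hypothetical, simplified view of a replica (sizes in bytes).
#[derive(Debug, Clone, PartialEq)]
struct Replica {
    id: String,
    size: u64,
}

/// Hypothetical stand-in for the calls the reconciler makes to io-engine.
trait Resizer {
    fn resize_replica(&mut self, id: &str, size: u64) -> Result<(), String>;
    fn resize_nexus(&mut self, size: u64) -> Result<(), String>;
}

/// Expand every undersized replica first, then resize the nexus.
/// Returns on the first error so the next reconcile pass can retry
/// from whatever state was reached.
fn fixup_volume_size<R: Resizer>(
    io: &mut R,
    replicas: &mut [Replica],
    required: u64,
) -> Result<(), String> {
    for replica in replicas.iter_mut().filter(|r| r.size < required) {
        io.resize_replica(&replica.id, required)?;
        replica.size = required;
    }
    io.resize_nexus(required)
}

/// In-memory fake that records the order of calls, for demonstration only.
#[derive(Default)]
struct FakeIo {
    calls: Vec<String>,
}

impl Resizer for FakeIo {
    fn resize_replica(&mut self, id: &str, _size: u64) -> Result<(), String> {
        self.calls.push(format!("resize_replica {id}"));
        Ok(())
    }

    fn resize_nexus(&mut self, _size: u64) -> Result<(), String> {
        self.calls.push("resize_nexus".to_string());
        Ok(())
    }
}
```

In our scenario (two replicas at 250 GiB, one at 150 GiB), this ordering would issue a single `resize_replica` for the stuck replica and only then retry `resize_nexus`. Replicas already at the required size are skipped, so repeating the pass is harmless.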


**OS info (please complete the following information):**

- Ubuntu 22.04
- Mayastor 2.10.0
- 6-node io-engine cluster, 3-replica volumes
- Kubernetes 1.28+

**Additional context**
One thing that made debugging harder: there's no REST endpoint to manually resize a replica (`/v0/replicas/{id}/resize` returns 404). The only workaround we found was deleting the replica and letting the rebuild happen at the correct size. Exposing a replica resize endpoint would be a nice operational escape hatch too.


Volume resize reconciler retries only nexus resize, never re-attempts failed replica resize #1989

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Volume resize reconciler retries only nexus resize, never re-attempts failed replica resize #1989

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions