Volume resize reconciler retries only nexus resize, never re-attempts failed replica resize #1989

@cmrmahesh

Describe the bug
We hit this in production: a volume resize from 150 GiB to 250 GiB got stuck permanently because one of the three replicas failed to expand (likely a transient gRPC issue) while the other two succeeded. Since then, the reconciler has retried resize_nexus every 5 minutes, and every attempt fails because the one undersized replica is never re-expanded. The volume sat in this broken state for 7+ hours until we manually intervened.

The core issue: fixup_nexus_size in the nexus reconciler only retries the nexus resize — it never checks whether all child replicas are actually at the required size before doing so.

To Reproduce

  1. Have a 3-replica volume
  2. Trigger a volume resize (e.g., 150 GiB → 250 GiB)
  3. Induce a failure on one replica's resize (transient gRPC timeout, io-engine briefly unreachable, etc.) while the other two succeed
  4. Volume spec gets committed at 250 GiB
  5. Reconciler picks up the nexus size mismatch and retries resize_nexus — fails with:
    Child nvmf://10.x.x.x:8420/nqn.2019-05.io.openebs: of nexus is too small: size = 314572800 x 512, required = 524288000 x 512
  6. This repeats every 5 minutes forever. No resize_replica is ever issued for the stuck replica.
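
As a quick sanity check on the error in step 5, the block counts it prints can be converted back to GiB. This is only arithmetic on the numbers in the message, assuming "x 512" means 512-byte blocks; the helper name blocks_to_gib is made up for illustration.

```python
# Convert the block counts from the error message into GiB.
# Assumption: "x 512" in the message means 512-byte blocks.
BLOCK_SIZE = 512
GIB = 1024 ** 3


def blocks_to_gib(blocks: int, block_size: int = BLOCK_SIZE) -> float:
    """Return the size in GiB of `blocks` blocks of `block_size` bytes."""
    return blocks * block_size / GIB


child_gib = blocks_to_gib(314572800)     # size reported for the stuck child
required_gib = blocks_to_gib(524288000)  # size the nexus resize requires
print(child_gib, required_gib)           # prints: 150.0 250.0
```

So the child named in the error is still at the old 150 GiB size, while the nexus resize needs 250 GiB from every child.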

Expected behavior
The reconciler should notice the undersized replica and expand it before retrying the nexus resize. The volume should self-heal.
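
A rough sketch of the ordering we have in mind, written as a toy Python model rather than the control-plane's actual code. The Replica and Nexus classes, reconcile_size, and the signatures of resize_replica and resize_nexus are all hypothetical stand-ins; only the operation names resize_replica and resize_nexus come from this report.

```python
from dataclasses import dataclass, field

GIB = 1024 ** 3


@dataclass
class Replica:  # hypothetical model, not a Mayastor type
    name: str
    size: int  # bytes


@dataclass
class Nexus:  # hypothetical model, not a Mayastor type
    size: int  # bytes
    children: list = field(default_factory=list)


def resize_replica(replica: Replica, target: int) -> None:
    """Stand-in for the real replica resize call (always succeeds here)."""
    replica.size = target


def resize_nexus(nexus: Nexus, target: int) -> None:
    """Stand-in for the real nexus resize: refuses if any child is too small."""
    too_small = [r.name for r in nexus.children if r.size < target]
    if too_small:
        raise ValueError(f"children too small: {too_small}")
    nexus.size = target


def reconcile_size(nexus: Nexus, target: int) -> None:
    """Expand undersized replicas first, then retry the nexus resize."""
    for replica in nexus.children:
        if replica.size < target:
            resize_replica(replica, target)
    if nexus.size < target:
        resize_nexus(nexus, target)


# The state from this report: two replicas expanded, one left at 150 GiB.
replicas = [
    Replica("r1", 250 * GIB),
    Replica("r2", 250 * GIB),
    Replica("r3", 150 * GIB),
]
nexus = Nexus(150 * GIB, replicas)
reconcile_size(nexus, 250 * GIB)
```

In this model, calling resize_nexus on its own against the same starting state raises every time, which mirrors the loop we observed. With the replica check in front, a single pass converges, and a replica resize that fails again would simply be retried on the next reconcile pass.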

OS info

  • Ubuntu 22.04
  • Mayastor 2.10.0
  • 6-node io-engine cluster, 3-replica volumes
  • Kubernetes 1.28+

Additional context
One thing that made debugging harder: there's no REST endpoint to manually resize a replica (/v0/replicas/{id}/resize returns 404). The only workaround we found was deleting the replica and letting the rebuild happen at the correct size. Exposing a replica resize endpoint would be a nice operational escape hatch too.
