Cross-pool snapshot restore fallback when source pool has insufficient space

**Is your feature request related to a problem? Please describe.**

When creating a volume from a snapshot (CSI `CreateVolume` with a snapshot source), Mayastor requires the new volume to be placed on the **same pool** as the snapshot's replica, because CoW snapshots are blobstore-local. If that pool is full, the operation fails with `507 Insufficient Storage`, even when other pools in the cluster have terabytes of free space.

In our production KubeVirt environment, we use golden image snapshots to clone VM disks. The golden image's replicas were on a pool with 3.5 TiB used out of 3.5 TiB capacity. Every VM provisioning attempt failed, while other pools on the same node and on other nodes had 2-7 TiB free. The CSI driver kept retrying the same doomed restore for 38+ minutes (144 retry events) until we intervened manually.

**Describe the solution you'd like**

Add fallback logic in the `CreateVolume` handler when restoring from a snapshot:

1. Identify the pools that hold the snapshot's replicas
2. For each of those pools, compare available capacity against the requested volume size
3. If any of them has sufficient space → proceed with local CoW restore (existing fast path, instant)
4. If ALL snapshot replica pools have insufficient space → fall back to "full copy restore":
   - Allocate the new volume on a different pool that matches the volume's topology constraints and has sufficient capacity
   - Copy data from the snapshot replica to the new volume's replica over the network
   - Complete the `CreateVolume` call successfully
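
The decision in steps 1-4 can be sketched as a small placement function. This is an illustrative Python sketch only, not Mayastor code: `Pool`, `RestorePlan`, `RestoreMode`, `InsufficientCapacity`, and `plan_restore` are hypothetical names, and `candidate_pools` is assumed to be already filtered by the volume's topology constraints. Picking the pool with the most free space is also an assumption; the real scheduler may rank pools differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RestoreMode(Enum):
    LOCAL_COW = "local-cow"   # existing fast path, metadata only
    FULL_COPY = "full-copy"   # proposed fallback, network copy


@dataclass(frozen=True)
class Pool:
    """Hypothetical view of a pool: just what the decision needs."""
    name: str
    node: str
    free_bytes: int


@dataclass(frozen=True)
class RestorePlan:
    mode: RestoreMode
    target: Pool                   # pool that will host the new replica
    source: Optional[Pool] = None  # snapshot pool to copy from (full copy only)


class InsufficientCapacity(Exception):
    """No pool, local or remote, can hold the requested volume."""


def plan_restore(snapshot_pools: List[Pool],
                 candidate_pools: List[Pool],
                 size_bytes: int) -> RestorePlan:
    if not snapshot_pools:
        raise ValueError("snapshot has no replica pools")

    # Steps 1-3: prefer a snapshot replica pool that still has room.
    local = [p for p in snapshot_pools if p.free_bytes >= size_bytes]
    if local:
        return RestorePlan(RestoreMode.LOCAL_COW,
                           max(local, key=lambda p: p.free_bytes))

    # Step 4: every snapshot pool is too full, so look elsewhere.
    snapshot_names = {p.name for p in snapshot_pools}
    remote = [p for p in candidate_pools
              if p.name not in snapshot_names and p.free_bytes >= size_bytes]
    if not remote:
        raise InsufficientCapacity(
            f"no pool has {size_bytes} bytes free for the restore")

    # Assumption: copy from the first snapshot replica; any healthy one would do.
    return RestorePlan(RestoreMode.FULL_COPY,
                       max(remote, key=lambda p: p.free_bytes),
                       source=snapshot_pools[0])
```

With a full snapshot pool and two other pools that have 2 TiB and 7 TiB free, `plan_restore` returns a `FULL_COPY` plan targeting the 7 TiB pool. It raises `InsufficientCapacity` only when no pool in the candidate set can hold the volume.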

This is architecturally feasible because Mayastor already has the building blocks:
- **Cross-node/cross-pool data copy**: used during replica rebuilds (self-heal)
- **Network-based replication**: replicas are routinely copied between pools on different nodes
- **HotSpareReconciler infrastructure**: detects a missing replica and copies data from a healthy replica to a new one on a different pool

What's missing is the logic to use that same mechanism during snapshot restore. The `CreateVolume`-from-snapshot handler currently hardcodes the assumption that the restore must happen on the same pool as the snapshot.

**Describe alternatives you've considered**

1. **Return a non-retriable error code**: If cross-pool restore is too complex, return `FailedPrecondition` instead of `ResourceExhausted` so the external-provisioner stops retrying and the orchestrator (CDI) can take alternative action. Currently `ResourceExhausted` is treated as retriable, causing infinite retry loops.

2. **CDI-level fallback**: I've filed a comment on [kubevirt/containerized-data-importer#4068](https://github.com/kubevirt/containerized-data-importer/issues/4068) requesting CDI to fall back to host-assisted copy when snapshot clone fails. This works but is slower (involves a copy pod) and doesn't leverage Mayastor's native replication.

3. **Operational workaround**: Ensure source image pools always have free space. This is what we do today, but it's fragile and requires constant capacity monitoring.
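
Alternative 1 amounts to changing which gRPC status the handler returns when the snapshot's pools are full. A minimal Python sketch of that mapping follows. The numeric values are the standard gRPC status codes, but `GrpcCode`, `RETRIED_BY_PROVISIONER`, and `restore_failure_code` are hypothetical names, and the retry set reflects only the behaviour observed in this issue, not a documented contract.

```python
from enum import IntEnum
from typing import Iterable, Optional


class GrpcCode(IntEnum):
    # Standard gRPC status code values.
    RESOURCE_EXHAUSTED = 8
    FAILED_PRECONDITION = 9


# Assumption: codes the provisioner keeps retrying, as observed in this issue.
RETRIED_BY_PROVISIONER = {GrpcCode.RESOURCE_EXHAUSTED}


def restore_failure_code(snapshot_pool_free_bytes: Iterable[int],
                         size_bytes: int) -> Optional[GrpcCode]:
    """Return None if a same-pool restore fits, else the code to report."""
    if any(free >= size_bytes for free in snapshot_pool_free_bytes):
        return None
    # Today this case surfaces as RESOURCE_EXHAUSTED; alternative 1 reports
    # FAILED_PRECONDITION so the caller can stop retrying and fall back.
    return GrpcCode.FAILED_PRECONDITION
```

Under this sketch, a restore onto a full snapshot pool yields `FAILED_PRECONDITION`, which is not in `RETRIED_BY_PROVISIONER`, so the orchestrator can take its alternative path instead of looping.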

**Additional context**

| Approach | Speed | Space requirement |
|----------|-------|-------------------|
| Same-pool CoW restore (current) | Instant (metadata only) | Same pool must have space |
| Cross-pool full copy (proposed fallback) | Slower (network copy) | Any pool with space works |

The fallback is slower but **succeeds**, which is better than indefinite failure.

**Environment:**
- Mayastor: v2.10.0
- Kubernetes: v1.34.3
- CDI: v1.64.0
- Pool configuration: 14 pools across 6 nodes (3.5 TiB and 7 TiB pools)

**Related:** #1895 (snapshot rebuild when a pool is offline: a similar problem with a different trigger)
