Commit 591ed01

docs: record BackupOps wiring and the integration suite in the design doc
1 parent 7e2678b commit 591ed01

1 file changed

Lines changed: 36 additions & 0 deletions

docs/plans/2026-06-11-grpc-process-split-design.md

@@ -270,3 +270,39 @@ capabilities, gRPC idiom) was implemented as a commit series on this branch:
  nameservers still materialise supervisor-side.
- Spontaneous guest-death detection feeding WatchEvents (no component
  observes VMM process exit today).

## 8. BackupOps + integration suite (2026-06-11, third pass)

**BackupOps wired.** The six `BackupOps` methods were stubs; the gRPC
plumbing (proto RPCs, server handlers, client methods, conversions) already
existed. `InProcessSupervisor` now implements them on top of
`controllers/qemu/backup.py` (the machinery the agent's operator HTTP views
use):

- One async backup job per VM, serialized per-VM against restore;
  idempotent against a running job and against a non-expired archive
  (24h TTL, mirroring the operator endpoint). Optional best-effort guest
  fs-freeze (`quiesce_guest`).
- Backups cover the rootfs disk only, symmetric with what restore can put
  back. Supervisor-issued backup ids use microsecond timestamps (the id is
  the tar stem, so a retry after a failure must get a fresh id).
- Completed archives live on disk as the source of truth; only in-flight
  and failed runs are held in memory. Download streams 1 MiB offset-tagged
  chunks. Restore extracts the rootfs member (streamed from that single
  member, no `extractall`), verifies it, stops the VM with forget-on-stop
  defused, swaps the rootfs atomically and restarts; it emits down-then-up
  events.

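The download and restore steps in the last bullet can be sketched roughly as
follows. This is an illustration under stated assumptions, not the code in
`controllers/qemu/backup.py`: the names `iter_chunks`, `restore_member` and
the `.restore-tmp` suffix are hypothetical, and the size comparison stands in
for whatever verification the real restore performs. Stopping and restarting
the VM is left out.

```python
import os
import shutil
import tarfile
from pathlib import Path
from typing import Iterator, Tuple

CHUNK_SIZE = 1024 * 1024  # 1 MiB, matching the chunk size described above


def iter_chunks(archive: Path, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, data) pairs so a receiver can detect gaps or reordering."""
    offset = 0
    with archive.open("rb") as fh:
        while data := fh.read(chunk_size):
            yield offset, data
            offset += len(data)


def restore_member(archive: Path, member_name: str, rootfs: Path) -> None:
    """Stream one tar member to a temp file beside the rootfs, then swap it in."""
    tmp = rootfs.with_name(rootfs.name + ".restore-tmp")  # hypothetical suffix
    with tarfile.open(archive, "r") as tar:
        member = tar.getmember(member_name)  # raises KeyError if absent
        if not member.isfile():
            raise ValueError(f"{member_name!r} is not a regular file")
        src = tar.extractfile(member)
        assert src is not None  # guaranteed for regular files
        # Copy only this member; nothing else in the archive touches the disk.
        with src, tmp.open("wb") as dst:
            shutil.copyfileobj(src, dst, CHUNK_SIZE)
            dst.flush()
            os.fsync(dst.fileno())
    # Stand-in verification (assumption): compare against the tar header size.
    if tmp.stat().st_size != member.size:
        tmp.unlink()
        raise ValueError("size mismatch after extraction")
    # os.replace is atomic when source and target share a filesystem.
    os.replace(tmp, rootfs)
```

Writing the temp file next to the target keeps both on the same filesystem,
which is what makes the final `os.replace` an atomic swap; reading a single
named member avoids the path-traversal exposure of `extractall`.
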
**Integration suite** (`tests/integration/`, opt-in via `AVM_ITEST=1`):
drives a real supervisor daemon over its UDS gRPC contract, agent-free,
with specs built inline from local artifacts. Self-gating: Firecracker tests
run unprivileged (vsock-channel reachability); QEMU tests need root + a cloud
image (IP/SSH reachability, persistent lifecycle via a systemd drop-in that
points `aleph-vm-controller@` at the source tree under test). Covers
creation, management (logs/reboot/events/port-forwards/stop-start),
deletion + resource release (processes, files, TAPs, nftables, units), and
the full backup→mutate→restore cycle.

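The self-gating rule above could be expressed as a small pure function along
these lines. It is a sketch only: `skip_reason` and its parameters are
hypothetical and not taken from `tests/integration/`; only the `AVM_ITEST=1`
opt-in and the root + cloud image requirement for QEMU come from the text.

```python
from pathlib import Path
from typing import Mapping, Optional


def skip_reason(
    hypervisor: str,
    env: Mapping[str, str],
    euid: int,
    cloud_image: Optional[Path],
) -> Optional[str]:
    """Return why a test should be skipped, or None if it may run."""
    if env.get("AVM_ITEST") != "1":
        return "integration suite is opt-in: set AVM_ITEST=1"
    if hypervisor == "qemu":
        if euid != 0:
            return "QEMU tests need root"
        if cloud_image is None or not cloud_image.is_file():
            return "QEMU tests need a cloud image"
    # Firecracker tests run unprivileged, so no further gate applies here.
    return None
```

In a pytest suite the result would typically feed `pytest.skip(reason)` from
a fixture, called with `os.environ` and `os.geteuid()`, so an unprivileged
run still executes the Firecracker tests and reports the QEMU ones as skipped.
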
**Found by the suite:** the pool's forget-on-stop task deleted by hash, not
identity, so a reboot (or delete+create) that recreated the VM under the same
`vm_id` could have its new execution removed from the pool by the old
execution's reap task. Fixed in `_schedule_forget_on_stop`.
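
The difference between deleting by hash and deleting by identity can be
illustrated with a toy pool. The `Execution` and `Pool` classes below are
hypothetical stand-ins, not the project's types, and the actual fix in
`_schedule_forget_on_stop` may look different.

```python
from typing import Dict


class Execution:
    """Hypothetical stand-in for one run of a VM."""

    def __init__(self, vm_id: str) -> None:
        self.vm_id = vm_id


class Pool:
    """Hypothetical pool keyed by vm_id."""

    def __init__(self) -> None:
        self.executions: Dict[str, Execution] = {}

    def add(self, execution: Execution) -> None:
        self.executions[execution.vm_id] = execution

    def forget_by_hash(self, execution: Execution) -> None:
        # Buggy variant: removes whatever currently sits under the key,
        # even if it is a newer execution of the same VM.
        self.executions.pop(execution.vm_id, None)

    def forget_by_identity(self, execution: Execution) -> None:
        # Safer variant: remove only if the slot still holds this object.
        if self.executions.get(execution.vm_id) is execution:
            del self.executions[execution.vm_id]
```

With an old and a new execution sharing one `vm_id`, a late
`forget_by_identity(old)` leaves the new execution in place, while
`forget_by_hash(old)` would evict it.
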
