@@ -270,3 +270,39 @@ capabilities, gRPC idiom) was implemented as a commit series on this branch:
270270 nameservers still materialise supervisor-side.
271271- Spontaneous guest-death detection feeding WatchEvents (no component
272272 observes VMM process exit today).
273+
274+ ## 8. BackupOps + integration suite (2026-06-11, third pass)
275+
276+ **BackupOps wired.** The six `BackupOps` methods were stubs; the gRPC
277+ plumbing (proto RPCs, server handlers, client methods, conversions) already
278+ existed. `InProcessSupervisor` now implements them on top of
279+ `controllers/qemu/backup.py` (the machinery the agent's operator HTTP views
280+ use):
281+
282+ - One async backup job per VM, serialized per-VM against restore;
283+ idempotent against a running job and against a non-expired archive
284+ (24h TTL, mirroring the operator endpoint). Optional best-effort guest
285+ fs-freeze (`quiesce_guest`).
286+ - Backups cover the rootfs disk only — symmetric with what restore can put
287+ back. Supervisor-issued backup ids use microsecond timestamps (id = tar
288+ stem; a retry after a failure must get a fresh id).
289+ - Completed archives live on disk as the source of truth; only in-flight
290+ and failed runs are held in memory. Download streams 1 MiB offset-tagged
291+ chunks. Restore extracts the rootfs member (member-streamed, no
292+ extractall), verifies it, stops the VM with forget-on-stop defused, swaps
293+ the rootfs atomically and restarts; emits down-then-up events.
294+
295+ **Integration suite** (`tests/integration/`, opt-in via `AVM_ITEST=1`):
296+ drives a real supervisor daemon over its UDS gRPC contract, agent-free —
297+ specs built inline from local artifacts. Self-gating: Firecracker tests run
298+ unprivileged (vsock-channel reachability); QEMU tests need root + a cloud
299+ image (IP/SSH reachability, persistent lifecycle via a systemd drop-in that
300+ points `aleph-vm-controller@` at the source tree under test). Covers
301+ creation, management (logs/reboot/events/port-forwards/stop-start),
302+ deletion + resource release (processes, files, TAPs, nftables, units), and
303+ the full backup→mutate→restore cycle.
304+
305+ **Found by the suite:** the pool's forget-on-stop task deleted by hash, not
306+ identity — a reboot (or delete+create) that recreated the VM under the same
307+ vm_id could have its new execution removed from the pool by the old
308+ execution's reap task. Fixed in `_schedule_forget_on_stop`.
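
The forget-on-stop race described in the last paragraph can be illustrated with a minimal sketch. This is not the code in `_schedule_forget_on_stop`; `Execution`, `Pool`, `forget_by_key`, `forget_by_identity`, `reap_when_stopped` and `recreate_then_reap` are hypothetical names chosen for illustration, and the pool is assumed to be a plain dict keyed by vm_id. It only shows why removing by key can drop a newer execution, while an identity check cannot.

```python
import asyncio


class Execution:
    """Hypothetical stand-in for a VM execution (illustration only)."""

    def __init__(self, vm_id: str) -> None:
        self.vm_id = vm_id
        self.stopped = asyncio.Event()


class Pool:
    """Hypothetical pool, assumed to be a dict keyed by vm_id."""

    def __init__(self) -> None:
        self.executions: dict[str, Execution] = {}

    def add(self, execution: Execution) -> None:
        self.executions[execution.vm_id] = execution

    def forget_by_key(self, execution: Execution) -> None:
        # Buggy variant: removes whatever currently sits under the key,
        # even if it is a newer execution of the same VM.
        self.executions.pop(execution.vm_id, None)

    def forget_by_identity(self, execution: Execution) -> None:
        # Safer variant: only remove the entry if it is still this exact object.
        if self.executions.get(execution.vm_id) is execution:
            del self.executions[execution.vm_id]


async def reap_when_stopped(execution: Execution, forget) -> None:
    """Wait for the execution to stop, then ask the pool to forget it."""
    await execution.stopped.wait()
    forget(execution)


async def recreate_then_reap(use_identity: bool) -> bool:
    """Return True if the recreated execution survives the old reap task."""
    pool = Pool()
    old = Execution("vm-1")
    pool.add(old)
    forget = pool.forget_by_identity if use_identity else pool.forget_by_key
    task = asyncio.create_task(reap_when_stopped(old, forget))

    # The VM is recreated under the same vm_id before the old reap task runs.
    new = Execution("vm-1")
    pool.add(new)

    old.stopped.set()
    await task
    return pool.executions.get("vm-1") is new


if __name__ == "__main__":
    print("by key:", asyncio.run(recreate_then_reap(use_identity=False)))
    print("by identity:", asyncio.run(recreate_then_reap(use_identity=True)))
```

In this sketch the by-key variant loses the recreated execution and the identity-checked variant keeps it, which matches the failure mode the suite surfaced.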